Btrfs: make defrag not fragment files when using prealloc extents
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.
Example:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0
1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
Before defragmenting the file:
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 0 nr 4096
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 4096 nr 102400 ram
1048576
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 106496 nr 90112
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 196608 nr 106496 ram
1048576
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 897024 nr 106496 ram
1048576
extent compression 0
item 12 key (257 EXTENT_DATA
1003520) itemoff 15492 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset
1003520 nr 45056
(...)
Now defragmenting the file results in more data space used than before:
$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 0 nr 4096 ram
1048576
extent compression 0
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte
13893632 nr 102400
extent data offset 0 nr 102400 ram 102400
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 106496 nr 90112 ram
1048576
extent compression 0
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte
13996032 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte
14102528 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 12 key (257 EXTENT_DATA
1003520) itemoff 15492 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset
1003520 nr 45056 ram
1048576
extent compression 0
(...)
With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte
12845056.
In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>