Chris Mason [Mon, 27 Feb 2017 21:11:53 +0000 (13:11 -0800)]
Merge branch 'for-chris-4.11' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.11
Filipe Manana [Fri, 17 Feb 2017 18:43:57 +0000 (18:43 +0000)]
Btrfs: try harder to migrate items to left sibling before splitting a leaf
Before attempting to split a leaf we try to migrate items from the leaf to
its right and left siblings. We start by trying to move items into the
rigth sibling and, if the new item is meant to be inserted at the end of
our leaf, we try to free from our leaf an amount of bytes equal to the
number of bytes used by the new item, by setting the variable space_needed
to the byte size of that new item. However if we fail to move enough items
to the right sibling due to lack of space in that sibling, we then try
to move items into the left sibling, and in that case we try to free
an amount equal to the size of the new item from our leaf, when we need
only to free an amount corresponding to the size of the new item minus
the current free space of our leaf. So make sure that before we try to
move items to the left sibling we do set the variable space_needed with
a value corresponding to the new item's size minus the leaf's current
free space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Tue, 14 Feb 2017 16:56:01 +0000 (16:56 +0000)]
Btrfs: fix data loss after truncate when using the no-holes feature
If we have a file with an implicit hole (NO_HOLES feature enabled) that
has an extent following the hole, delayed writes against regions of the
file behind the hole happened before but were not yet flushed and then
we truncate the file to a smaller size that lies inside the hole, we
end up persisting a wrong disk_i_size value for our inode that leads to
data loss after umounting and mounting again the filesystem or after
the inode is evicted and loaded again.
This happens because at inode.c:btrfs_truncate_inode_items() we end up
setting last_size to the offset of the extent that we deleted and that
followed the hole. We then pass that value to btrfs_ordered_update_i_size()
which updates the inode's disk_i_size to a value smaller then the offset
of the buffered (delayed) writes.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo
$ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo
$ xfs_io -c "truncate 60K" /mnt/foo
--> inode's disk_i_size updated to 0
$ md5sum /mnt/foo
3c5ca3c3ab42f4b04d7e7eb0b0d4d806 /mnt/foo
$ umount /dev/sdb
$ mount /dev/sdb /mnt
$ md5sum /mnt/foo
d41d8cd98f00b204e9800998ecf8427e /mnt/foo
--> Empty file, all data lost!
Cc: <stable@vger.kernel.org> # 3.14+
Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Tue, 14 Feb 2017 17:56:32 +0000 (17:56 +0000)]
Btrfs: incremental send, fix unnecessary hole writes for sparse files
When using the NO_HOLES feature, during an incremental send we often issue
write operations for holes when we should not, because that range is already
a hole in the destination snapshot. While that does not change the contents
of the file at the receiver, it avoids preservation of file holes, leading
to wasted disk space and extra IO during send/receive.
A couple examples where the holes are not preserved follows.
$ mkfs.btrfs -O no-holes -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
# Now add one new extent to our first test file, increasing its size and
# leaving a 1Mb hole between the first extent and this new extent.
$ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo
# Now overwrite the last extent of our second test file.
$ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
/mnt/snap2/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 25088..25095 8 0x2000
1: [8..2055]: hole 2048
2: [2056..2063]: 24576..24583 8 0x2001
$ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
/mnt/snap2/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 25096..25103 8 0x2000
1: [8..2055]: hole 2048
2: [2056..2063]: 24584..24591 8 0x2001
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
# It's not relevant to enable no-holes in the new filesystem.
$ mkfs.btrfs -O no-holes -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive /mnt -f /tmp/2.snap
$ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
/mnt/snap2/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 24576..24583 8 0x2000
1: [8..2063]: 25624..27679 2056 0x1
$ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
/mnt/snap2/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 24584..24591 8 0x2000
1: [8..2063]: 27680..29735 2056 0x1
The holes do not exist in the second filesystem and they were replaced
with extents filled with the byte 0x00, making each file take 1032Kb of
space instead of 8Kb.
So fix this by not issuing the write operations consisting of buffers
filled with the byte 0x00 when the destination snapshot already has a
hole for the respective range.
A test case for fstests will follow soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Sat, 4 Feb 2017 17:12:00 +0000 (17:12 +0000)]
Btrfs: fix use-after-free due to wrong order of destroying work queues
Before we destroy all work queues (and wait for their tasks to complete)
we were destroying the work queues used for metadata I/O operations, which
can result in a use-after-free problem because most tasks from all work
queues do metadata I/O operations. For example, the tasks from the caching
workers work queue (fs_info->caching_workers), which is destroyed only
after the work queue used for metadata reads (fs_info->endio_meta_workers)
is destroyed, do metadata reads, which result in attempts to queue tasks
into the later work queue, triggering a use-after-free with a trace like
the following:
[23114.613543] general protection fault: 0000 [#1] PREEMPT SMP
[23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic
acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16
jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug]
[23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs]
[23114.616932] task:
ffff880221d45780 task.stack:
ffffc9000bc50000
[23114.616932] RIP: 0010:[<
ffffffffa037c1bf>] [<
ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932] RSP: 0018:
ffff88023f443d60 EFLAGS:
00010246
[23114.616932] RAX:
0000000000000000 RBX:
6b6b6b6b6b6b6b6b RCX:
0000000000000102
[23114.616932] RDX:
ffffffffa0419000 RSI:
ffff88011df534f0 RDI:
ffff880101f01c00
[23114.616932] RBP:
ffff88023f443d80 R08:
00000000000f7000 R09:
000000000000ffff
[23114.616932] R10:
ffff88023f443d48 R11:
0000000000001000 R12:
ffff88011df534f0
[23114.616932] R13:
ffff880135963868 R14:
0000000000001000 R15:
0000000000001000
[23114.616932] FS:
0000000000000000(0000) GS:
ffff88023f440000(0000) knlGS:
0000000000000000
[23114.616932] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[23114.616932] CR2:
00007f0fb9f8e520 CR3:
0000000001a0b000 CR4:
00000000000006e0
[23114.616932] Stack:
[23114.616932]
ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000
[23114.616932]
ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868
[23114.616932]
ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0
[23114.616932] Call Trace:
[23114.616932] <IRQ> [23114.616932] [<
ffffffffa03470af>] end_workqueue_bio+0xd5/0xda [btrfs]
[23114.616932] [<
ffffffff8125293c>] bio_endio+0x54/0x57
[23114.616932] [<
ffffffffa0377929>] btrfs_end_bio+0xf7/0x106 [btrfs]
[23114.616932] [<
ffffffff8125293c>] bio_endio+0x54/0x57
[23114.616932] [<
ffffffff8125955f>] blk_update_request+0x21a/0x30f
[23114.616932] [<
ffffffffa0022316>] scsi_end_request+0x31/0x182 [scsi_mod]
[23114.616932] [<
ffffffffa00235fc>] scsi_io_completion+0x1ce/0x4c8 [scsi_mod]
[23114.616932] [<
ffffffffa001ba9d>] scsi_finish_command+0x104/0x10d [scsi_mod]
[23114.616932] [<
ffffffffa002311f>] scsi_softirq_done+0x101/0x10a [scsi_mod]
[23114.616932] [<
ffffffff8125fbd9>] blk_done_softirq+0x82/0x8d
[23114.616932] [<
ffffffff814c8a4b>] __do_softirq+0x1ab/0x412
[23114.616932] [<
ffffffff8105b01d>] irq_exit+0x49/0x99
[23114.616932] [<
ffffffff81035135>] smp_call_function_single_interrupt+0x24/0x26
[23114.616932] [<
ffffffff814c7ec9>] call_function_single_interrupt+0x89/0x90
[23114.616932] <EOI> [23114.616932] [<
ffffffffa0023262>] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932] [<
ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932] [<
ffffffff814c596c>] ? _raw_spin_unlock_irq+0x32/0x4a
[23114.616932] [<
ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a
[23114.616932] [<
ffffffffa0023262>] scsi_request_fn+0x13a/0x2a1 [scsi_mod]
[23114.616932] [<
ffffffff8125590e>] __blk_run_queue_uncond+0x22/0x2b
[23114.616932] [<
ffffffff81255930>] __blk_run_queue+0x19/0x1b
[23114.616932] [<
ffffffff8125ab01>] blk_queue_bio+0x268/0x282
[23114.616932] [<
ffffffff81258f44>] generic_make_request+0xbd/0x160
[23114.616932] [<
ffffffff812590e7>] submit_bio+0x100/0x11d
[23114.616932] [<
ffffffff81298603>] ? __this_cpu_preempt_check+0x13/0x15
[23114.616932] [<
ffffffff812a1805>] ? __percpu_counter_add+0x8e/0xa7
[23114.616932] [<
ffffffffa03bfd47>] btrfsic_submit_bio+0x1a/0x1d [btrfs]
[23114.616932] [<
ffffffffa0377db2>] btrfs_map_bio+0x1f4/0x26d [btrfs]
[23114.616932] [<
ffffffffa0348a33>] btree_submit_bio_hook+0x74/0xbf [btrfs]
[23114.616932] [<
ffffffffa03489bf>] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs]
[23114.616932] [<
ffffffffa03697a9>] submit_one_bio+0x6b/0x89 [btrfs]
[23114.616932] [<
ffffffffa036f5be>] read_extent_buffer_pages+0x170/0x1ec [btrfs]
[23114.616932] [<
ffffffffa03471fa>] ? free_root_pointers+0x64/0x64 [btrfs]
[23114.616932] [<
ffffffffa0348adf>] readahead_tree_block+0x3f/0x4c [btrfs]
[23114.616932] [<
ffffffffa032e115>] read_block_for_search.isra.20+0x1ce/0x23d [btrfs]
[23114.616932] [<
ffffffffa032fab8>] btrfs_search_slot+0x65f/0x774 [btrfs]
[23114.616932] [<
ffffffffa036eff1>] ? free_extent_buffer+0x73/0x7e [btrfs]
[23114.616932] [<
ffffffffa0331ba4>] btrfs_next_old_leaf+0xa1/0x33c [btrfs]
[23114.616932] [<
ffffffffa0331e4f>] btrfs_next_leaf+0x10/0x12 [btrfs]
[23114.616932] [<
ffffffffa0336aa6>] caching_thread+0x22d/0x416 [btrfs]
[23114.616932] [<
ffffffffa037bce9>] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs]
[23114.616932] [<
ffffffffa037c036>] btrfs_cache_helper+0xe/0x10 [btrfs]
[23114.616932] [<
ffffffff8106cf96>] process_one_work+0x273/0x4e4
[23114.616932] [<
ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
[23114.616932] [<
ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
[23114.616932] [<
ffffffff81072a81>] kthread+0xd5/0xdd
[23114.616932] [<
ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
[23114.616932] [<
ffffffff814c6257>] ret_from_fork+0x27/0x40
[23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b
64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b
[23114.616932] RIP [<
ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs]
[23114.616932] RSP <
ffff88023f443d60>
[23114.689493] ---[ end trace
6e48b6bc707ca34b ]---
[23114.690166] Kernel panic - not syncing: Fatal exception in interrupt
[23114.691283] Kernel Offset: disabled
[23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
The following diagram shows the sequence of operations that lead to the
use-after-free problem from the above trace:
CPU 1 CPU 2 CPU 3
caching_thread()
close_ctree()
btrfs_stop_all_workers()
btrfs_destroy_workqueue(
fs_info->endio_meta_workers)
btrfs_search_slot()
read_block_for_search()
readahead_tree_block()
read_extent_buffer_pages()
submit_one_bio()
btree_submit_bio_hook()
btrfs_bio_wq_end_io()
--> sets the bio's
bi_end_io callback
to end_workqueue_bio()
--> bio is submitted
bio completes
and its bi_end_io callback
is invoked
--> end_workqueue_bio()
--> attempts to queue
a task on fs_info->endio_meta_workers
btrfs_destroy_workqueue(
fs_info->caching_workers)
So fix this by destroying the queues used for metadata I/O tasks only
after destroying all the other queues.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Wed, 1 Feb 2017 22:39:50 +0000 (22:39 +0000)]
Btrfs: fix assertion failure when freeing block groups at close_ctree()
At close_ctree() we free the block groups and then only after we wait for
any running worker kthreads to finish and shutdown the workqueues. This
behaviour is racy and it triggers an assertion failure when freeing block
groups because while we are doing it we can have for example a block group
caching kthread running, and in that case the block group's reference
count can still be greater than 1 by the time we assert its reference count
is 1, leading to an assertion failure:
[19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799
[19041.200584] ------------[ cut here ]------------
[19041.201692] kernel BUG at fs/btrfs/ctree.h:3418!
[19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP
[19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1
[19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[19041.208082] task:
ffff88015f028980 task.stack:
ffffc9000ad34000
[19041.208082] RIP: 0010:[<
ffffffffa03e319e>] [<
ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082] RSP: 0018:
ffffc9000ad37d60 EFLAGS:
00010286
[19041.208082] RAX:
0000000000000061 RBX:
ffff88015ecb4000 RCX:
0000000000000001
[19041.208082] RDX:
ffff88023f392fb8 RSI:
ffffffff817ef7ba RDI:
00000000ffffffff
[19041.208082] RBP:
ffffc9000ad37d60 R08:
0000000000000001 R09:
0000000000000000
[19041.208082] R10:
ffffc9000ad37cb0 R11:
ffffffff82f2b66d R12:
ffff88023431d170
[19041.208082] R13:
ffff88015ecb40c0 R14:
ffff88023431d000 R15:
ffff88015ecb4100
[19041.208082] FS:
00007f44f3d42840(0000) GS:
ffff88023f380000(0000) knlGS:
0000000000000000
[19041.208082] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[19041.208082] CR2:
00007f65d623b000 CR3:
00000002166f2000 CR4:
00000000000006e0
[19041.208082] Stack:
[19041.208082]
ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630
[19041.208082]
ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8
[19041.208082]
ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d
[19041.208082] Call Trace:
[19041.208082] [<
ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs]
[19041.208082] [<
ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs]
[19041.208082] [<
ffffffff811a634d>] ? evict_inodes+0x132/0x141
[19041.208082] [<
ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs]
[19041.208082] [<
ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb
[19041.208082] [<
ffffffff8119004f>] kill_anon_super+0x12/0x1c
[19041.208082] [<
ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs]
[19041.208082] [<
ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68
[19041.208082] [<
ffffffff8118fb34>] deactivate_super+0x36/0x39
[19041.208082] [<
ffffffff811a9946>] cleanup_mnt+0x58/0x76
[19041.208082] [<
ffffffff811a99a2>] __cleanup_mnt+0x12/0x14
[19041.208082] [<
ffffffff81071573>] task_work_run+0x6f/0x95
[19041.208082] [<
ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1
[19041.208082] [<
ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2
[19041.208082] [<
ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad
[19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9
[19041.208082] RIP [<
ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs]
[19041.208082] RSP <
ffffc9000ad37d60>
[19041.279264] ---[ end trace
23330586f16f064d ]---
This started happening as of kernel 4.8, since commit
f3bca8028bd9
("Btrfs: add ASSERT for block group's memory leak") introduced these
assertions.
So fix this by freeing the block groups only after waiting for all
worker kthreads to complete and shutdown the workqueues.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Wed, 1 Feb 2017 14:58:02 +0000 (14:58 +0000)]
Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled
We log holes explicitly by using file extent items, however when replaying
a log tree, if a logged file extent item corresponds to a hole and the
NO_HOLES feature is enabled we do not need to copy the file extent item
into the fs/subvolume tree, as the absence of such file extent items is
the purpose of the NO_HOLES feature. So skip the copying of file extent
items representing holes when the NO_HOLES feature is enabled.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Robbie Ko [Fri, 7 Oct 2016 02:01:29 +0000 (10:01 +0800)]
Btrfs: fix leak of subvolume writers counter
When falling back from a nocow write to a regular cow write, we were
leaking the subvolume writers counter in 2 situations, preventing
snapshot creation from ever completing in the future, as it waits
for that counter to go down to zero before the snapshot creation
starts.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Improved changelog and subject]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Sat, 28 Jan 2017 01:47:56 +0000 (01:47 +0000)]
Btrfs: bulk delete checksum items in the same leaf
Very often we have the checksums for an extent spread in multiple items
in the checksums tree, and currently the algorithm to delete them starts
by looking for them one by one and then deleting them one by one, which
is not optimal since each deletion involves shifting all the other items
in the leaf and when the leaf reaches some low threshold, to move items
off the leaf into its left and right neighbor leafs. Also, after each
item deletion we release our search path and start a new search for other
checksums items.
So optimize this by deleting in bulk all the items in the same leaf that
contain checksums for the extent being freed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Robbie Ko [Thu, 5 Jan 2017 08:24:58 +0000 (16:24 +0800)]
Btrfs: incremental send, do not issue invalid rmdir operations
When both the parent and send snapshots have a directory inode with the
same number but different generations (therefore they are different
inodes) and both have an entry with the same name, an incremental send
stream will contain an invalid rmdir operation that refers to the
orphanized name of the inode from the parent snapshot.
The following example scenario shows how this happens.
Parent snapshot:
.
|---- d259_old/ (ino 259, gen 9)
| |---- d1/ (ino 258, gen 9)
|
|---- f (ino 257, gen 9)
Send snapshot:
.
|---- d258/ (ino 258, gen 7)
|---- d259/ (ino 259, gen 7)
|---- d1/ (ino 257, gen 7)
When the kernel is processing inode 258 it notices that in both snapshots
there is an inode numbered 259 that is a parent of an inode 258. However
it ignores the fact that the inodes numbered 259 have different generations
in both snapshots, which means they are effectively different inodes.
Then it checks that both inodes 259 have a dentry named "d1" and because
of that it issues a rmdir operation with orphanized name of the inode 258
from the parent snapshot. This happens at send.c:process_record_refs(),
which calls send.c:did_overwrite_first_ref() that returns true and because
of that later on at process_recorded_refs() such rmdir operation is issued
because the inode being currently processed (258) is a directory and it
was deleted in the send snapshot (and replaced with another inode that has
the same number and is a directory too).
Fix this issue by comparing the generations of parent directory inodes
that have the same number and make send.c:did_overwrite_first_ref() when
the generations are different.
The following steps reproduce the problem.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/f
$ mkdir /mnt/d1
$ mkdir /mnt/d259_old
$ mv /mnt/d1 /mnt/d259_old/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/d1
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/d1 /mnt/dir259/d1
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt/ -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inodes from the second snapshot all have
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10. This means all the inodes
# in the first snapshot (snap1) have a generation value of 9.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt -f /tmp/1.snap
$ btrfs receive -vv /mnt -f /tmp/2.snap
receiving snapshot mysnap2 uuid=
9c03962f-f620-0047-9f98-
32e5a87116d9, ctransid=7 parent_uuid=
d17a6e3f-14e5-df4f-be39-
a7951a5399aa, parent_ctransid=9
utimes
unlink f
mkdir o257-7-0
mkdir o259-7-0
rename o257-7-0 -> o259-7-0/d1
chown o259-7-0/d1 - uid=0, gid=0
chmod o259-7-0/d1 - mode=0755
utimes o259-7-0/d1
rmdir o258-9-0
ERROR: rmdir o258-9-0 failed: No such file or directory
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 11 Jan 2017 02:15:39 +0000 (02:15 +0000)]
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an
inode we not checking if a parent inode that exists in the send and
parent snapshots is really the same inode or not, that is, we are not
comparing the generation number of the parent inode in the send and
parent snapshots. Not only this results in unnecessarily delaying a
rename operation but also can later on make us generate an incorrect
name for a new inode in the send snapshot that has the same number
as another inode in the parent snapshot but a different generation.
Here follows an example where this happens.
Parent snapshot:
. (ino 256, gen 3)
|--- dir258/ (ino 258, gen 7)
| |--- dir257/ (ino 257, gen 7)
|
|--- dir259/ (ino 259, gen 7)
Send snapshot:
. (ino 256, gen 3)
|--- file258 (ino 258, gen 10)
|
|--- new_dir259/ (ino 259, gen 10)
|--- dir257/ (ino 257, gen 7)
The following steps happen when computing the incremental send stream:
1) When processing inode 257, its new parent is created using its orphan
name (o257-21-0), and the rename operation for inode 257 is delayed
because its new parent (inode 259) was not yet processed - this
decision to delay the rename operation does not make much sense
because the inode 259 in the send snapshot is a new inode, it's not
the same as inode 259 in the parent snapshot.
2) When processing inode 258 we end up delaying its rmdir operation,
because inode 257 was not yet renamed (moved away from the directory
inode 258 represents). We also create the new inode 258 using its
orphan name "o258-10-0", then rename it to its final name of "file258"
and then issue a truncate operation for it. However this truncate
operation contains an incorrect name, which corresponds to the orphan
name and not to the final name, which makes the receiver fail. This
happens because when we attempt to compute the inode's current name
we verify that there's another inode with the same number (258) that
has its rmdir operation pending and because of that we generate an
orphan name for the new inode 258 (we do this in the function
get_cur_path()).
Fix this by not delayed the rename operation of an inode if it has parents
with the same number but different generations in both snapshots.
The following steps reproduce this example scenario.
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/dir257
$ mkdir /mnt/dir258
$ mkdir /mnt/dir259
$ mv /mnt/dir257 /mnt/dir258/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ mv /mnt/dir258/dir257 /mnt/dir257
$ rmdir /mnt/dir258
$ rmdir /mnt/dir259
# Remount the filesystem so that the next created inodes will have the
# numbers 258 and 259. This is because when a filesystem is mounted,
# btrfs sets the subvolume's inode counter to a value corresponding to
# the highest inode number in the subvolume plus 1. This inode counter
# is used to assign a unique number to each new inode and it's
# incremented by 1 after very inode creation.
# Note: we unmount and then mount instead of doing a mount with
# "-o remount" because otherwise the inode counter remains at value 260.
$ umount /mnt
$ mount /dev/sdb /mnt
$ touch /mnt/file258
$ mkdir /mnt/new_dir259
$ mv /mnt/dir257 /mnt/new_dir259/dir257
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ btrfs receive /mnt -f /tmo/1.snap
$ btrfs receive /mnt -f /tmo/2.snap -vv
receiving snapshot mysnap2 uuid=
e059b6d1-7f55-f140-8d7c-
9a3039d23c97, ctransid=10 parent_uuid=
77e98cb6-8762-814f-9e05-
e8ba877fc0b0, parent_ctransid=7
utimes
mkdir o259-10-0
rename dir258 -> o258-7-0
utimes
mkfile o258-10-0
rename o258-10-0 -> file258
utimes
truncate o258-10-0 size=0
ERROR: truncate o258-10-0 failed: No such file or directory
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Robbie Ko [Thu, 5 Jan 2017 08:24:55 +0000 (16:24 +0800)]
Btrfs: send, fix failure to rename top level inode due to name collision
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.
Consider the following example scenario.
Parent snapshot:
. (ino 256, gen 8)
|---- a1/ (ino 257, gen 9)
|---- a2/ (ino 258, gen 9)
Send snapshot:
. (ino 256, gen 3)
|---- a2/ (ino 257, gen 7)
In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):
rmdir a1
mkfile o257-7-0
rename o257-7-0 -> a2
ERROR: rename o257-7-0 -> a2 failed: Is a directory
What happens when computing the incremental send stream is:
1) An operation to remove the directory with inode number 257 and
generation 9 is issued.
2) An operation to create the inode with number 257 and generation 7 is
issued. This creates the inode with an orphanized name of "o257-7-0".
3) An operation rename the new inode 257 to its final name, "a2", is
issued. This is incorrect because inode 258, which has the same name
and it's a child of the same parent (root inode 256), was not yet
processed and therefore no rmdir operation for it was yet issued.
The rename operation is issued because we fail to detect that the
name of the new inode 257 collides with inode 258, because their
parent, a subvolume/snapshot root (inode 256) has a different
generation in both snapshots.
So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir /mnt/a1
$ mkdir /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inode from the second snapshot has
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt /tmp/1.snap
# Receive of snapshot snap2 used to fail.
$ btrfs receive /mnt /tmp/2.snap
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Liu Bo [Tue, 21 Feb 2017 20:12:58 +0000 (12:12 -0800)]
Btrfs: use the correct type when creating cow dio extent
'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch
'Btrfs: specify a new ordered extent type for create_io_em',
but it missed the directIO cow case.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 21 Feb 2017 17:14:52 +0000 (17:14 +0000)]
Btrfs: fix deadlock between dedup on same file and starting writeback
If we are deduping two ranges of the same file we need to make sure that
we lock all pages in ascending order, that is, lock first the pages from
the range with lower offset and then the pages from the other range, as
otherwise we can deadlock with a concurrent task that is starting delalloc
(writeback). Example trace:
[74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds.
[74073.053889] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1
[74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.056696] kworker/u32:10 D 0 17997 2 0x00000000
[74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176)
[74073.061370]
ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240
[74073.064784]
ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000
[74073.068386]
ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240
[74073.071712] Call Trace:
[74073.072884] [<
ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.075395] [<
ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.077511] [<
ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.079440] [<
ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.081637] [<
ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14
[74073.083809] [<
ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197
[74073.086314] [<
ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.100654] [<
ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.102619] [<
ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.104771] [<
ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.106969] [<
ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.108954] [<
ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.110981] [<
ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.112833] [<
ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.115010] [<
ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs]
[74073.116999] [<
ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs]
[74073.119243] [<
ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs]
[74073.121636] [<
ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs]
[74073.124229] [<
ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs]
[74073.126372] [<
ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs]
[74073.129371] [<
ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs]
[74073.131440] [<
ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs]
[74073.134303] [<
ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1
[74073.136298] [<
ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
[74073.138248] [<
ffffffff81138200>] do_writepages+0x23/0x2c
[74073.139910] [<
ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
[74073.142003] [<
ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
[74073.136298] [<
ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs]
[74073.138248] [<
ffffffff81138200>] do_writepages+0x23/0x2c
[74073.139910] [<
ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2
[74073.142003] [<
ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1
[74073.143911] [<
ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae
[74073.145787] [<
ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7
[74073.147452] [<
ffffffff811b60cd>] wb_workfn+0x194/0x37d
[74073.149084] [<
ffffffff811b60cd>] ? wb_workfn+0x194/0x37d
[74073.150726] [<
ffffffff8106ce77>] ? process_one_work+0x154/0x4e4
[74073.152694] [<
ffffffff8106cf96>] process_one_work+0x273/0x4e4
[74073.154452] [<
ffffffff8106d6db>] worker_thread+0x1eb/0x2ca
[74073.156138] [<
ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6
[74073.157837] [<
ffffffff81072a81>] kthread+0xd5/0xdd
[74073.159339] [<
ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a
[74073.161088] [<
ffffffff814c6257>] ret_from_fork+0x27/0x40
[74073.162680] INFO: lockdep is turned off.
[74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds.
[74073.181180] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1
[74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[74073.185296] fdm-stress D 0 30264 29974 0x00000000
[74073.186810]
ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00
[74073.188998]
ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000
[74073.191070]
ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00
[74073.193286] Call Trace:
[74073.193990] [<
ffffffff814c15e1>] ? __schedule+0x48f/0x6f4
[74073.195418] [<
ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f
[74073.196796] [<
ffffffff814c18d2>] schedule+0x8c/0xa0
[74073.198163] [<
ffffffff814c4b36>] schedule_timeout+0x43/0xff
[74073.199621] [<
ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf
[74073.201100] [<
ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32
[74073.202686] [<
ffffffff810be048>] ? ktime_get+0x41/0x52
[74073.204051] [<
ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102
[74073.205585] [<
ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102
[74073.207123] [<
ffffffff814c1ca6>] bit_wait_io+0x1b/0x39
[74073.208238] [<
ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99
[74073.208871] [<
ffffffff8112b692>] __lock_page+0x6b/0x6d
[74073.209430] [<
ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a
[74073.210101] [<
ffffffff8112b800>] lock_page+0x2f/0x32
[74073.210636] [<
ffffffff8112c502>] pagecache_get_page+0x5e/0x153
[74073.211270] [<
ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs]
[74073.212166] [<
ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs]
[74073.213257] [<
ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221
[74073.214086] [<
ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600
[74073.214767] [<
ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d
[74073.215619] [<
ffffffff811a7953>] ? __fget+0x6b/0x77
[74073.216338] [<
ffffffff8119e2d9>] SyS_ioctl+0x57/0x79
[74073.217149] [<
ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad
[74073.218102] [<
ffffffff81109552>] ? time_hardirqs_off+0x9/0x14
[74073.218968] [<
ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa
[74073.219938] INFO: lockdep is turned off.
What happened was the following:
CPU 1 CPU 2
btrfs_dedupe_file_range()
--> using same inode as source
and target
--> src range is [768K, 1Mb[
--> dst range is [0, 256K[
btrfs_cmp_data_prepare()
--> calls gather_extent_pages()
for range [768K, 1Mb[ and
locks all pages in that range
do_writepages()
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
__extent_writepage()
writepage_delalloc()
find_lock_delalloc_range()
--> finds range [0, 1Mb[
lock_delalloc_pages()
--> locks all pages in the
range [0, 768K[
--> tries to lock page at
offset 768K
--> deadlock
--> calls gather_extent_pages()
to lock pages in the range
[0, 256K[
--> deadlock, task at CPU 1
already locked that
range and it's trying
to lock the range we
locked previously
So fix this by making sure that during a dedup we always lock first the
pages from the range with lower offset.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:34 +0000 (16:28 -0500)]
btrfs: use btrfs_debug instead of pr_debug in transaction abort
Commit
e5d6b12fe14 (Btrfs: don't WARN() in btrfs_transaction_abort() for
IO errors) added a pr_debug call to be printed when a transaction is
aborted with -EIO instead of WARN. btrfs_debug prints which file system
the message is associated with so let's use that instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:32 +0000 (16:28 -0500)]
btrfs: btrfs_truncate_free_space_cache always allocates path
btrfs_truncate_free_space_cache always allocates a btrfs_path structure
but only uses it when the caller passes a block group. Let's move the
allocation and free into the conditional.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:30 +0000 (16:28 -0500)]
btrfs: free-space-cache, clean up unnecessary root arguments
The free space cache APIs accept a root but always use the tree root.
Also, btrfs_truncate_free_space_cache accepts a root AND an inode but
the inode always points to the root anyway, so let's just pass the inode.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:29 +0000 (16:28 -0500)]
btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree. Let's convert
to passing an fs_info and using the extent root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:28 +0000 (16:28 -0500)]
btrfs: flush_space always takes fs_info->fs_root
We don't need to pass a root to flush_space since it always uses
the fs_root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 15 Feb 2017 21:28:27 +0000 (16:28 -0500)]
btrfs: pass fs_info to (more) routines that are only called with extent_root
Outside of interactions with qgroups, the roots passed in extent-tree.c
are usually passed to ensure that we don't do refcounts on log trees or
to get the allocation profile for an allocation request. Otherwise, it
operates on the extent root. This patch converts some more routines in
extent-tree.c that are always called with the extent root to accept
an fs_info instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 15 Feb 2017 02:43:03 +0000 (10:43 +0800)]
btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
Just as Filipe pointed out, the most time consuming parts of qgroup are
btrfs_qgroup_account_extents() and
btrfs_qgroup_prepare_account_extents().
Which both call btrfs_find_all_roots() to get old_roots and new_roots
ulist.
What makes things worse is, we're calling that expensive
btrfs_find_all_roots() at transaction committing time with
TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.
Such behavior is necessary for @new_roots search as current
btrfs_find_all_roots() can't do it correctly so we do call it just
before switch commit roots.
However for @old_roots search, it's not necessary as such search is
based on commit_root, so it will always be correct and we can move it
out of transaction committing.
This patch moves the @old_roots search part out of
commit_transaction(), so in theory we can half the time qgroup time
consumption at commit_transaction().
But please note that, this won't speedup qgroup overall, the total time
consumption is still the same, just reduce the performance stall.
Cc: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:30:23 +0000 (20:30 +0100)]
btrfs: remove unused parameter from adjust_slots_upwards
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:26:24 +0000 (20:26 +0100)]
btrfs: remove unused parameters from __btrfs_write_out_cache
Both unused after the call to update_cache_item has been moved to
__btrfs_wait_cache_io.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:23:00 +0000 (20:23 +0100)]
btrfs: remove unused parameter from cleanup_write_cache_enospc
bitmap_list is unused since the io_ctl framework.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:20:19 +0000 (20:20 +0100)]
btrfs: remove unused parameter from __add_inode_ref
Unused since the helper has been split, eb used in the caller.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:18:49 +0000 (20:18 +0100)]
btrfs: remove unused parameter from clone_copy_inline_extent
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 19:15:10 +0000 (20:15 +0100)]
btrfs: remove unused parameters from btrfs_cmp_data
After the page locking has been reworked, we get all pages prepared via
cmp_pages.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:57:27 +0000 (19:57 +0100)]
btrfs: remove unused parameter from __add_inline_refs
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:55:54 +0000 (19:55 +0100)]
btrfs: remove unused parameters from scrub_setup_wr_ctx
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:54:06 +0000 (19:54 +0100)]
btrfs: remove unused parameter from create_snapshot
The name parameters have never been used, as the name is passed via the
dentry.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:49:01 +0000 (19:49 +0100)]
btrfs: remove unused parameter from init_first_rw_device
The 'device' used to be added in that function, but now it's done by the
caller.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:46:27 +0000 (19:46 +0100)]
btrfs: remove unused parameter from __btrfs_alloc_chunk
We grab fs_info from other parameters.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:44:31 +0000 (19:44 +0100)]
btrfs: remove unused parameter from btrfs_fill_super
Never used for anything meaningful since we have our own superblock
filler.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:38:24 +0000 (19:38 +0100)]
btrfs: remove unused parameter from extent_write_cache_pages
The 'tree' was used to call locking hook that does not exist anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:35:37 +0000 (19:35 +0100)]
btrfs: remove unused parameter from add_pending_csums
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:33:41 +0000 (19:33 +0100)]
btrfs: remove unused parameter from update_nr_written
The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (
a91326679f2a0a4c2) and page is not
needed anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:29:38 +0000 (19:29 +0100)]
btrfs: remove unused parameter from submit_extent_page
This used to hold number of maximum pages to allocate, but this is now
limited by BIO_MAX_PAGES. The local are now unused and removed as well.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:25:51 +0000 (19:25 +0100)]
btrfs: remove unused parameter from tree_move_next_or_upnext
Not needed.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:24:53 +0000 (19:24 +0100)]
btrfs: remove unused parameter from tree_move_down
Never needed.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:23:20 +0000 (19:23 +0100)]
btrfs: remove unused parameter from btrfs_check_super_valid
None of the checks need to know the ro/rw status as they're all not
changing the superblock. Moreover, we can access the sb flags directly
if we'd need to decide by the ro/rw status.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:20:56 +0000 (19:20 +0100)]
btrfs: remove unused parameter from btrfs_prepare_extent_commit
Added but never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:18:18 +0000 (19:18 +0100)]
btrfs: remove unused parameter from btrfs_subvolume_release_metadata
Unused since qgroup refactoring that split data and metadata accounting,
the btrfs_qgroup_free helper.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:14:36 +0000 (19:14 +0100)]
btrfs: remove unused parameter from __push_leaf_left
Unused since long ago.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:13:06 +0000 (19:13 +0100)]
btrfs: remove unused parameter from __push_leaf_right
Unused since long ago.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 18:04:32 +0000 (19:04 +0100)]
btrfs: merge two superblock writing helpers
write_all_supers and write_ctree_super are almost equal, the parameter
'trans' is unused so we can drop it and have just one helper.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 17:52:06 +0000 (18:52 +0100)]
btrfs: remove unused parameter from write_dev_supers
The barriers are handled by the caller.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 17:49:53 +0000 (18:49 +0100)]
btrfs: remove unused parameter from split_item
Never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 17:47:57 +0000 (18:47 +0100)]
btrfs: remove unused parameter from clean_tree_block
Added but never needed.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 17:46:09 +0000 (18:46 +0100)]
btrfs: remove unused parameter from check_async_write
Added but never used.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 10 Feb 2017 17:44:32 +0000 (18:44 +0100)]
btrfs: remove unused parameter from read_block_for_search
Never used in that function.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 15 Feb 2017 15:47:36 +0000 (16:47 +0100)]
btrfs: ulist: rename ulist_fini to ulist_release
Change the name so it matches the naming we already use eg. for
btrfs_path.
Suggested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 14:05:33 +0000 (15:05 +0100)]
btrfs: remove pointless rcu protection from btrfs_qgroup_inherit
There was never need for RCU protection around reading nodesize or other
fairly constant filesystem data.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 13:24:35 +0000 (14:24 +0100)]
btrfs: qgroups: opencode qgroup_free helper
The helper name is not too helpful and is just wrapping a simple call.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 13:07:02 +0000 (14:07 +0100)]
btrfs: remove unnecessary mutex lock in qgroup_account_snapshot
The quota status used to be tracked as a variable, so the mutex was
needed (until "Btrfs: add a flags field to btrfs_fs_info"
afcdd129e05a9).
Since the status is a bit modified atomically and we don't hold the
mutex beyond the check, we can drop it.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 13:05:24 +0000 (14:05 +0100)]
btrfs: check quota status earlier and don't do unnecessary frees
Status of quotas should be the first check in
btrfs_qgroup_account_extent and we can return immediatelly, no need to
do no-op ulist frees.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 12:42:29 +0000 (13:42 +0100)]
btrfs: embed extent_changeset::range_changed to the structure
We can embed range_changed to the extent changeset to address following
problems:
- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data
The stack consuption where extent_changeset is used slightly increases:
before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40
Which is bearable.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 12:40:16 +0000 (13:40 +0100)]
btrfs: ulist: make the finalization function public
Make ulist_fini externally visible so the ulist API is complete.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 12:00:51 +0000 (13:00 +0100)]
btrfs: qgroups: make __del_qgroup_relation static
Internal helper.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 23 Jan 2017 16:28:19 +0000 (17:28 +0100)]
btrfs: make space cache inode readahead failure nonfatal
We do a readahead of the free space cache inode to speed things up but
the failure is not fatal, like in other readahead cases. Proper reads
would need to happen anyway and any errors would be caught there.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 11:41:02 +0000 (12:41 +0100)]
btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation
Qgroup relations are added/deleted from ioctl, we hold the high level
qgroup lock, no deadlocks or recursion from the allocation possible
here.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 10:03:44 +0000 (11:03 +0100)]
btrfs: use GFP_KERNEL in btrfs_quota_enable
We don't need to use GFP_NOFS here as this is called from ioctls an the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 11:10:20 +0000 (12:10 +0100)]
btrfs: use GFP_KERNEL in btrfs_read_qgroup_config
The qgroup config is read during mount, we do not have to use NOFS.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 13 Feb 2017 10:03:44 +0000 (11:03 +0100)]
btrfs: use GFP_KERNEL in create_snapshot
We don't need to use GFP_NOFS here as this is called from ioctls an the
only lock held is the subvol_sem, which is of a high level and protects
creation/renames/deletion and is never held in the writeout paths.
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 13 Feb 2017 23:35:09 +0000 (15:35 -0800)]
Btrfs: specify a new ordered extent type for create_io_em
As 0 refers to an existing type BTRFS_ORDERED_IO_DONE, this specifies a
new type 'REGULAR' for regular IO.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 31 Jan 2017 15:50:22 +0000 (07:50 -0800)]
Btrfs: create a helper to create em for IO
We have similar codes to create and insert extent mapping around IO path,
this merges them into a single helper.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 13 Feb 2017 23:42:21 +0000 (15:42 -0800)]
Btrfs: use helper to get used bytes of space_info
This uses a helper instead of open code around used byte of space_info
everywhere.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 13 Feb 2017 23:42:30 +0000 (15:42 -0800)]
Btrfs: try to avoid acquiring free space ctl's lock
We don't need to take the lock if the block group has not been cached.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Feb 2017 02:45:06 +0000 (10:45 +0800)]
btrfs: Better csum error message for data csum mismatch
The original csum error message only outputs inode number, offset, check
sum and expected check sum.
However no root objectid is outputted, which sometimes makes debugging
quite painful under multi-subvolume case (including relocation).
Also the checksum output is decimal, which seldom makes sense for
users/developers and is hard to read in most time.
This patch will add root objectid, which will be %lld for rootid larger
than LAST_FREE_OBJECTID, and hex csum output for better readability.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Takafumi Kubota [Thu, 9 Feb 2017 08:24:33 +0000 (17:24 +0900)]
Btrfs: add another missing end_page_writeback on submit_extent_page failure
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.
__extent_writepage_io, that is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this will cause the hang at filemap_fdatawait_range,
because it waits the writeback bit to be cleared from the failed page.
So, we have to call end_page_writeback to clear the writeback bit.
For reproducing the hang, we inject a fault like
if (should_failtest()) { // I define should_failtest()
bio = NULL;
}
else {
bio = btrfs_bio_alloc(...);
}
in submit_extent_page.
We should also check whether page has the bit before end_page_writeback,
to avoid the conflict against the other end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io,
but also in write_one_eb too, because it misses the check.
Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 9 Feb 2017 15:47:43 +0000 (16:47 +0100)]
btrfs: remove unused ulist members
Commit "btrfs: ulist: Add ulist_del() function" (
d4b804045924d7f8)
removed some debugging code but left the structure defintions.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 10 Feb 2017 15:42:14 +0000 (16:42 +0100)]
Btrfs: use helper to simplify lock/unlock pages
Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 10 Feb 2017 15:41:05 +0000 (16:41 +0100)]
btrfs: teach __process_pages_contig about PAGE_LOCK operation
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Fri, 3 Feb 2017 01:49:22 +0000 (17:49 -0800)]
Btrfs: create helper for processing bits on contiguous pages
This introduces a new helper which can be used to process pages bits.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 30 Jan 2017 20:25:28 +0000 (12:25 -0800)]
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
run_delalloc_nocow has used trans in two places where they don't
actually need @trans.
For btrfs_lookup_file_extent, we search for file extents without COWing
anything, and for btrfs_cross_ref_exist, the only place where we need
@trans is deferencing it in order to get running_transaction which we
could easily get from the global fs_info.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 30 Jan 2017 20:24:37 +0000 (12:24 -0800)]
Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head
All we need is @delayed_refs, all callers have get it ahead of calling
btrfs_find_delayed_ref_head since lock needs to be acquired firstly,
there is no reason to deference it again inside the function.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Mon, 30 Jan 2017 20:23:42 +0000 (12:23 -0800)]
Btrfs: remove unused trans in read_block_for_search
@trans is not used at all, this removes it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 26 Jan 2017 01:15:54 +0000 (17:15 -0800)]
Btrfs: cleanup unused cached_state in __extent_writepage_io
@cached_state is no more required in __extent_writepage_io, also remove
the goto label.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 25 Jan 2017 14:50:33 +0000 (09:50 -0500)]
btrfs: allow unlink to exceed subvolume quota
Once a qgroup limit is exceeded, it's impossible to restore normal
operation to the subvolume without modifying the limit or removing
the subvolume. This is a surprising situation for many users used
to the typical workflow with quotas on other file systems where it's
possible to remove files until the used space is back under the limit.
When we go to unlink a file and start the transaction, we'll hit
the qgroup limit while trying to reserve space for the items we'll
modify while removing the file. We discussed last month how best
to handle this situation and agreed that there is no perfect solution.
The best principle-of-least-surprise solution is to handle it similarly
to how we already handle ENOSPC when unlinking, which is to allow
the operation to succeed with the expectation that it will ultimately
release space under most circumstances.
This patch modifies the transaction start path to select whether to
honor the qgroups limits. btrfs_start_transaction_fallback_global_rsv
is the only caller that skips enforcement. The reservation and tracking
still happens normally -- it just skips the enforcement step.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 24 Jan 2017 23:58:51 +0000 (15:58 -0800)]
Btrfs: fix wrong argument for btrfs_lookup_ordered_range
Commit Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units"
(
d0b7da88) did this, but btrfs_lookup_ordered_range expects a 'length'
rather than a 'page_end'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 16 Jan 2017 02:23:06 +0000 (10:23 +0800)]
btrfs: raid56: Remove unused variable in lock_stripe_add
Variable 'walk' in lock_stripe_add() is not used. Remove it.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Wed, 18 Jan 2017 07:37:38 +0000 (23:37 -0800)]
Btrfs: refactor btrfs_extent_same() slightly
This was originally a prep patch for changing the behavior on len=0, but
we went another direction with that. This still makes the function
slightly easier to follow.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Wed, 18 Jan 2017 07:24:37 +0000 (23:24 -0800)]
Btrfs: constify struct btrfs_{,disk_}key wherever possible
In a lot of places, it's unclear when it's safe to reuse a struct
btrfs_key after it has been passed to a helper function. Constify these
arguments wherever possible to make it obvious.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 15 Dec 2016 06:36:05 +0000 (22:36 -0800)]
Btrfs: fix another race between truncate and lockless dio write
Dio writes can update i_size in btrfs_get_blocks_direct when it
writes to offset beyond EOF so that endio can update disk_i_size
correctly (because we don't udpate disk_i_size beyond i_size).
However, when truncating down a file, we firstly update i_size
and then wait for in-flight lockless dio reads/writes, according
to the above, i_size may have been changed in dio writes, and
file extents don't get truncated.
For lockless dio writes are always overwrites, i_size is not
supposed to be changed, so this adds a check to filter out this
case.
The race could be reproduced by fstests/generic/299 with patch
"Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly"
applied.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 13 Dec 2016 20:51:51 +0000 (12:51 -0800)]
Btrfs: clean up btrfs_ordered_update_i_size
Since we have a good helper entry_end, use it for ordered extent.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ whitespace reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 13 Dec 2016 20:15:19 +0000 (12:15 -0800)]
Btrfs: fix comment in btrfs_page_mkwrite
The comment about "page_mkwrite gets called every time the page is
dirtied" in btrfs_page_mkwrite is not correct, it only gets called the
first time the page gets dirtied after the page faults in.
However, we don't need to touch the code because it works well, although
the proper logic is to check if delalloc bits has been set and if so, go
free reserved space, if not, set the delalloc bits for dirty page range.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Thu, 1 Dec 2016 21:01:02 +0000 (13:01 -0800)]
Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly
btrfs_ordered_update_i_size can be called by truncate and endio, but
only endio takes ordered_extent which contains the completed IO.
while truncating down a file, if there are some in-flight IOs,
btrfs_ordered_update_i_size in endio will set disk_i_size to
@orig_offset that is zero. If truncating-down fails somehow, we try to
recover in memory isize with this zero'd disk_i_size.
Fix it by only updating disk_i_size with @orig_offset when
btrfs_ordered_update_i_size is not called from endio while truncating
down and waiting for in-flight IOs completing their work before recover
in-memory size.
Besides fixing the above issue, add an assertion for last_size to double
check we truncate down to the desired size.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 20 Jan 2017 13:54:07 +0000 (14:54 +0100)]
btrfs: fix over-80 lines introduced by previous cleanups
This goes as a separate patch because fixing that inside the patches
caused too many many conflicts.
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:50 +0000 (00:31 +0200)]
btrfs: Make count_inode_refs take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:49 +0000 (00:31 +0200)]
btrfs: Make count_inode_extrefs take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:48 +0000 (00:31 +0200)]
btrfs: Make btrfs_log_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:47 +0000 (00:31 +0200)]
btrfs: Make log_inode_item take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:46 +0000 (00:31 +0200)]
btrfs: Make __add_inode_ref take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:45 +0000 (00:31 +0200)]
btrfs: Make drop_one_dir_item take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:44 +0000 (00:31 +0200)]
btrfs: Make btrfs_unlink_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:43 +0000 (00:31 +0200)]
btrfs: Make log_new_dir_dentries take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:42 +0000 (00:31 +0200)]
btrfs: Make log_directory_changes take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:41 +0000 (00:31 +0200)]
btrfs: Make log_dir_items take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:40 +0000 (00:31 +0200)]
btrfs: Make btrfs_log_changed_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:39 +0000 (00:31 +0200)]
btrfs: Make btrfs_get_logged_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 17 Jan 2017 22:31:38 +0000 (00:31 +0200)]
btrfs: Make btrfs_log_trailing_hole take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>