Filipe Manana [Wed, 27 Jan 2016 19:17:20 +0000 (19:17 +0000)]
Btrfs: remove no longer used function extent_read_full_page_nolock()
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 27 Jan 2016 18:37:47 +0000 (18:37 +0000)]
Btrfs: fix page reading in extent_same ioctl leading to csum errors
In the extent_same ioctl, we were grabbing the pages (locked) and
attempting to read them without bothering about any concurrent IO
against them. That is, we were not checking for any ongoing ordered
extents nor waiting for them to complete, which leads to a race where
the extent_same() code gets a checksum verification error when it
reads the pages, producing a message like the following in dmesg
and making the operation fail to user space with -ENOMEM:
[18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum
685204116 expected csum
1515870868
Fix this by using btrfs_readpage() for reading the pages instead of
extent_read_full_page_nolock(), which waits for any concurrent ordered
extents to complete and locks the io range. Also do better error handling
and don't treat all failures as -ENOMEM, as that's clearly misleasing,
becoming identical to the checks and operation of prepare_uptodate_page().
The use of extent_read_full_page_nolock() was required before
commit
f441460202cb ("btrfs: fix deadlock with extent-same and readpage"),
as we had the range locked in an inode's io tree before attempting to
read the pages.
Fixes: f441460202cb ("btrfs: fix deadlock with extent-same and readpage")
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 27 Jan 2016 10:20:58 +0000 (10:20 +0000)]
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl
In the extent_same ioctl we are getting the pages for the source and
target ranges and unlocking them immediately after, which is incorrect
because later we attempt to map them (with kmap_atomic) and access their
contents at btrfs_cmp_data(). When we do such access the pages might have
been relocated or removed from memory, which leads to an invalid memory
access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
which produces a trace like the following:
186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
unloaded: btrfs]
[186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G W 4.4.0-rc6-btrfs-next-18+ #1
[186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[186736.681319] task:
ffff880132600400 ti:
ffff880362284000 task.ti:
ffff880362284000
[186736.681319] RIP: 0010:[<
ffffffff81264d00>] [<
ffffffff81264d00>] memcmp+0xb/0x22
[186736.681319] RSP: 0018:
ffff880362287d70 EFLAGS:
00010287
[186736.681319] RAX:
000002c002468acf RBX:
0000000012345678 RCX:
0000000000000000
[186736.681319] RDX:
0000000000001000 RSI:
0005d129c5cf9000 RDI:
0005d129c5cf9000
[186736.681319] RBP:
ffff880362287d70 R08:
0000000000000000 R09:
0000000000001000
[186736.681319] R10:
ffff880000000000 R11:
0000000000000476 R12:
0000000000001000
[186736.681319] R13:
ffff8802f91d4c88 R14:
ffff8801f2a77830 R15:
ffff880352e83e40
[186736.681319] FS:
00007f27b37fe700(0000) GS:
ffff88043dda0000(0000) knlGS:
0000000000000000
[186736.681319] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[186736.681319] CR2:
00007f27a406a000 CR3:
0000000217421000 CR4:
00000000001406e0
[186736.681319] Stack:
[186736.681319]
ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
[186736.681319]
0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
[186736.681319]
00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
[186736.681319] Call Trace:
[186736.681319] [<
ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
[186736.681319] [<
ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
[186736.681319] [<
ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
[186736.681319] [<
ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
[186736.681319] [<
ffffffff8118b4f3>] ? __fget_light+0x62/0x71
[186736.681319] [<
ffffffff8118240e>] SyS_ioctl+0x57/0x79
[186736.681319] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
(gdb) list *(btrfs_ioctl+0x24cb)
0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
2967 dst_addr = kmap_atomic(dst_page);
2968
2969 flush_dcache_page(src_page);
2970 flush_dcache_page(dst_page);
2971
2972 if (memcmp(addr, dst_addr, cmp_len))
2973 ret = BTRFS_SAME_DATA_DIFFERS;
2974
2975 kunmap_atomic(addr);
2976 kunmap_atomic(dst_addr);
So fix this by making sure we keep the pages locked and respect the same
locking order as everywhere else: get and lock the pages first and then
lock the range in the inode's io tree (like for example at
__btrfs_buffered_write() and extent_readpages()). If an ordered extent
is found after locking the range in the io tree, unlock the range,
unlock the pages, wait for the ordered extent to complete and repeat the
entire locking process until no overlapping ordered extents are found.
Cc: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Chris Mason [Fri, 29 Jan 2016 16:19:37 +0000 (08:19 -0800)]
Revert "btrfs: synchronize incompat feature bits with sysfs files"
This reverts commit
14e46e04958df740c6c6a94849f176159a333f13.
This ends up doing sysfs operations from deep in balance (where we
should be GFP_NOFS) and under heavy balance load, we're making races
against sysfs internals.
Revert it for now while we figure things out.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 27 Jan 2016 14:38:45 +0000 (06:38 -0800)]
btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
This was copied incorrectly from the __vmalloc call.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 27 Jan 2016 13:48:23 +0000 (05:48 -0800)]
Merge branch 'dev/fst-followup' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
David Sterba [Wed, 27 Jan 2016 13:06:29 +0000 (14:06 +0100)]
btrfs: sysfs: check initialization state before updating features
If the mount phase is not finished, we can't update the sysfs files.
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Mon, 25 Jan 2016 10:02:06 +0000 (11:02 +0100)]
Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
This reverts commit
696249132158014d594896df3a81390616069c5c. The
cleaner thread can block freezing when there's a snapshot cleaning in
progress and the other threads get suspended first. From the logs
provided by Martin we're waiting for reading extent pages:
kernel: PM: Syncing filesystems ... done.
kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
kernel: Freezing remaining freezable tasks ...
kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
kernel: btrfs-cleaner D
ffff88033dd13bc0 0 152 2 0x00000000
kernel:
ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
kernel:
ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
kernel:
7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
kernel: Call Trace:
kernel: [<
ffffffff814a58df>] ? bit_wait+0x2c/0x2c
kernel: [<
ffffffff814a5451>] ? schedule+0x6f/0x7c
kernel: [<
ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
kernel: [<
ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
kernel: [<
ffffffff81077603>] ? ktime_get+0x36/0x44
kernel: [<
ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<
ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
kernel: [<
ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
kernel: [<
ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
kernel: [<
ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
kernel: [<
ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
kernel: [<
ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
kernel: [<
ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
kernel: [<
ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
kernel: [<
ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
kernel: [<
ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
kernel: [<
ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
kernel: [<
ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
kernel: [<
ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
kernel: [<
ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
kernel: [<
ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
kernel: [<
ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
kernel: [<
ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
kernel: [<
ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
kernel: [<
ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
kernel: [<
ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
kernel: [<
ffffffff8104d256>] ? kthread+0x95/0x9d
kernel: [<
ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
kernel: [<
ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
kernel: [<
ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
As this affects a released kernel (4.4) we need a minimal fix for
stable kernels.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361
Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
CC: stable@vger.kernel.org # 4.4
CC: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Fri, 22 Jan 2016 01:28:38 +0000 (09:28 +0800)]
btrfs: async-thread: Fix a use-after-free error for trace
Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
So no one use use that pointer after queue_work().
Fix the user-after-free bug by move the trace line before queue_work().
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 21 Jan 2016 10:17:54 +0000 (10:17 +0000)]
Btrfs: fix race between fsync and lockless direct IO writes
An fsync, using the fast path, can race with a concurrent lockless direct
IO write and end up logging a file extent item that points to an extent
that wasn't written to yet. This is because the fast fsync path collects
ordered extents into a local list and then collects all the new extent
maps to log file extent items based on them, while the direct IO write
path creates the new extent map before it creates the corresponding
ordered extent (and submitting the respective bio(s)).
So fix this by making the direct IO write path create ordered extents
before the extent maps and make the fast fsync path collect any new
ordered extents after it collects the extent maps.
Note that making the fsync handler call inode_dio_wait() (after acquiring
the inode's i_mutex) would not work and lead to a deadlock when doing
AIO, as through AIO we end up in a path where the fsync handler is called
(through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
before the inode's dio counter is decremented (inode_dio_wait() waits
for this counter to have a value of zero).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Tue, 26 Jan 2016 00:43:13 +0000 (16:43 -0800)]
Merge branch 'fix/fst-sysfs' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Mon, 25 Jan 2016 15:47:10 +0000 (16:47 +0100)]
btrfs: add free space tree to the cow-only list
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 25 Jan 2016 15:30:22 +0000 (16:30 +0100)]
btrfs: add free space tree to lockdep classes
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 22 Jan 2016 16:16:18 +0000 (17:16 +0100)]
btrfs: tweak free space tree bitmap allocation
The requested bitmap size varies, observed numbers were < 4K up to 16K.
Using vmalloc unconditionally would be too heavy, we'll try contiguous
allocations first and fall back to vmalloc if there's no contig memory.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 22 Jan 2016 09:28:24 +0000 (10:28 +0100)]
btrfs: tests: switch to GFP_KERNEL
There's no reason to do GFP_NOFS in tests, it's not data-heavy and
memory allocation failures would affect only developers or testers.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:54:41 +0000 (18:54 +0100)]
btrfs: synchronize incompat feature bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount (eg. the
LZO compression).
Synchronize the feature bits with the sysfs files representing them
right after we set/clear them.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:50:40 +0000 (18:50 +0100)]
btrfs: sysfs: introduce helper for syncing bits with sysfs files
The files under /sys/fs/UUID/features get out of sync with the actual
incompat bits set for the filesystem if they change after mount. We're
going to sync them and need a helper to do that.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 21 Jan 2016 17:36:46 +0000 (18:36 +0100)]
btrfs: sysfs: add free-space-tree bit attribute
The incompat bit representing the newly added free space tree feature is
missing. Right now it will be listed only among features supported by
the module, not per-fs.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 20 Jan 2016 18:07:04 +0000 (19:07 +0100)]
btrfs: sysfs: fix typo in compat_ro attribute definition
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Tue, 12 Jan 2016 09:52:13 +0000 (17:52 +0800)]
btrfs: raid56: Use raid_write_end_io for scrub
No need to create additional end_io function for scrub, it increased
code size and introduced some un-unified lines, as:
raid_write_parity_end_io():
int err = bio->bi_error;
if (bio->bi_error)
raid_write_end_io():
int err = bio->bi_error;
if (err)
This patch combines them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 12 Jan 2016 09:22:13 +0000 (17:22 +0800)]
btrfs: Remove unnecessary ClearPageUptodate for raid56
PageUptodate flag already initialized to 0 for new page,
no need to set it again.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 3 Mar 2015 12:42:48 +0000 (20:42 +0800)]
btrfs: use rbio->nr_pages to reduce calculation
We can use rbio->stripe_npages to reduce unnecessary calculation in
many code place.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 3 Mar 2015 12:38:46 +0000 (20:38 +0800)]
btrfs: Use unified stripe_page's index calculation
We are using different index calculation method for stripe_page in
current code:
1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index
2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index
3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index
...
They can get same result when stripe_len align to PAGE_CACHE_SIZE,
this is why current code can work, intruduce and use a common function
for calculation is a better choose.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Mon, 8 Dec 2014 11:55:57 +0000 (19:55 +0800)]
btrfs: Fix calculation of rbio->dbitmap's size calculation
Current code is trying to calculate rbio->dbitmap's size to make it
align to sizeof(long), but implement haven't achived this object,
it is align to sizeof(char) instead.
This patch fixed above calculation, and use sizeof(long) instead of
fixed "8" to increate compatibility.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 1 Dec 2015 10:39:40 +0000 (18:39 +0800)]
btrfs: Fix no_space in write and rm loop
I see no_space in v4.4-rc1 again in xfstests generic/102.
It happened randomly in some node only.
(one of 4 phy-node, and a kvm with non-virtio block driver)
By bisect, we can found the first-bad is:
commit
bdced438acd8 ("block: setup bi_phys_segments after splitting")'
But above patch only triggered the bug by making bio operation
faster(or slower).
Main reason is in our space_allocating code, we need to commit
page writeback before wait it complish, this patch fixed above
bug.
BTW, there is another reason for generic/102 fail, caused by
disable default mixed-blockgroup, I'll fix it in xfstests.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 6 Jan 2016 10:56:36 +0000 (18:56 +0800)]
btrfs: merge functions for wait snapshot creation
wait_for_snapshot_creation() is in same group with oher two:
btrfs_start_write_no_snapshoting()
btrfs_end_write_no_snapshoting()
Rename wait_for_snapshot_creation() and move it into same place
with other two.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 6 Jan 2016 10:47:31 +0000 (18:47 +0800)]
btrfs: delete unused argument in btrfs_copy_from_user
size_t write_bytes is not necessary for btrfs_copy_from_user(),
delete it.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Tue, 15 Dec 2015 10:18:09 +0000 (18:18 +0800)]
btrfs: Use direct way to determine raid56 write/recover mode
Old code used bbio->raid_map to determine whether in raid56
write/recover operation, because we didn't't have bbio->map_type.
Now we have direct way for this condition, rid of using
the function-relative data, and make the code more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Zhao Lei [Wed, 9 Dec 2015 13:03:49 +0000 (21:03 +0800)]
btrfs: Small cleanup for get index_srcdev loop
1: Adjust condition in loop to make less TAB
2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Tue, 15 Dec 2015 01:14:37 +0000 (09:14 +0800)]
btrfs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
We already have such check but it's only in super block sys chunk
array.
Now check all on-disk chunks.
2) Chunk logical
It should be aligned to sector size.
This behavior should be *DOUBLE CHECKED* for 64K sector size like
PPC64 or AArch64.
Maybe we can found some hidden bugs.
3) Chunk length
Same as chunk logical, should be aligned to sector size.
4) Stripe length
It should be power of 2.
5) Chunk type
Any bit out of TYPE_MAS | PROFILE_MASK is invalid.
With all these much restrict rules, several fuzzed image reported in
mail list should no longer cause kernel panic.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Tue, 15 Dec 2015 01:14:36 +0000 (09:14 +0800)]
btrfs: Enhance super validation check
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
Not the old max/min valid check, but also check if it's a power of 2.
So some bogus number like 12K node size won't pass now.
2) Super flag check
For now, there is still some inconsistency between kernel and
btrfs-progs super flags.
And considering btrfs-progs may add new flags for super block, this
check will only output warning.
3) Better root alignment check
Now root bytenr is checked against sector size.
4) Move some check into btrfs_check_super_valid().
Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
And magic number check.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 15 Jan 2016 11:05:12 +0000 (11:05 +0000)]
Btrfs: fix deadlock running delayed iputs at transaction commit time
While running a stress test I ran into a deadlock when running the delayed
iputs at transaction time, which produced the following report and trace:
[ 886.399989] =============================================
[ 886.400871] [ INFO: possible recursive locking detected ]
[ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
[ 886.402384] ---------------------------------------------
[ 886.403182] fio/8277 is trying to acquire lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] but task is already holding lock:
[ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] other info that might help us debug this:
[ 886.403568] Possible unsafe locking scenario:
[ 886.403568]
[ 886.403568] CPU0
[ 886.403568] ----
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568] lock(&fs_info->delayed_iput_sem);
[ 886.403568]
[ 886.403568] *** DEADLOCK ***
[ 886.403568]
[ 886.403568] May be due to missing lock nesting notation
[ 886.403568]
[ 886.403568] 3 locks held by fio/8277:
[ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<
ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
[ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<
ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
[ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.403568]
[ 886.403568] stack backtrace:
[ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 886.403568]
0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
[ 886.403568]
ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
[ 886.403568]
0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
[ 886.403568] Call Trace:
[ 886.403568] [<
ffffffff8125d4fd>] dump_stack+0x4e/0x79
[ 886.403568] [<
ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
[ 886.403568] [<
ffffffff810c22db>] ? __module_address+0xdf/0x108
[ 886.403568] [<
ffffffff8108eb77>] lock_acquire+0x10d/0x194
[ 886.403568] [<
ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
[ 886.403568] [<
ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffff8148556b>] down_read+0x3e/0x4d
[ 886.489542] [<
ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<
ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
[ 886.489542] [<
ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
[ 886.489542] [<
ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
[ 886.489542] [<
ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
[ 886.489542] [<
ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
[ 886.489542] [<
ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
[ 886.489542] [<
ffffffff81188e31>] evict+0xa7/0x15c
[ 886.489542] [<
ffffffff81189878>] iput+0x1d3/0x266
[ 886.489542] [<
ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
[ 886.489542] [<
ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
[ 886.489542] [<
ffffffff81085096>] ? signal_pending_state+0x31/0x31
[ 886.489542] [<
ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
[ 886.489542] [<
ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 886.489542] [<
ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 886.489542] [<
ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 886.489542] [<
ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 886.489542] [<
ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 886.489542] [<
ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 886.489542] [<
ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 886.489542] [<
ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 886.489542] [<
ffffffff811734cc>] SyS_write+0x50/0x7e
[ 886.489542] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
[ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.863227] fio D
ffff880213f9bb28 0 8244 8240 0x00000000
[ 1081.868719]
ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
[ 1081.872499]
ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
[ 1081.876834]
ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
[ 1081.880782] Call Trace:
[ 1081.881793] [<
ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.883340] [<
ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.895525] [<
ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
[ 1081.897419] [<
ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1081.899251] [<
ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1081.901063] [<
ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1081.902365] [<
ffffffff814855bd>] down_write+0x43/0x57
[ 1081.903846] [<
ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.906078] [<
ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1081.908846] [<
ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
[ 1081.910409] [<
ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
[ 1081.912482] [<
ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
[ 1081.914597] [<
ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
[ 1081.919037] [<
ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
[ 1081.920754] [<
ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
[ 1081.922496] [<
ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
[ 1081.923922] [<
ffffffff8117279e>] __vfs_write+0x7c/0xa5
[ 1081.925275] [<
ffffffff81172cda>] vfs_write+0xa0/0xe4
[ 1081.926584] [<
ffffffff811734cc>] SyS_write+0x50/0x7e
[ 1081.927968] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1081.985293] INFO: lockdep is turned off.
[ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
[ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1
[ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1081.990147] fio D
ffff880218febbb8 0 8249 8240 0x00000000
[ 1081.991626]
ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
[ 1081.993258]
ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
[ 1081.994850]
ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
[ 1081.996485] Call Trace:
[ 1081.997037] [<
ffffffff81482ba4>] schedule+0x7f/0x97
[ 1081.998017] [<
ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
[ 1081.999241] [<
ffffffff810852a5>] ? finish_wait+0x6d/0x76
[ 1082.000306] [<
ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
[ 1082.001533] [<
ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
[ 1082.002776] [<
ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
[ 1082.003995] [<
ffffffff814855bd>] down_write+0x43/0x57
[ 1082.005000] [<
ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.007403] [<
ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
[ 1082.008988] [<
ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
[ 1082.010193] [<
ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
[ 1082.011280] [<
ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.012265] [<
ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
[ 1082.013021] [<
ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
[ 1082.013738] [<
ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
[ 1082.014778] [<
ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
[ 1082.015778] [<
ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
[ 1082.016806] [<
ffffffff8118b4de>] ? __fget_light+0x4d/0x71
[ 1082.017789] [<
ffffffff8118240e>] SyS_ioctl+0x57/0x79
[ 1082.018706] [<
ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
This happens because we can recursively acquire the semaphore
fs_info->delayed_iput_sem when attempting to allocate space to satisfy
a file write request as shown in the first trace above - when committing
a transaction we acquire (down_read) the semaphore before running the
delayed iputs, and when running a delayed iput() we can end up calling
an inode's eviction handler, which in turn commits another transaction
and attempts to acquire (down_read) again the semaphore to run more
delayed iput operations.
This results in a deadlock because if a task acquires multiple times a
semaphore it should invoke down_read_nested() with a different lockdep
class for each level of recursion.
Fix this by simplifying the implementation and use a mutex instead that
is acquired by the cleaner kthread before it runs the delayed iputs
instead of always acquiring a semaphore before delayed references are
run from anywhere.
Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput)
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 15 Jan 2016 10:56:15 +0000 (10:56 +0000)]
Btrfs: fix typo in log message when starting a balance
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 20 Jan 2016 02:21:30 +0000 (18:21 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 20 Jan 2016 02:21:00 +0000 (18:21 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Colin Ian King [Tue, 19 Jan 2016 00:05:28 +0000 (00:05 +0000)]
btrfs: remove duplicate const specifier
duplicate const is redundant so remove it
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sebastian Andrzej Siewior [Fri, 15 Jan 2016 13:37:15 +0000 (14:37 +0100)]
btrfs: initialize the seq counter in struct btrfs_device
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<
80307cec>] (dump_stack)
| [<
80070e98>] (__lock_acquire)
| [<
8007184c>] (lock_acquire)
| [<
80287800>] (btrfs_ioctl)
| [<
8012a8d4>] (do_vfs_ioctl)
| [<
8012ac14>] (SyS_ioctl)
so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.
Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter [Wed, 13 Jan 2016 12:21:17 +0000 (15:21 +0300)]
Btrfs: clean up an error code in btrfs_init_space_info()
If we return 1 here, then the caller treats it as an error and returns
-EINVAL. It causes a static checker warning to treat positive returns
as an error.
Fixes: 1aba86d67f34 ('Btrfs: fix easily get into ENOSPC in mixed case')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Wed, 13 Jan 2016 14:08:01 +0000 (22:08 +0800)]
btrfs: fix iterator with update error in backref.c
Fix the following error:
fs/btrfs/backref.c:565:1-20: iterator with update on line 577
Fixes: a7ca422('btrfs: use list_for_each_entry* in backref.c')
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tsutomu Itoh [Wed, 6 Jan 2016 08:03:40 +0000 (17:03 +0900)]
Btrfs: fix output of compression message in btrfs_parse_options()
The compression message might not be correctly output.
Fix it.
[[before fix]]
# mount -o compress /dev/sdb3 /test3
[ 996.874264] BTRFS info (device sdb3): disk space caching is enabled
[ 996.874268] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 1035.075017] BTRFS info (device sdb3): force zlib compression
[ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
[[after fix]]
# mount -o compress /dev/sdb3 /test3
[ 401.021753] BTRFS info (device sdb3): use zlib compression
[ 401.021758] BTRFS info (device sdb3): disk space caching is enabled
[ 401.021760] BTRFS: has skinny extents
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress-force /dev/sdb3 /test3
[ 439.824624] BTRFS info (device sdb3): force zlib compression
[ 439.824629] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
# mount -o remount,compress /dev/sdb3 /test3
[ 459.918430] BTRFS info (device sdb3): use zlib compression
[ 459.918434] BTRFS info (device sdb3): disk space caching is enabled
# mount | grep /test3
/dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chandan Rajendra [Thu, 7 Jan 2016 13:26:59 +0000 (18:56 +0530)]
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
The following call trace is seen when btrfs/031 test is executed in a loop,
[ 158.661848] ------------[ cut here ]------------
[ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[ 158.664102] BTRFS: Transaction aborted (error -2)
[ 158.664774] Modules linked in:
[ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted
4.4.0-rc6-g511711a #2
[ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 158.667392]
ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[ 158.668515]
ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[ 158.669647]
ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[ 158.670769] Call Trace:
[ 158.671153] [<
ffffffff81431fc8>] dump_stack+0x44/0x5c
[ 158.671884] [<
ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[ 158.672769] [<
ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[ 158.673620] [<
ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[ 158.674440] [<
ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[ 158.675376] [<
ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[ 158.676235] [<
ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[ 158.677268] [<
ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[ 158.678183] [<
ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[ 158.678975] [<
ffffffff81144b8e>] ? vma_merge+0xee/0x300
[ 158.679751] [<
ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[ 158.680599] [<
ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[ 158.681686] [<
ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[ 158.682581] [<
ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[ 158.683399] [<
ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[ 158.684297] [<
ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[ 158.685051] [<
ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 158.685958] ---[ end trace
4b63312de5a2cb76 ]---
[ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[ 158.709508] BTRFS info (device loop0): forced readonly
[ 158.737113] BTRFS info (device loop0): disk space caching is enabled
[ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
This occurs because,
Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
btrfs_drop_snapshot()
Add root corresponding to subvol 257 into
btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
257 is returned as the next free objectid
btrfs_read_fs_root_no_name()
Finds the btrfs_root instance corresponding to the old subvol with ID 257
in btrfs_fs_info->fs_roots_radix.
Returns error since btrfs_root_item->refs has the value of 0.
To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 3 Jun 2015 14:55:48 +0000 (10:55 -0400)]
btrfs: cleanup, stop casting for extent_map->lookup everywhere
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Chris Mason [Mon, 11 Jan 2016 16:39:28 +0000 (08:39 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 14:08:37 +0000 (06:08 -0800)]
Merge branch 'misc-cleanups-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 11 Jan 2016 13:59:32 +0000 (05:59 -0800)]
Merge branch 'misc-for-4.5' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Wed, 6 Jan 2016 22:42:35 +0000 (22:42 +0000)]
Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).
Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.
A reproducer test case for xfstests follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
# Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
# reserved for a boot loader to use (GRUB for example) and btrfs should never
# use them - neither for allocating metadata/data nor should trim/discard them.
# The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
$XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
$XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
# Now mount the filesystem and perform a fitrim against it.
_scratch_mount
_require_batched_discard $SCRATCH_MNT
$FSTRIM_PROG $SCRATCH_MNT
# Now unmount the filesystem and verify the content of the ranges was not
# modified (no trim/discard happened on them).
_scratch_unmount
echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
status=0
exit
Reported-by: Vincent Petry <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49f0 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Sam Tygier [Wed, 6 Jan 2016 08:46:12 +0000 (08:46 +0000)]
Btrfs: Check metadata redundancy on balance
When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single
Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Sat, 10 Oct 2015 15:59:53 +0000 (17:59 +0200)]
btrfs: statfs: report zero available if metadata are exhausted
There is one ENOSPC case that's very confusing. There's Available
greater than zero but no file operation succeds (besides removing
files). This happens when the metadata are exhausted and there's no
possibility to allocate another chunk.
In this scenario it's normal that there's still some space in the data
chunk and the calculation in df reflects that in the Avail value.
To at least give some clue about the ENOSPC situation, let statfs report
zero value in Avail, even if there's still data space available.
Current:
/dev/sdb1 4.0G 3.3G 719M 83% /mnt/test
New:
/dev/sdb1 4.0G 3.3G 0 100% /mnt/test
We calculate the remaining metadata space minus global reserve. If this
is (supposedly) smaller than zero, there's no space. But this does not
hold in practice, the exhausted state happens where's still some
positive delta. So we apply some guesswork and compare the delta to a 4M
threshold. (Practically observed delta was 2M.)
We probably cannot calculate the exact threshold value because this
depends on the internal reservations requested by various operations, so
some operations that consume a few metadata will succeed even if the
Avail is zero. But this is better than the other way around.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:03 +0000 (18:54 +0100)]
btrfs: preallocate path for snapshot creation at ioctl time
We can also preallocate btrfs_path that's used during pending snapshot
creation and avoid another late ENOMEM failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:54:00 +0000 (18:54 +0100)]
btrfs: allocate root item at snapshot ioctl time
The actual snapshot creation is delayed until transaction commit. If we
cannot get enough memory for the root item there, we have to fail the
whole transaction commit which is bad. So we'll allocate the memory at
the ioctl call and pass it along with the pending_snapshot struct. The
potential ENOMEM will be returned to the caller of snapshot ioctl.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 10 Nov 2015 17:53:56 +0000 (18:53 +0100)]
btrfs: do an allocation earlier during snapshot creation
We can allocate pending_snapshot earlier and do not have to do cleanup
in case of failure.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:45 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path locks
The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
* overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
* better packing in a slab page +6 objects
* the whole structure now fits to 2 cachelines
* slight decrease in code size:
text data bss dec hex filename
938731 43670 23144
1005545 f57e9 fs/btrfs/btrfs.ko.before
938203 43670 23144
1005017 f55d9 fs/btrfs/btrfs.ko.after
(and the generated assembly does not change much)
The main purpose is to decrease the size of the structure without
affecting performance. The byte access is usually well behaving accross
arches, the locks are not accessed frequently and sometimes just
compared to zero.
Note for further size reduction attempts: the slots could be made u16
but this might generate worse code on some arches (non-byte and non-int
access). Also the range of operations on slots is wider compared to
locks and the potential performance drop should be evaluated first.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:42 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path lowest_level
The level is 0..7, we can use smaller type. The size of btrfs_path is now
136 bytes from 144, which is +2 objects that fit into a 4k slab.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:38 +0000 (16:31 +0100)]
btrfs: use smaller type for btrfs_path reada
The possible values for reada are all positive and bounded, we can later
save some bytes by storing it in u8.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Nov 2015 15:31:35 +0000 (16:31 +0100)]
btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since
a717531942f488209dded30f6bc648167bcefa72
"Btrfs: do less aggressive btree readahead" (2009-01-22).
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:31 +0000 (11:42 +0100)]
btrfs: constify static arrays
There are a few statically initialized arrays that can be made const.
The remaining (like file_system_type, sysfs attributes or prop handlers)
do not allow that due to type mismatch when passed to the APIs or
because the structures are modified through other members.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:28 +0000 (11:42 +0100)]
btrfs: constify remaining structs with function pointers
* struct extent_io_ops
* struct btrfs_free_space_op
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 10:42:24 +0000 (11:42 +0100)]
btrfs tests: replace whole ops structure for free space tests
Preparatory work for making btrfs_free_space_op constant. In
test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with
own version thus preventing constification. We can rework it so we
replace the whole structure with the correct function pointers.
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Mon, 21 Dec 2015 15:50:23 +0000 (23:50 +0800)]
btrfs: use list_for_each_entry* in backref.c
Use list_for_each_entry*() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:17:00 +0000 (22:17 +0800)]
btrfs: use list_for_each_entry_safe in free-space-cache.c
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Geliang Tang [Fri, 18 Dec 2015 14:16:59 +0000 (22:16 +0800)]
btrfs: use list_for_each_entry* in check-integrity.c
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Mon, 14 Dec 2015 16:42:10 +0000 (01:42 +0900)]
Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '
268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 10:02:31 +0000 (11:02 +0100)]
btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 20:15:06 +0000 (20:15 +0000)]
btrfs: zero out delayed node upon allocation
It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 25 Oct 2015 19:35:44 +0000 (19:35 +0000)]
btrfs: pass proper enum type to start_transaction()
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Alexandru Moise [Sun, 18 Oct 2015 21:35:41 +0000 (21:35 +0000)]
btrfs: switch __btrfs_fs_incompat return type from int to bool
Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly
returning boolean not int.
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Byongho Lee [Tue, 19 May 2015 14:46:45 +0000 (23:46 +0900)]
btrfs: remove unused inode argument from uncompress_inline()
The inode argument is never used from the beginning, so remove it.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 8 Dec 2015 13:39:32 +0000 (14:39 +0100)]
btrfs: don't use slab cache for struct btrfs_delalloc_work
Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.
The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.
The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Dec 2015 17:09:12 +0000 (18:09 +0100)]
btrfs: drop duplicate prefix from scrub workqueues
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:09 +0000 (17:27 +0100)]
btrfs: verbose error when we find an unexpected item in sys_array
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 16:27:06 +0000 (17:27 +0100)]
btrfs: handle invalid num_stripes in sys_array
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.
A crafted or corrupted image crashes at mount time:
BTRFS: device fsid
9006933e-2a9a-44f0-917f-
514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted
4.2.5-00657-ge047887-dirty #25
Stack:
637af890 60062489 602aeb2e 604192ba
60387961 00000011 637af8a0 6038a835
637af9c0 6038776b 634ef32b 00000000
Call Trace:
[<
6001c86d>] show_stack+0xfe/0x15b
[<
6038a835>] dump_stack+0x2a/0x2c
[<
6038776b>] panic+0x13e/0x2b3
[<
6020f099>] btrfs_read_sys_array+0x25d/0x2ff
[<
601cfbbe>] open_ctree+0x192d/0x27af
[<
6019c2c1>] btrfs_mount+0x8f5/0xb9a
[<
600bc9a7>] mount_fs+0x11/0xf3
[<
600d5167>] vfs_kern_mount+0x75/0x11a
[<
6019bcb0>] btrfs_mount+0x2e4/0xb9a
[<
600bc9a7>] mount_fs+0x11/0xf3
[<
600d5167>] vfs_kern_mount+0x75/0x11a
[<
600d710b>] do_mount+0xa35/0xbc9
[<
600d7557>] SyS_mount+0x95/0xc8
[<
6001e884>] handle_syscall+0x6b/0x8e
Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org # 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 30 Nov 2015 15:51:29 +0000 (16:51 +0100)]
btrfs: better packing of btrfs_delayed_extent_op
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
and has 8 unused bytes. Reducing the level type to u8 makes it possible
to squeeze it to the padding byte after key. The bitfields were switched
to bool as there's space to store the full byte without increasing the
whole structure, besides that the generated assembly is smaller.
struct btrfs_delayed_extent_op {
struct btrfs_disk_key key; /* 0 17 */
u8 level; /* 17 1 */
bool update_key; /* 18 1 */
bool update_flags; /* 19 1 */
bool is_data; /* 20 1 */
/* XXX 3 bytes hole, try to pack */
u64 flags_to_set; /* 24 8 */
/* size: 32, cachelines: 1, members: 6 */
/* sum members: 29, holes: 1, sum holes: 3 */
/* last cacheline: 32 bytes */
};
The final size is 32 bytes which gives +26 object per slab page.
text data bss dec hex filename
938811 43670 23144
1005625 f5839 fs/btrfs/btrfs.ko.before
938747 43670 23144
1005561 f57f9 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 19 Nov 2015 13:15:51 +0000 (14:15 +0100)]
btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place
the list hook directly into the inode and save a kmalloc (killing a
__GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we
have to keep the count. This means we can't use the list_splice to
process a bunch of inodes because we'd lost track of the count if the
inode is put into the delayed iputs again while it's processed.
Signed-off-by: David Sterba <dsterba@suse.com>
Zhao Lei [Thu, 19 Nov 2015 09:26:22 +0000 (17:26 +0800)]
btrfs: Support convert to -d dup for btrfs-convert
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.
This patch remove limitation of above case.
Tested by following script:
(combination of dup conversion with fsck):
export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'
do_dup_test()
{
local m_from="$1"
local d_from="$2"
local m_to="$3"
local d_to="$4"
echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
umount "$TEST_DIR" &>/dev/null
./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
mount "$TEST_DEV" "$TEST_DIR" || return 1
cp -a /sbin/* "$TEST_DIR"
[[ "$m_from" != "$m_to" ]] && {
./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
}
[[ "$d_from" != "$d_to" ]] && {
local opt=()
[[ "$d_to" == single ]] && opt+=("-f")
./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
}
umount "$TEST_DIR" || return 1
./btrfsck "$TEST_DEV" || return 1
echo
return 0
}
test_all()
{
for m_from in single dup; do
for d_from in single dup; do
for m_to in single dup; do
for d_to in single dup; do
do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
done
done
done
done
}
test_all
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 22 Oct 2015 19:05:09 +0000 (15:05 -0400)]
Btrfs: igrab inode in writepage
We hit this panic on a few of our boxes this week where we have an
ordered_extent with an NULL inode. We do an igrab() of the inode in writepages,
but weren't doing it in writepage which can be called directly from the VM on
dirty pages. If the inode has been unlinked then we could have I_FREEING set
which means igrab() would return NULL and we get this panic. Fix this by trying
to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 7 Oct 2015 09:23:23 +0000 (17:23 +0800)]
Btrfs: add missing brelse when superblock checksum fails
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 5 Jan 2016 16:24:05 +0000 (16:24 +0000)]
Btrfs: fix transaction handle leak on failure to create hard link
If we failed to create a hard link we were not always releasing the
the transaction handle we got before, resulting in a memory leak and
preventing any other tasks from being able to commit the current
transaction.
Fix this by always releasing our transaction handle.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Filipe Manana [Thu, 31 Dec 2015 18:16:29 +0000 (18:16 +0000)]
Btrfs: fix number of transaction units required to create symlink
We weren't accounting for the insertion of an inline extent item for the
symlink inode nor that we need to update the parent inode item (through
the call to btrfs_add_nondir()). So fix this by including two more
transaction units.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:08:24 +0000 (18:08 +0000)]
Btrfs: don't leave dangling dentry if symlink creation failed
When we are creating a symlink we might fail with an error after we
created its inode and added the corresponding directory indexes to its
parent inode. In this case we end up never removing the directory indexes
because the inode eviction handler, called for our symlink inode on the
final iput(), only removes items associated with the symlink inode and
not with the parent inode.
Example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt
$ touch /mnt/foo
$ ln -s /mnt/foo /mnt/bar
ln: failed to create symbolic link ‘bar’: Cannot allocate memory
$ umount /mnt
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID:
d5acb5ba-31bd-42da-b456-
89dca2e716e1
checking extents
checking free space cache
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
found 131073 bytes used err is 1
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124305
file data blocks allocated: 262144
referenced 262144
btrfs-progs v4.2.3
So fix this by adding the directory index entries as the very last
step of symlink creation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Thu, 31 Dec 2015 18:07:59 +0000 (18:07 +0000)]
Btrfs: send, don't BUG_ON() when an empty symlink is found
When a symlink is successfully created it always has an inline extent
containing the source path. However if an error happens when creating
the symlink, we can leave in the subvolume's tree a symlink inode without
any such inline extent item - this happens if after btrfs_symlink() calls
btrfs_end_transaction() and before it calls the inode eviction handler
(through the final iput() call), the transaction gets committed and a
crash happens before the eviction handler gets called, or if a snapshot
of the subvolume is made before the eviction handler gets called. Sadly
we can't just avoid this by making btrfs_symlink() call
btrfs_end_transaction() after it calls the eviction handler, because the
later can commit the current transaction before it removes any items from
the subvolume tree (if it encounters ENOSPC errors while reserving space
for removing all the items).
So make send fail more gracefully, with an -EIO error, and print a
message to dmesg/syslog informing that there's an empty symlink inode,
so that the user can delete the empty symlink or do something else
about it.
Reported-by: Stephen R. van den Berg <srb@cuci.nl>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Filipe Manana [Wed, 30 Dec 2015 02:42:30 +0000 (02:42 +0000)]
Btrfs: fix race between free space endio workers and space cache writeout
While running a stress test I ran into the following trace/transaction
abort:
[471626.672243] ------------[ cut here ]------------
[471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
[471626.675492] BTRFS: Transaction aborted (error -2)
[471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
[471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1
[471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[471626.691901]
0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
[471626.695009]
ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
[471626.697490]
ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
[471626.699201] Call Trace:
[471626.699804] [<
ffffffff812566f4>] dump_stack+0x4e/0x79
[471626.701049] [<
ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[471626.702542] [<
ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.704326] [<
ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[471626.705636] [<
ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
[471626.707048] [<
ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
[471626.708616] [<
ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
[471626.709950] [<
ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
[471626.711286] [<
ffffffff81081c61>] ? signal_pending_state+0x31/0x31
[471626.712611] [<
ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
[471626.715610] [<
ffffffff811962a2>] ? SyS_tee+0x226/0x226
[471626.716718] [<
ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
[471626.717672] [<
ffffffff8116fc01>] iterate_supers+0x75/0xc2
[471626.718800] [<
ffffffff8119669a>] sys_sync+0x52/0x80
[471626.719990] [<
ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[471626.721835] ---[ end trace
baf57f43d76693f4 ]---
[471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
This is a very rare situation and it happened due to a race between a free
space endio worker and writing the space caches for dirty block groups at
a transaction's commit critical section. The steps leading to this are:
1) A task calls btrfs_commit_transaction() and starts the writeout of the
space caches for all currently dirty block groups (i.e. it calls
btrfs_start_dirty_block_groups());
2) The previous step starts writeback for space caches;
3) When the writeback finishes it queues jobs for free space endio work
queue (fs_info->endio_freespace_worker) that execute
btrfs_finish_ordered_io();
4) The task committing the transaction sets the transaction's state
to TRANS_STATE_COMMIT_DOING and shortly after calls
btrfs_write_dirty_block_groups();
5) A free space endio job joins the transaction, through
btrfs_join_transaction_nolock(), and updates a free space inode item
in the root tree through btrfs_update_inode_fallback();
6) Updating the free space inode item resulted in COWing one or more
nodes/leaves of the root tree, and that resulted in creating a new
metadata block group, which gets added to the transaction's list
of dirty block groups (this is a very rare case);
7) The free space endio job has not released yet its transaction handle
at this point, so the new metadata block group was not yet fully
created (didn't go through btrfs_create_pending_block_groups() yet);
8) The transaction commit task sees the new metadata block group in
the transaction's list of dirty block groups and processes it.
When it attempts to update the block group's block group item in
the extent tree, through write_one_cache_group(), it isn't able
to find it and aborts the transaction with error -ENOENT - this
is because the free space endio job hasn't yet released its
transaction handle (which calls btrfs_create_pending_block_groups())
and therefore the block group item was not yet added to the extent
tree.
Fix this waiting for free space endio jobs if we fail to find a block
group item in the extent tree and then retry once updating the block
group item.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Chris Mason [Wed, 30 Dec 2015 15:52:35 +0000 (07:52 -0800)]
btrfs: don't run delayed references while we are creating the free space tree
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.
Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 30 Dec 2015 15:37:26 +0000 (07:37 -0800)]
btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.
Merging in the free space tree deleted a variable needed when
CONFIG_BTRFS_DEBUG=y
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:30:51 +0000 (13:30 -0800)]
btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc
map->num_stripes really can't be zero, but just in case.
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:29:09 +0000 (13:29 -0800)]
Merge branch 'freespace-4.5' into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:28:35 +0000 (13:28 -0800)]
Merge branch 'for-chris-4.5' of git://git./linux/kernel/git/fdmanana/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:17:42 +0000 (13:17 -0800)]
Merge branch 'dev/simplify-set-bit' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Wed, 23 Dec 2015 21:11:27 +0000 (13:11 -0800)]
Merge branch 'dev/gfp-flags' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Chris Mason [Wed, 23 Dec 2015 21:10:26 +0000 (13:10 -0800)]
Merge branch 'cleanup/misc-simplify' of git://git./linux/kernel/git/kdave/linux into for-linus-4.5
Filipe Manana [Fri, 18 Dec 2015 03:02:48 +0000 (03:02 +0000)]
Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups
We call btrfs_write_dirty_block_groups() in the critical section of a
transaction's commit, when no other tasks can join the transaction and
add more block groups to the transaction's list of dirty block groups,
so we not taking the dirty block groups spinlock when checking for the
list's emptyness, grabbing its first element or deleting elements from
it.
However there's a special and rare case where we can have a concurrent
task adding elements to this list. We trigger writeback for space
caches before at btrfs_start_dirty_block_groups() and in past iterations
of the loop at btrfs_write_dirty_block_groups(), this means that when
the writeback finishes (which happens asynchronously) it creates a
task for the endio free space work queue that executes
btrfs_finish_ordered_io() - this function is able to join the transaction,
through btrfs_join_transaction_nolock(), and update the free space cache's
inode item in the root tree, which can result in COWing nodes of this tree
and therefore allocation of a new block group can happen, which gets added
to the transaction's list of dirty block groups while the transaction
commit task is operating on it concurrently.
So fix this by taking the dirty block groups spinlock before doing
operations on the dirty block groups list at
btrfs_write_dirty_block_groups().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Linus Torvalds [Mon, 21 Dec 2015 00:06:09 +0000 (16:06 -0800)]
Linux 4.4-rc6
Linus Torvalds [Sun, 20 Dec 2015 18:01:11 +0000 (10:01 -0800)]
Merge tag 'rtc-4.4-3' of git://git./linux/kernel/git/abelloni/linux
Pull RTC fixes from Alexandre Belloni:
"Late fixes for the RTC subsystem for 4.4:
A fix for a nasty hardware bug in rk808 and an initialization
reordering in da9063 to fix a possible crash"
* tag 'rtc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: da9063: fix access ordering error during RTC interrupt at system power on
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
Steve Twiss [Tue, 8 Dec 2015 16:28:39 +0000 (16:28 +0000)]
rtc: da9063: fix access ordering error during RTC interrupt at system power on
This fix alters the ordering of the IRQ and device registrations in the RTC
driver probe function. This change will apply to the RTC driver that supports
both DA9063 and DA9062 PMICs.
A problem could occur with the existing RTC driver if:
A system is started from a cold boot using the PMIC RTC IRQ to initiate a
power on operation. For instance, if an RTC alarm is used to start a
platform from power off.
The existing driver IRQ is requested before the device has been properly
registered.
i.e.
ret = devm_request_threaded_irq()
comes before
rtc->rtc_dev = devm_rtc_device_register();
In this case, the interrupt can be called before the device has been
registered and the handler can be called immediately. The IRQ handler
da9063_alarm_event() contains the function call
rtc_update_irq(rtc->rtc_dev, 1, RTC_IRQF | RTC_AF);
which in turn tries to access the unavailable rtc->rtc_dev.
The fix is to reorder the functions inside the RTC probe. The IRQ is
requested after the RTC device resource has been registered so that
get_irq_byname is the last thing to happen.
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Julius Werner [Tue, 15 Dec 2015 23:02:49 +0000 (15:02 -0800)]
rtc: rk808: Compensate for Rockchip calendar deviation on November 31st
In A.D. 1582 Pope Gregory XIII found that the existing Julian calendar
insufficiently represented reality, and changed the rules about
calculating leap years to account for this. Similarly, in A.D. 2013
Rockchip hardware engineers found that the new Gregorian calendar still
contained flaws, and that the month of November should be counted up to
31 days instead. Unfortunately it takes a long time for calendar changes
to gain widespread adoption, and just like more than 300 years went by
before the last Protestant nation implemented Greg's proposal, we will
have to wait a while until all religions and operating system kernels
acknowledge the inherent advantages of the Rockchip system. Until then
we need to translate dates read from (and written to) Rockchip hardware
back to the Gregorian format.
This patch works by defining Jan 1st, 2016 as the arbitrary anchor date
on which Rockchip and Gregorian calendars are in sync. From that we can
translate arbitrary later dates back and forth by counting the number
of November/December transitons since the anchor date to determine the
offset between the calendars. We choose this method (rather than trying
to regularly "correct" the date stored in hardware) since it's the only
way to ensure perfect time-keeping even if the system may be shut down
for an unknown number of years. The drawback is that other software
reading the same hardware (e.g. mainboard firmware) must use the same
translation convention (including the same anchor date) to be able to
read and write correct timestamps from/to the RTC.
Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Linus Torvalds [Sun, 20 Dec 2015 01:44:19 +0000 (17:44 -0800)]
Merge tag 'tty-4.4-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some tty/serial driver fixes for 4.4-rc6 that resolve some
reported problems. All of these have been in linux-next. The details
are in the shortlog"
* tag 'tty-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: Fix GPF in flush_to_ldisc()
serial: earlycon: Add missing spinlock initialization
serial: sh-sci: Fix length of scatterlist
n_tty: Fix poll() after buffer-limited eof push read
serial: 8250_uniphier: fix dl_read and dl_write functions
Linus Torvalds [Sun, 20 Dec 2015 01:33:58 +0000 (17:33 -0800)]
Merge tag 'usb-4.4-rc6' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB and PHY fixes for 4.4-rc6. All of them resolve some
reported problems. Full details in the shortlog"
* tag 'usb-4.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: fix invalid memory access in hub_activate()
USB: ipaq.c: fix a timeout loop
phy: core: Get a refcount to phy in devm_of_phy_get_by_index()
phy: cygnus: pcie: add missing of_node_put
phy: miphy365x: add missing of_node_put
phy: miphy28lp: add missing of_node_put
phy: rockchip-usb: add missing of_node_put
phy: berlin-sata: add missing of_node_put
phy: mt65xx-usb3: add missing of_node_put
phy: brcmstb-sata: add missing of_node_put
phy: sun9i-usb: add USB dependency
Linus Torvalds [Sun, 20 Dec 2015 00:46:46 +0000 (16:46 -0800)]
Merge tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Four fixes for md:
- two recently introduced regressions fixed.
- one older bug in RAID10 - tagged for -stable since 4.2
- one minor sysfs api improvement"
* tag 'md/4.4-rc5-fixes' of git://neil.brown.name/md:
Fix remove_and_add_spares removes drive added as spare in slot_store
md: fix bug due to nested suspend
MD: change journal disk role to disk 0
md/raid10: fix data corruption and crash during resync
Linus Torvalds [Sun, 20 Dec 2015 00:40:48 +0000 (16:40 -0800)]
Merge tag 'powerpc-4.4-5' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Partial revert of "powerpc: Individual System V IPC system calls"
- pr_warn_once on unsupported OPAL_MSG type from Stewart
- Fix deadlock in opal-irqchip introduced by "Fix double endian
conversion" from Alistair
* tag 'powerpc-4.4-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/opal-irqchip: Fix deadlock introduced by "Fix double endian conversion"
powerpc/powernv: pr_warn_once on unsupported OPAL_MSG type
Partial revert of "powerpc: Individual System V IPC system calls"
Linus Torvalds [Sat, 19 Dec 2015 18:10:43 +0000 (10:10 -0800)]
Merge tag 'spi-fix-v4.4-rc5' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A couple of reference counting bugs here, one in spidev and one with
holding an extra reference in the core that we never freed if we
removed a device, plus a driver specific fix. Both of the refcounting
bugs are very old but they've only been found by observation so
hopefully their impact has been low"
* tag 'spi-fix-v4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: fix parent-device reference leak
spi: spidev: Hold spi_lock over all defererences of spi in release()
spi-fsl-dspi: Fix CTAR Register access
Linus Torvalds [Sat, 19 Dec 2015 18:05:00 +0000 (10:05 -0800)]
Merge tag 'gpio-v4.4-3' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Some GPIO fixes for the v4.4 series. Most prominent: I revert the
error propagation from the .get() function until we can fix up all the
drivers properly for v4.5.
- Revert the error number propagation from the .get() vtable entry
temporarily, until we make the proper fixes to all drivers.
- Fix the clamping behaviour in the generic GPIO driver.
- Driver fix for the ath79 driver"
* tag 'gpio-v4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: revert get() to non-errorprogating behaviour
gpio: generic: clamp values from bgpio_get_set()
gpio: ath79: Fix the logic to clear offset bit of AR71XX_GPIO_REG_OE register