David Sterba [Fri, 9 Aug 2019 15:12:38 +0000 (17:12 +0200)]
btrfs: define separate btrfs_set/get_XX helpers
There are helpers for all type widths defined via macro and optionally
can use a token which is a cached pointer to avoid repeated mapping of
the extent buffer.
The token value is known at compile time, when it's valid it's always
address of a local variable, otherwise it's NULL passed by the
token-less helpers.
This can be utilized to remove some branching as the helpers are used
frequenlty.
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 27 Aug 2019 11:46:29 +0000 (14:46 +0300)]
btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref
btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_extref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 27 Aug 2019 11:46:28 +0000 (14:46 +0300)]
btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct
btrfs_find_name_in_backref returns either 0/1 depending on whether it
found a backref for the given name. If it returns true then the actual
inode_ref struct is returned in one of its parameters. That's pointless,
instead refactor the function such that it returns either a pointer
to the btrfs_inode_ref or NULL it it didn't find anything. This
streamlines the function calling convention.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 18:05:32 +0000 (20:05 +0200)]
btrfs: move dev_stats helpers to volumes.c
The other dev stats functions are already there and the helpers are not
used by anything else.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 17:57:04 +0000 (19:57 +0200)]
btrfs: move struct io_ctl to free-space-cache.h
The io_ctl structure is used for free space management, and used only by
the v1 space cache code, but unfortunatlly the full definition is
required by block-group.h so it can't be moved to free-space-cache.c
without additional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 17:12:59 +0000 (19:12 +0200)]
btrfs: move functions for tree compare to send.c
Send is the only user of tree_compare, we can move it there along with
the other helpers and definitions.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 17:16:27 +0000 (19:16 +0200)]
btrfs: rename and export read_node_slot
Preparatory work for code that will be moved out of ctree and uses this
function.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 17:06:17 +0000 (19:06 +0200)]
btrfs: move private raid56 definitions from ctree.h
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 16:54:28 +0000 (18:54 +0200)]
btrfs: move math functions to misc.h
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 21 Aug 2019 16:48:25 +0000 (18:48 +0200)]
btrfs: move cond_wake_up functions out of ctree
The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 27 Aug 2019 07:40:45 +0000 (15:40 +0800)]
btrfs: use proper error values on allocation failure in clone_fs_devices
Fix the fake ENOMEM return error code to the actual error in
clone_fs_devices().
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 27 Aug 2019 07:40:44 +0000 (15:40 +0800)]
btrfs: proper error handling when invalid device is found in find_next_devid
In a corrupted tree, if search for next devid finds the device with
devid = -1, then report the error -EUCLEAN back to the parent function
to fail gracefully.
The tree checker will not catch this in case the devids are created
using the following script:
umount /btrfs
dev1=/dev/sdb
dev2=/dev/sdc
mkfs.btrfs -fq -dsingle -msingle $dev1
mount $dev1 /btrfs
_fail()
{
echo $1
exit 1
}
while true; do
btrfs dev add -f $dev2 /btrfs || _fail "add failed"
btrfs dev del $dev1 /btrfs || _fail "del failed"
dev_tmp=$dev1
dev1=$dev2
dev2=$dev_tmp
done
With output:
BTRFS critical (device sdb): corrupt leaf: root=3 block=
313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
BTRFS error (device sdb): block=
313739198464 write time tree block corruption detected
BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
BTRFS warning (device sdb): Skipping commit of aborted transaction.
BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ add script and messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Christophe Leroy [Wed, 21 Aug 2019 15:05:55 +0000 (15:05 +0000)]
btrfs: fix allocation of free space cache v1 bitmap pages
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, it has been made a relation with the use of BTRFS
filesystem and with SLUB_DEBUG turned on.
[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38
Commit
69d2480456d1 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But allthough bitmaps
have the size of a page, they were allocated with kzalloc().
Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return unaligned pointer.
On powerpc, memcpy(), copy_page() and other copying functions use
'dcbz' instruction which provides an entire zeroed cacheline to avoid
memory read when the intention is to overwrite a full line. Functions
like memcpy() are writen to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
To fix it, allocate bitmaps through kmem_cache instead of using kzalloc()
The cache pool is created with PAGE_SIZE alignment constraint.
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 22 Aug 2019 02:14:15 +0000 (10:14 +0800)]
btrfs: Detect unbalanced tree with empty leaf before crashing btree operations
[BUG]
With crafted image, btrfs will panic at btree operations:
kernel BUG at fs/btrfs/ctree.c:3894!
invalid opcode: 0000 [#1] SMP PTI
CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
RIP: 0010:__push_leaf_left+0x6b6/0x6e0
RSP: 0018:
ffffc0bd4128b990 EFLAGS:
00010246
RAX:
0000000000000000 RBX:
ffffa0a4ab8f0e38 RCX:
0000000000000000
RDX:
ffffa0a280000000 RSI:
0000000000000000 RDI:
ffffa0a4b3814000
RBP:
ffffc0bd4128ba38 R08:
0000000000001000 R09:
ffffc0bd4128b948
R10:
0000000000000000 R11:
0000000000000000 R12:
0000000000000240
R13:
ffffa0a4b556fb60 R14:
ffffa0a4ab8f0af0 R15:
ffffa0a4ab8f0af0
FS:
0000000000000000(0000) GS:
ffffa0a4b7a00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007f2461c80020 CR3:
000000022b32a006 CR4:
00000000000206f0
Call Trace:
? _cond_resched+0x1a/0x50
push_leaf_left+0x179/0x190
btrfs_del_items+0x316/0x470
btrfs_del_csums+0x215/0x3a0
__btrfs_free_extent.isra.72+0x5a7/0xbe0
__btrfs_run_delayed_refs+0x539/0x1120
btrfs_run_delayed_refs+0xdb/0x1b0
btrfs_commit_transaction+0x52/0x950
? start_transaction+0x94/0x450
transaction_kthread+0x163/0x190
kthread+0x105/0x140
? btrfs_cleanup_transaction+0x560/0x560
? kthread_destroy_worker+0x50/0x50
ret_from_fork+0x35/0x40
Modules linked in:
---[ end trace
c2425e6e89b5558f ]---
[CAUSE]
The offending csum tree looks like this:
checksum tree key (CSUM_TREE ROOT_ITEM 0)
node
29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
...
key (EXTENT_CSUM EXTENT_CSUM
85975040) block
29630464 gen 17
key (EXTENT_CSUM EXTENT_CSUM
89911296) block
29642752 gen 17 <<<
key (EXTENT_CSUM EXTENT_CSUM
92274688) block
29646848 gen 17
...
leaf
29630464 items 6 free space 1 generation 17 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM
85975040) itemoff 3987 itemsize 8
range start
85975040 end
85983232 length 8192
...
leaf
29642752 items 0 free space 3995 generation 17 owner 0
^ empty leaf invalid owner ^
leaf
29646848 items 1 free space 602 generation 17 owner CSUM_TREE
item 0 key (EXTENT_CSUM EXTENT_CSUM
92274688) itemoff 627 itemsize 3368
range start
92274688 end
95723520 length
3448832
So we have a corrupted csum tree where one tree leaf is completely
empty, causing unbalanced btree, thus leading to unexpected btree
balance error.
[FIX]
For this particular case, we handle it in two directions to catch it:
- Check if the tree block is empty through btrfs_verify_level_key()
So that invalid tree blocks won't be read out through
btrfs_search_slot() and its variants.
- Check 0 tree owner in tree checker
NO tree is using 0 as its tree owner, detect it and reject at tree
block read time.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 26 Aug 2019 14:34:24 +0000 (17:34 +0300)]
btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag
Support for asynchronous snapshot creation was originally added in
72fd032e9424 ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
ceph's backend needs. However, since Ceph has deprecated support for
btrfs there is no longer need for that support in btrfs. Additionally,
this was never supported by btrfs-progs, the official userspace tools.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 22 Aug 2019 14:24:20 +0000 (17:24 +0300)]
btrfs: improve error handling in run_delalloc_nocow
Correctly handle failure cases when adding an ordered extents in case
of REGULAR or PREALLOC extents. Remove the BUG_ON.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 22 Aug 2019 14:25:23 +0000 (17:25 +0300)]
btrfs: comment and minor simplifications in run_delalloc_nocow
Add a comment explaining why we keep the BUG also use the already read
and cached value of extent ram bytes stored in 'ram_bytes'.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 5 Aug 2019 14:47:06 +0000 (17:47 +0300)]
btrfs: streamline code in run_delalloc_nocow in case of inline extents
The extent range check right after the "out_check" label is redundant,
because the only way it can trigger is if we have an inline extent. In
this case it makes more sense to actually move it in the branch
explictly dealing with inlines extents.
What's more, the nested 'if (nocow)' can never be true because for
inline extents we always do COW and there is no chance 'nocow' can be
true, just remove that check.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Mon, 5 Aug 2019 14:47:05 +0000 (17:47 +0300)]
btrfs: simplify extent type checks in run_delalloc_nocow
There is no point in checking the type of the extent again just to set
the 'type' variable, when this check has already been performed before.
Instead, extend the original if branch with an 'else' clause. This
allows to remove one local variable and make it obvious how the code
flow differs for prealloc/regular extents.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 21 Aug 2019 07:42:57 +0000 (10:42 +0300)]
btrfs: improve comments around nocow path
run_delalloc_nocow contains numerous, somewhat subtle, checks when
figuring out whether a particular extent should be CoW'ed or not. This
patch explicitly states the assumptions those checks verify. As a
result also document 2 of the more subtle checks in check_committed_ref
as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 21 Aug 2019 07:42:03 +0000 (10:42 +0300)]
btrfs: refactor variable scope in run_delalloc_nocow
Of the 22 (!!!) local variables declared in this function only 9 have
function-wide context. Of the remaining 13, 12 are needed in the main
while loop of the function and 1 is needed in a tiny if branch, only in
case we have prealloc extent. This commit reduces the lifespan of every
variable to its bare minimum. It also renames the 'nolock' boolean to
freespace_inode to clearly indicate its purpose.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 22 Aug 2019 19:14:34 +0000 (15:14 -0400)]
btrfs: only reserve metadata_size for inodes
Historically we reserved worst case for every btree operation, and
generally speaking we want to do that in cases where it could be the
worst case. However for updating inodes we know the inode items are
already in the tree, so it will only be an update operation and never an
insert operation. This allows us to always reserve only the
metadata_size amount for inode updates rather than the
insert_metadata_size amount.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 22 Aug 2019 19:14:33 +0000 (15:14 -0400)]
btrfs: rename the btrfs_calc_*_metadata_size helpers
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 9 Aug 2019 01:24:24 +0000 (09:24 +0800)]
btrfs: tree-checker: Add EXTENT_DATA_REF check
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its
key->offset.
This patch will check the following contents:
- Key->objectid
Basic alignment check.
- Hash
Hash of each extent_data_ref item must match key->offset.
- Offset
Basic alignment check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 9 Aug 2019 01:24:23 +0000 (09:24 +0800)]
btrfs: tree-checker: Add simple keyed refs check
For TREE_BLOCK_REF, SHARED_DATA_REF and SHARED_BLOCK_REF we need to
check:
| TREE_BLOCK_REF | SHARED_BLOCK_REF | SHARED_BLOCK_REF
--------------+----------------+-----------------+------------------
key->objectid | Alignment | Alignment | Alignment
key->offset | Any value | Alignment | Alignment
item_size | 0 | 0 | sizeof(le32) (*)
*: sizeof(struct btrfs_shared_data_ref)
So introduce a check to check all these 3 key types together.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 9 Aug 2019 01:24:22 +0000 (09:24 +0800)]
btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check
This patch introduces the ability to check extent items.
This check involves:
- key->objectid check
Basic alignment check.
- key->type check
Against btrfs_extent_item::type and SKINNY_METADATA feature.
- key->offset alignment check for EXTENT_ITEM
- key->offset check for METADATA_ITEM
- item size check
Both against minimal size and stepping check.
- btrfs_extent_item check
Checks its flags and generation.
- btrfs_extent_inline_ref checks
Against 4 types inline ref.
Checks bytenr alignment and tree level.
- btrfs_extent_item::refs check
Check against total refs found in inline refs.
This check would be the most complex single item check due to its nature
of inlined items.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dan Carpenter [Fri, 9 Aug 2019 14:07:39 +0000 (17:07 +0300)]
btrfs: fix error pointer check in __btrfs_map_block()
The btrfs_get_chunk_map() never returns NULL, it returns error pointers.
Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 21 Aug 2019 09:26:32 +0000 (17:26 +0800)]
btrfs: dev stat drop useless goto
In the function btrfs_init_dev_stats() goto out is not needed, because the
alloc has failed. So just return -ENOMEM.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 21 Aug 2019 09:26:33 +0000 (17:26 +0800)]
btrfs: dev stats item key conversion per cpu type is not needed
%found_key is not used, drop it since it hasn't been used since the
beginning in
733f4fbbc108 ("Btrfs: read device stats on mount, write
modified ones during commit").
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 21 Aug 2019 13:38:15 +0000 (16:38 +0300)]
btrfs: Make reada_tree_block_flagged private
This function is used only for the readahead machinery. It makes no
sense to keep it external to reada.c file. Place it above its sole
caller and make it static. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 9 Aug 2019 14:49:06 +0000 (16:49 +0200)]
btrfs: compression: replace set_level callbacks by a common helper
The set_level callbacks do not do anything special and can be replaced
by a helper that uses the levels defined in the tables.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 9 Aug 2019 14:25:34 +0000 (16:25 +0200)]
btrfs: define compression levels statically
The maximum and default levels do not change and can be defined
directly. The set_level callback was a temporary solution and will be
removed.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 12 Aug 2019 18:14:29 +0000 (19:14 +0100)]
Btrfs: fix use-after-free when using the tree modification log
At ctree.c:get_old_root(), we are accessing a root's header owner field
after we have freed the respective extent buffer. This results in an
use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC
is set, results in a stack trace like the following:
[ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1
[ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs]
(...)
[ 3876.799502] RSP: 0018:
ffff9f08c1a2f9f0 EFLAGS:
00010286
[ 3876.799518] RAX:
ffff8dd300000000 RBX:
ffff8dd85a7a9348 RCX:
000000038da26000
[ 3876.799538] RDX:
0000000000000000 RSI:
ffffe522ce368980 RDI:
0000000000000246
[ 3876.799559] RBP:
dae1922adadad000 R08:
0000000008020000 R09:
ffffe522c0000000
[ 3876.799579] R10:
ffff8dd57fd788c8 R11:
000000007511b030 R12:
ffff8dd781ddc000
[ 3876.799599] R13:
ffff8dd9e6240578 R14:
ffff8dd6896f7a88 R15:
ffff8dd688cf90b8
[ 3876.799620] FS:
00007f23ddd97700(0000) GS:
ffff8dda20200000(0000) knlGS:
0000000000000000
[ 3876.799643] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 3876.799660] CR2:
00007f23d4024000 CR3:
0000000710bb0005 CR4:
00000000003606f0
[ 3876.799682] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 3876.799703] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 3876.799723] Call Trace:
[ 3876.799735] ? do_raw_spin_unlock+0x49/0xc0
[ 3876.799749] ? _raw_spin_unlock+0x24/0x30
[ 3876.799779] resolve_indirect_refs+0x1eb/0xc80 [btrfs]
[ 3876.799810] find_parent_nodes+0x38d/0x1180 [btrfs]
[ 3876.799841] btrfs_check_shared+0x11a/0x1d0 [btrfs]
[ 3876.799870] ? extent_fiemap+0x598/0x6e0 [btrfs]
[ 3876.799895] extent_fiemap+0x598/0x6e0 [btrfs]
[ 3876.799913] do_vfs_ioctl+0x45a/0x700
[ 3876.799926] ksys_ioctl+0x70/0x80
[ 3876.799938] ? trace_hardirqs_off_thunk+0x1a/0x20
[ 3876.799953] __x64_sys_ioctl+0x16/0x20
[ 3876.799965] do_syscall_64+0x62/0x220
[ 3876.799977] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3876.799993] RIP: 0033:0x7f23e0013dd7
(...)
[ 3876.800056] RSP: 002b:
00007f23ddd96ca8 EFLAGS:
00000246 ORIG_RAX:
0000000000000010
[ 3876.800078] RAX:
ffffffffffffffda RBX:
00007f23d80210f8 RCX:
00007f23e0013dd7
[ 3876.800099] RDX:
00007f23d80210f8 RSI:
00000000c020660b RDI:
0000000000000003
[ 3876.800626] RBP:
000055fa2a2a2440 R08:
0000000000000000 R09:
00007f23ddd96d7c
[ 3876.801143] R10:
00007f23d8022000 R11:
0000000000000246 R12:
00007f23ddd96d80
[ 3876.801662] R13:
00007f23ddd96d78 R14:
00007f23d80210f0 R15:
00007f23ddd96d80
(...)
[ 3876.805107] ---[ end trace
e53161e179ef04f9 ]---
Fix that by saving the root's header owner field into a local variable
before freeing the root's extent buffer, and then use that local variable
when needed.
Fixes: 30b0463a9394d9 ("Btrfs: fix accessing the root pointer in tree mod log functions")
CC: stable@vger.kernel.org # 3.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 8 Aug 2019 04:32:44 +0000 (12:32 +0800)]
btrfs: replace: BTRFS_DEV_REPLACE_ITEM_STATE_x defines should go
The BTRFS_DEV_REPLACE_ITEM_STATE_x defines, as shown in [1], are
unused in both kernel and btrfs-progs (except for one instance of
BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED in kernel).
[1]
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED 2
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED 3
btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED 4
Further these define-values are different form its counterpart
BTRFS_IOCTL_DEV_REPLACE_STATE_x series as shown in [2].
[2]
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED 3
btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED 4
So this patch deletes the BTRFS_DEV_REPLACE_ITEM_STATE_x altogether, and
one instance of BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED is replaced
with BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED in the kernel.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 1 Aug 2019 22:19:37 +0000 (18:19 -0400)]
btrfs: introduce an evict flushing state
We have this weird space flushing loop inside inode.c for evict where
we'll do the normal LIMIT flush, and then commit the transaction and
hope we get our space. This is super janky, and in fact there's really
nothing stopping us from using FLUSH_ALL except that we run delayed
iputs, which means we could deadlock. So introduce a new flush state
for eviction that does the normal priority flushing with all of the
states that are safe for eviction.
The nice side-effect of this is that we'll try harder for evictions.
Previously if (for example generic/269) you had a bunch of other
operations happening on the fs you could race with those reservations
when committing the transaction, and eventually miss getting a
reservation for the evict. With this code we'll have our ticket in
place through the transaction commit, so any pinned bytes will go to our
pending evictions first.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 1 Aug 2019 22:19:36 +0000 (18:19 -0400)]
btrfs: refactor priority_reclaim_metadata_space
With the eviction flushing stuff we'll want to allow for different
states, but still work basically the same way that
priority_reclaim_metadata_space works currently. Refactor this to take
the flushing states and size as an argument so we can use the same logic
for limit flushing and eviction flushing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 1 Aug 2019 22:19:35 +0000 (18:19 -0400)]
btrfs: factor out the ticket flush handling
We're going to make this logic a little more complicated for evict, so
factor the ticket flushing/waiting code out of __reserve_metadata_bytes.
This has no functional change.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 1 Aug 2019 22:19:34 +0000 (18:19 -0400)]
btrfs: unify error handling for ticket flushing
Currently we handle the cleanup of errored out tickets in both the
priority flush path and the normal flushing path. This is the same code
in both places, so just refactor so we don't duplicate the cleanup work.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 1 Aug 2019 22:19:33 +0000 (18:19 -0400)]
btrfs: add a flush step for delayed iputs
Delayed iputs could very well free up enough space without needing to
commit the transaction, so make this step it's own step. This will
allow us to skip the step for evictions in a later patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:07 +0000 (15:38 -0400)]
btrfs: unexport the temporary exported functions
These were renamed and exported to facilitate logical migration of
different code chunks into block-group.c. Now that all the users are in
one file go ahead and rename them back, move the code around, and make
them static.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:06 +0000 (15:38 -0400)]
btrfs: migrate the block group cleanup code
This can now be easily migrated as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh on top of sysfs cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:05 +0000 (15:38 -0400)]
btrfs: migrate the alloc_profile helpers
These feel more at home in block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, adjust btrfs_get_alloc_profile exports ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:04 +0000 (15:38 -0400)]
btrfs: migrate the chunk allocation code
This feels more at home in block-group.c than in extent-tree.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>i
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:02 +0000 (15:38 -0400)]
btrfs: migrate the block group space accounting helpers
We can now easily migrate this code as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:01 +0000 (15:38 -0400)]
btrfs: export block group accounting helpers
Want to move these functions into block-group.c, so export them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:38:00 +0000 (15:38 -0400)]
btrfs: migrate the dirty bg writeout code
This can be easily migrated over now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:59 +0000 (15:37 -0400)]
btrfs: migrate inc/dec_block_group_ro code
This can easily be moved now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:58 +0000 (15:37 -0400)]
btrfs: temporarily export btrfs_get_restripe_target
This gets used by a few different logical chunks of the block group
code, export it while we move things around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:57 +0000 (15:37 -0400)]
btrfs: migrate the block group read/creation code
All of the prep work has been done so we can now cleanly move this chunk
over.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, add btrfs_get_alloc_profile export, comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:55 +0000 (15:37 -0400)]
btrfs: migrate the block group removal code
This is the removal code and the unused bgs code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ refresh, move clear_incompat_bg_bits ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:54 +0000 (15:37 -0400)]
btrfs: temporarily export inc_block_group_ro
This is used in a few logical parts of the block group code, temporarily
export it so we can move things in pieces.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Tue, 6 Aug 2019 14:43:19 +0000 (16:43 +0200)]
btrfs: migrate the block group caching code
We can now just copy it over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 2 Aug 2019 10:52:48 +0000 (12:52 +0200)]
btrfs: sysfs: move helper macros to sysfs.c
None of the macros is used outside of sysfs.c.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 2 Aug 2019 11:07:38 +0000 (13:07 +0200)]
btrfs: sysfs: move type conversion helpers to sysfs.c
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 17:46:20 +0000 (19:46 +0200)]
btrfs: cleanup kobject.h includes
The kobject should be pulled in via sysfs.h and that needs to include it
because it needs various definitions like kobj_attribute or kobject.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 16:50:16 +0000 (18:50 +0200)]
btrfs: factor out sysfs code for updating sprout fsid
Wrap the fsid renaming code and move it to sysfs.c.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 16:50:16 +0000 (18:50 +0200)]
btrfs: factor out sysfs code for deleting block group and space infos
The helpers to create block group and space info directories already
live in sysfs.c, move the deletion part there too.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 16:50:16 +0000 (18:50 +0200)]
btrfs: factor out sysfs code for sending device uevent
The device uevent belongs to the sysfs API.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 17:07:55 +0000 (19:07 +0200)]
btrfs: sysfs: replace direct access to feature set names with a helper
In order to unexport the feature type array, add a helper for the
enum-to-string conversion.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 15:55:55 +0000 (17:55 +0200)]
btrfs: sysfs: unexport space_info_ktype
The last non-sysfs usage of space_info_ktype has been moved to a private
helper in previous patch so the variable can be made static.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 16:50:16 +0000 (18:50 +0200)]
btrfs: factor out sysfs code for creating space infos
Move creation of data/metadata/system space info directories to sysfs.c.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 15:55:55 +0000 (17:55 +0200)]
btrfs: sysfs: unexport btrfs_raid_ktype
The last non-sysfs usage of btrfs_raid_ktype has been moved to a private
helper in previous patch so the variable can be made static.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 15:49:55 +0000 (17:49 +0200)]
btrfs: factor sysfs code out of link_block_group
The part of link_block_group that just creates the sysfs object is
independent and can be factored out to a helper.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 15:34:41 +0000 (17:34 +0200)]
btrfs: move sysfs declarations out of ctree.h
As the header for sysfs code already exists, use it to clean up ctree.h.
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 7 Aug 2019 08:21:20 +0000 (16:21 +0800)]
btrfs: opencode reset of all device stats
__btrfs_reset_dev_stats() is a small helper function to reset devices stat
values, and is used only once, instead just open code it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 7 Aug 2019 08:21:19 +0000 (16:21 +0800)]
btrfs: reset device stat using btrfs_dev_stat_set
btrfs_dev_stat_reset() is an overdo in terms of wrapping. So this patch
open codes btrfs_dev_stat_reset().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 6 Aug 2019 14:05:07 +0000 (22:05 +0800)]
btrfs: qgroup: Try our best to delete qgroup relations
When we try to delete qgroups, we're pretty cautious, we make sure both
qgroups exist and there is a relationship between them, then try to
delete the relation.
This behavior is OK, but the problem is we need to two relation items,
and if we failed the first item deletion, we error out, leaving the
other relation item in qgroup tree.
Sometimes the error from del_qgroup_relation_item() could just be
-ENOENT, thus we can ignore that error and continue without any problem.
Further more, such cautious behavior makes qgroup relation deletion
impossible for orphan relation items.
This patch will enhance __del_qgroup_relation():
- If both qgroups and their relation items exist
Go the regular deletion routine and update their accounting if needed.
- If any qgroup or relation item doesn't exist
Then we still try to delete the orphan items anyway, but don't trigger
the accounting update.
By this, we try our best to remove relation items, and can handle orphan
relation items properly, while still keep the existing behavior for good
qgroup tree.
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Hans van Kranenburg [Sat, 3 Aug 2019 21:36:34 +0000 (23:36 +0200)]
btrfs: clarify btrfs_ioctl_get_dev_stats padding
In commit
c11d2c236cc26 ("Btrfs: add ioctl to get and reset the device
stats") the get_dev_stats ioctl was added.
Shortly thereafter, in commit
b27f7c0c150f7 ("btrfs: join DEV_STATS
ioctls to one") , the flags field was added. However, the calculation
for unused padding space was not updated, which also invalidated the
comment.
Clarify what happened to reduce confusion and wasted time for anyone
implementing this.
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 5 Aug 2019 09:57:41 +0000 (10:57 +0100)]
Btrfs: make test_find_first_clear_extent_bit fail on incorrect results
If any call to find_first_clear_extent_bit() returns an unexpected result,
the test should fail and not just print an error message, otherwise it
makes detection of regressions much harder to notice.
Fixes: 1eaebb341d2b41 ("btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sat, 3 Aug 2019 08:53:16 +0000 (09:53 +0100)]
Btrfs: fix memory leaks in the test test_find_first_clear_extent_bit
The test creates an extent io tree and sets several ranges with the
CHUNK_ALLOCATED and CHUNK_TRIMMED bits, resulting in the allocation of
several extent state structures. However the test never clears those
ranges, resulting in memory leaks of the extent state structures.
This is detected when CONFIG_BTRFS_DEBUG is set once we remove the
btrfs module (rmmod btrfs):
[57399.787918] BTRFS: state leak: start
67108864 end
75497471 state 1 in tree 1 refs 1
[57399.790155] BTRFS: state leak: start
33554432 end
67108863 state 33 in tree 1 refs 1
[57399.791941] BTRFS: state leak: start
1048576 end
4194303 state 33 in tree 1 refs 1
[57399.793753] BTRFS: state leak: start
67108864 end
75497471 state 1 in tree 1 refs 1
[57399.795188] BTRFS: state leak: start
33554432 end
67108863 state 33 in tree 1 refs 1
[57399.796453] BTRFS: state leak: start
1048576 end
4194303 state 33 in tree 1 refs 1
[57399.797765] BTRFS: state leak: start
67108864 end
75497471 state 1 in tree 1 refs 1
[57399.799049] BTRFS: state leak: start
33554432 end
67108863 state 33 in tree 1 refs 1
[57399.800142] BTRFS: state leak: start
1048576 end
4194303 state 33 in tree 1 refs 1
[57399.801126] BTRFS: state leak: start
67108864 end
75497471 state 1 in tree 1 refs 1
[57399.802106] BTRFS: state leak: start
33554432 end
67108863 state 33 in tree 1 refs 1
[57399.803119] BTRFS: state leak: start
1048576 end
4194303 state 33 in tree 1 refs 1
[57399.804153] BTRFS: state leak: start
67108864 end
75497471 state 1 in tree 1 refs 1
[57399.805196] BTRFS: state leak: start
33554432 end
67108863 state 33 in tree 1 refs 1
[57399.806191] BTRFS: state leak: start
1048576 end
4194303 state 33 in tree 1 refs 1
The start and end offsets reported correspond exactly to the ranges
used by the test.
So fix that by clearing all the ranges when the test finishes.
Fixes: 1eaebb341d2b41 ("btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 13 Jun 2019 15:27:36 +0000 (17:27 +0200)]
btrfs: delete debugfs code
Replaced by the sysfs exports that provide a more fine grained interface
for filesystem debugging.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 13 Jun 2019 15:23:02 +0000 (17:23 +0200)]
btrfs: sysfs: add debugging exports
Add 'debug' directories to global sysfs and per-filesystem. This will
replace the debugfs directory. The sysfs location is simpler and builds
on top of the existing file hierarchy so there will hopefully be no more
questions about the sample debugfs file.
The directory is called 'debug' and only present under
CONFIG_BTRFS_DEBUG so this will not affect productions builds.
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:52 +0000 (15:37 -0400)]
btrfs: make caching_thread use btrfs_find_next_key
extent-tree.c has a find_next_key that just walks up the path to find
the next key, but it is used for both the caching stuff and the snapshot
delete stuff. The snapshot deletion stuff is special so it can't really
use btrfs_find_next_key, but the caching thread stuff can. We just need
to fix btrfs_find_next_key to deal with ->skip_locking and then it works
exactly the same as the private find_next_key helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:51 +0000 (15:37 -0400)]
btrfs: temporarily export fragment_free_space
This is used in caching and reading block groups, so export it while we
move these chunks independently.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:50 +0000 (15:37 -0400)]
btrfs: export the caching control helpers
Man a lot of people use this stuff.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:49 +0000 (15:37 -0400)]
btrfs: export the excluded extents helpers
We'll need this to move the caching stuff around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:48 +0000 (15:37 -0400)]
btrfs: export the block group caching helpers
This will make it so we can move them easily.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ coding style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:47 +0000 (15:37 -0400)]
btrfs: migrate nocow and reservation helpers
These are relatively straightforward as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:46 +0000 (15:37 -0400)]
btrfs: migrate the block group ref counting stuff
Another easy set to move over to block-group.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:45 +0000 (15:37 -0400)]
btrfs: migrate the block group lookup code
Move these bits first as they are the easiest to move. Export two of
the helpers so they can be moved all at once.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:44 +0000 (15:37 -0400)]
btrfs: move basic block_group definitions to their own header
This is prep work for moving all of the block group cache code into its
own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 20 Jun 2019 19:37:43 +0000 (15:37 -0400)]
btrfs: move btrfs_add_free_space out of a header file
This is prep work for moving block_group_cache around. Having this in
the header file makes the header file include need to be in a certain
order, which is awkward, so just move it into free-space-cache.c and
then we can re-arrange later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 12:50:35 +0000 (14:50 +0200)]
btrfs: tree-log: use symbolic name for first replay stage
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 12:50:33 +0000 (14:50 +0200)]
btrfs: async-thread: convert defines to enums
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 12:50:30 +0000 (14:50 +0200)]
btrfs: tree-log: convert defines to enums
Used only for in-memory state tracking.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 17:53:04 +0000 (19:53 +0200)]
btrfs: remove unused key type set/get helpers
The switch to open coded set/get has happend long time ago in
962a298f3511 ("btrfs: kill the key type accessor helpers"), remove the
stray helpers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 1 Aug 2019 17:53:02 +0000 (19:53 +0200)]
btrfs: remove unused btrfs_device::flush_bio_sent
The status of flush bio is tracked as a status bit, changed in commit
1c3063b6dbfa ("btrfs: cleanup device states define
BTRFS_DEV_STATE_FLUSH_SENT"), the flush_bio_sent was forgotten.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 2 Jul 2019 14:23:07 +0000 (15:23 +0100)]
Btrfs: remove unnecessary condition in btrfs_clone() to avoid too much nesting
The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(),
is done inside an if statement that checks if the found key has the type
BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because
btrfs_search_slot() always leaves us in a leaf slot that points to a key
that is always greater then or equals to the search key, and our search
key here has that type, and because we bail out before that if statement
if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY.
Therefore just remove that if statement, not only because it is useless
but mostly because it increases the nesting/indentation level in this
function which is quite big and makes things a bit awkward whenever I need
to fix something that requires changing btrfs_clone() (and it has been
like that for many years already).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 3 Jul 2019 12:32:59 +0000 (15:32 +0300)]
btrfs: Refactor btrfs_calc_avail_data_space
Simplify the code by removing variables that don't bring any real value
as well as simplifying the checks when buidling the candidate list of
devices. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 23 May 2019 11:51:26 +0000 (14:51 +0300)]
btrfs: Remove unnecessary check from join_running_log_trans
join_running_log_trans checks btrfs_root::log_root outside of
btrfs_root::log_mutex to avoid contention on the mutex. Turns out this
check is not necessary because the two callers of join_running_log_trans
(both of which deal with removing entries from the tree-log during
unlink) explicitly check whether the respective inode has been logged in
the current transaction.
If it hasn't then it won't have any items in the tree-log and call path
will return before calling join_running_log_trans. If the check passes,
however, then it's guaranteed that btrfs_root::log_root is set because
the inode is logged.
Those guarantees allows us to remove the speculative as well as the
implicity and tricky memory barrier.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 4 Jul 2019 15:25:00 +0000 (16:25 +0100)]
Btrfs: wake up inode cache waiters sooner to reduce waiting time
If we need to start an inode caching thread, because none currently exists
on disk, we can wake up all waiters as soon as we mark the range starting
at root's highest objectid + 1 and ending at BTRFS_LAST_FREE_OBJECTID as
free, so that they don't need to wait for the caching thread to start and
do some progress. We follow the same approach within the caching thread,
since as soon as it finds a free range and marks it as free space in the
cache, it wakes up all waiters. So improve this by adding such a wakeup
call after marking that initial range as free space.
Fixes: a47d6b70e28040 ("Btrfs: setup free ino caching in a more asynchronous way")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 4 Jul 2019 15:24:44 +0000 (16:24 +0100)]
Btrfs: fix inode cache waiters hanging on path allocation failure
If the caching thread fails to allocate a path, it returns without waking
up any cache waiters, leaving them hang forever. Fix this by following the
same approach as when we fail to start the caching thread: print an error
message, disable inode caching and make the wakers fallback to non-caching
mode behaviour (calling btrfs_find_free_objectid()).
Fixes: 581bb050941b4f ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 4 Jul 2019 15:24:32 +0000 (16:24 +0100)]
Btrfs: fix inode cache waiters hanging on failure to start caching thread
If we fail to start the inode caching thread, we print an error message
and disable the inode cache, however we never wake up any waiters, so they
hang forever waiting for the caching to finish. Fix this by waking them
up and have them fallback to a call to btrfs_find_free_objectid().
Fixes: e60efa84252c05 ("Btrfs: avoid triggering bug_on() when we fail to start inode caching task")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 4 Jul 2019 15:24:19 +0000 (16:24 +0100)]
Btrfs: fix inode cache block reserve leak on failure to allocate data space
If we failed to allocate the data extent(s) for the inode space cache, we
were bailing out without releasing the previously reserved metadata. This
was triggering the following warnings when unmounting a filesystem:
$ cat -n fs/btrfs/inode.c
(...)
9268 void btrfs_destroy_inode(struct inode *inode)
9269 {
(...)
9276 WARN_ON(BTRFS_I(inode)->block_rsv.reserved);
9277 WARN_ON(BTRFS_I(inode)->block_rsv.size);
(...)
9281 WARN_ON(BTRFS_I(inode)->csum_bytes);
9282 WARN_ON(BTRFS_I(inode)->defrag_bytes);
(...)
Several fstests test cases triggered this often, such as generic/083,
generic/102, generic/172, generic/269 and generic/300 at least, producing
stack traces like the following in dmesg/syslog:
[82039.079546] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9276 btrfs_destroy_inode+0x203/0x270 [btrfs]
(...)
[82039.081543] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1
[82039.081912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[82039.082673] RIP: 0010:btrfs_destroy_inode+0x203/0x270 [btrfs]
(...)
[82039.083913] RSP: 0018:
ffffac0b426a7d30 EFLAGS:
00010206
[82039.084320] RAX:
ffff8ddf77691158 RBX:
ffff8dde29b34660 RCX:
0000000000000002
[82039.084736] RDX:
0000000000000000 RSI:
0000000000000001 RDI:
ffff8dde29b34660
[82039.085156] RBP:
ffff8ddf5fbec000 R08:
0000000000000000 R09:
0000000000000000
[82039.085578] R10:
ffffac0b426a7c90 R11:
ffffffffb9aad768 R12:
ffffac0b426a7db0
[82039.086000] R13:
ffff8ddf5fbec0a0 R14:
dead000000000100 R15:
0000000000000000
[82039.086416] FS:
00007f8db96d12c0(0000) GS:
ffff8de036b00000(0000) knlGS:
0000000000000000
[82039.086837] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[82039.087253] CR2:
0000000001416108 CR3:
00000002315cc001 CR4:
00000000003606e0
[82039.087672] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[82039.088089] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[82039.088504] Call Trace:
[82039.088918] destroy_inode+0x3b/0x70
[82039.089340] btrfs_free_fs_root+0x16/0xa0 [btrfs]
[82039.089768] btrfs_free_fs_roots+0xd8/0x160 [btrfs]
[82039.090183] ? wait_for_completion+0x65/0x1a0
[82039.090607] close_ctree+0x172/0x370 [btrfs]
[82039.091021] generic_shutdown_super+0x6c/0x110
[82039.091427] kill_anon_super+0xe/0x30
[82039.091832] btrfs_kill_super+0x12/0xa0 [btrfs]
[82039.092233] deactivate_locked_super+0x3a/0x70
[82039.092636] cleanup_mnt+0x3b/0x80
[82039.093039] task_work_run+0x93/0xc0
[82039.093457] exit_to_usermode_loop+0xfa/0x100
[82039.093856] do_syscall_64+0x162/0x1d0
[82039.094244] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[82039.094634] RIP: 0033:0x7f8db8fbab37
(...)
[82039.095876] RSP: 002b:
00007ffdce35b468 EFLAGS:
00000246 ORIG_RAX:
00000000000000a6
[82039.096290] RAX:
0000000000000000 RBX:
0000560d20b00060 RCX:
00007f8db8fbab37
[82039.096700] RDX:
0000000000000001 RSI:
0000000000000000 RDI:
0000560d20b00240
[82039.097110] RBP:
0000560d20b00240 R08:
0000560d20b00270 R09:
0000000000000015
[82039.097522] R10:
00000000000006b4 R11:
0000000000000246 R12:
00007f8db94bce64
[82039.097937] R13:
0000000000000000 R14:
0000000000000000 R15:
00007ffdce35b6f0
[82039.098350] irq event stamp: 0
[82039.098750] hardirqs last enabled at (0): [<
0000000000000000>] 0x0
[82039.099150] hardirqs last disabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.099545] softirqs last enabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.099925] softirqs last disabled at (0): [<
0000000000000000>] 0x0
[82039.100292] ---[ end trace
f2521afa616ddccc ]---
[82039.100707] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9277 btrfs_destroy_inode+0x1ac/0x270 [btrfs]
(...)
[82039.103050] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1
[82039.103428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[82039.104203] RIP: 0010:btrfs_destroy_inode+0x1ac/0x270 [btrfs]
(...)
[82039.105461] RSP: 0018:
ffffac0b426a7d30 EFLAGS:
00010206
[82039.105866] RAX:
ffff8ddf77691158 RBX:
ffff8dde29b34660 RCX:
0000000000000002
[82039.106270] RDX:
0000000000000000 RSI:
0000000000000001 RDI:
ffff8dde29b34660
[82039.106673] RBP:
ffff8ddf5fbec000 R08:
0000000000000000 R09:
0000000000000000
[82039.107078] R10:
ffffac0b426a7c90 R11:
ffffffffb9aad768 R12:
ffffac0b426a7db0
[82039.107487] R13:
ffff8ddf5fbec0a0 R14:
dead000000000100 R15:
0000000000000000
[82039.107894] FS:
00007f8db96d12c0(0000) GS:
ffff8de036b00000(0000) knlGS:
0000000000000000
[82039.108309] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[82039.108723] CR2:
0000000001416108 CR3:
00000002315cc001 CR4:
00000000003606e0
[82039.109146] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[82039.109567] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[82039.109989] Call Trace:
[82039.110405] destroy_inode+0x3b/0x70
[82039.110830] btrfs_free_fs_root+0x16/0xa0 [btrfs]
[82039.111257] btrfs_free_fs_roots+0xd8/0x160 [btrfs]
[82039.111675] ? wait_for_completion+0x65/0x1a0
[82039.112101] close_ctree+0x172/0x370 [btrfs]
[82039.112519] generic_shutdown_super+0x6c/0x110
[82039.112988] kill_anon_super+0xe/0x30
[82039.113439] btrfs_kill_super+0x12/0xa0 [btrfs]
[82039.113861] deactivate_locked_super+0x3a/0x70
[82039.114278] cleanup_mnt+0x3b/0x80
[82039.114685] task_work_run+0x93/0xc0
[82039.115083] exit_to_usermode_loop+0xfa/0x100
[82039.115476] do_syscall_64+0x162/0x1d0
[82039.115863] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[82039.116254] RIP: 0033:0x7f8db8fbab37
(...)
[82039.117463] RSP: 002b:
00007ffdce35b468 EFLAGS:
00000246 ORIG_RAX:
00000000000000a6
[82039.117882] RAX:
0000000000000000 RBX:
0000560d20b00060 RCX:
00007f8db8fbab37
[82039.118330] RDX:
0000000000000001 RSI:
0000000000000000 RDI:
0000560d20b00240
[82039.118743] RBP:
0000560d20b00240 R08:
0000560d20b00270 R09:
0000000000000015
[82039.119159] R10:
00000000000006b4 R11:
0000000000000246 R12:
00007f8db94bce64
[82039.119574] R13:
0000000000000000 R14:
0000000000000000 R15:
00007ffdce35b6f0
[82039.119987] irq event stamp: 0
[82039.120387] hardirqs last enabled at (0): [<
0000000000000000>] 0x0
[82039.120787] hardirqs last disabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.121182] softirqs last enabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.121563] softirqs last disabled at (0): [<
0000000000000000>] 0x0
[82039.121933] ---[ end trace
f2521afa616ddccd ]---
[82039.122353] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9278 btrfs_destroy_inode+0x1bc/0x270 [btrfs]
(...)
[82039.124606] CPU: 2 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1
[82039.125008] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[82039.125801] RIP: 0010:btrfs_destroy_inode+0x1bc/0x270 [btrfs]
(...)
[82039.126998] RSP: 0018:
ffffac0b426a7d30 EFLAGS:
00010202
[82039.127399] RAX:
ffff8ddf77691158 RBX:
ffff8dde29b34660 RCX:
0000000000000002
[82039.127803] RDX:
0000000000000001 RSI:
0000000000000001 RDI:
ffff8dde29b34660
[82039.128206] RBP:
ffff8ddf5fbec000 R08:
0000000000000000 R09:
0000000000000000
[82039.128611] R10:
ffffac0b426a7c90 R11:
ffffffffb9aad768 R12:
ffffac0b426a7db0
[82039.129020] R13:
ffff8ddf5fbec0a0 R14:
dead000000000100 R15:
0000000000000000
[82039.129428] FS:
00007f8db96d12c0(0000) GS:
ffff8de036b00000(0000) knlGS:
0000000000000000
[82039.129846] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[82039.130261] CR2:
0000000001416108 CR3:
00000002315cc001 CR4:
00000000003606e0
[82039.130684] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[82039.131142] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[82039.131561] Call Trace:
[82039.131990] destroy_inode+0x3b/0x70
[82039.132417] btrfs_free_fs_root+0x16/0xa0 [btrfs]
[82039.132844] btrfs_free_fs_roots+0xd8/0x160 [btrfs]
[82039.133262] ? wait_for_completion+0x65/0x1a0
[82039.133688] close_ctree+0x172/0x370 [btrfs]
[82039.134157] generic_shutdown_super+0x6c/0x110
[82039.134575] kill_anon_super+0xe/0x30
[82039.134997] btrfs_kill_super+0x12/0xa0 [btrfs]
[82039.135415] deactivate_locked_super+0x3a/0x70
[82039.135832] cleanup_mnt+0x3b/0x80
[82039.136239] task_work_run+0x93/0xc0
[82039.136637] exit_to_usermode_loop+0xfa/0x100
[82039.137029] do_syscall_64+0x162/0x1d0
[82039.137418] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[82039.137812] RIP: 0033:0x7f8db8fbab37
(...)
[82039.139059] RSP: 002b:
00007ffdce35b468 EFLAGS:
00000246 ORIG_RAX:
00000000000000a6
[82039.139475] RAX:
0000000000000000 RBX:
0000560d20b00060 RCX:
00007f8db8fbab37
[82039.139890] RDX:
0000000000000001 RSI:
0000000000000000 RDI:
0000560d20b00240
[82039.140302] RBP:
0000560d20b00240 R08:
0000560d20b00270 R09:
0000000000000015
[82039.140719] R10:
00000000000006b4 R11:
0000000000000246 R12:
00007f8db94bce64
[82039.141138] R13:
0000000000000000 R14:
0000000000000000 R15:
00007ffdce35b6f0
[82039.141597] irq event stamp: 0
[82039.142043] hardirqs last enabled at (0): [<
0000000000000000>] 0x0
[82039.142443] hardirqs last disabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.142839] softirqs last enabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.143220] softirqs last disabled at (0): [<
0000000000000000>] 0x0
[82039.143588] ---[ end trace
f2521afa616ddcce ]---
[82039.167472] WARNING: CPU: 3 PID: 13167 at fs/btrfs/extent-tree.c:10120 btrfs_free_block_groups+0x30d/0x460 [btrfs]
(...)
[82039.173800] CPU: 3 PID: 13167 Comm: umount Tainted: G W 5.2.0-rc4-btrfs-next-50 #1
[82039.174847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[82039.177031] RIP: 0010:btrfs_free_block_groups+0x30d/0x460 [btrfs]
(...)
[82039.180397] RSP: 0018:
ffffac0b426a7dd8 EFLAGS:
00010206
[82039.181574] RAX:
ffff8de010a1db40 RBX:
ffff8de010a1db40 RCX:
0000000000170014
[82039.182711] RDX:
ffff8ddff4380040 RSI:
ffff8de010a1da58 RDI:
0000000000000246
[82039.183817] RBP:
ffff8ddf5fbec000 R08:
0000000000000000 R09:
0000000000000000
[82039.184925] R10:
ffff8de036404380 R11:
ffffffffb8a5ea00 R12:
ffff8de010a1b2b8
[82039.186090] R13:
ffff8de010a1b2b8 R14:
0000000000000000 R15:
dead000000000100
[82039.187208] FS:
00007f8db96d12c0(0000) GS:
ffff8de036b80000(0000) knlGS:
0000000000000000
[82039.188345] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[82039.189481] CR2:
00007fb044005170 CR3:
00000002315cc006 CR4:
00000000003606e0
[82039.190674] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[82039.191829] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[82039.192978] Call Trace:
[82039.194160] close_ctree+0x19a/0x370 [btrfs]
[82039.195315] generic_shutdown_super+0x6c/0x110
[82039.196486] kill_anon_super+0xe/0x30
[82039.197645] btrfs_kill_super+0x12/0xa0 [btrfs]
[82039.198696] deactivate_locked_super+0x3a/0x70
[82039.199619] cleanup_mnt+0x3b/0x80
[82039.200559] task_work_run+0x93/0xc0
[82039.201505] exit_to_usermode_loop+0xfa/0x100
[82039.202436] do_syscall_64+0x162/0x1d0
[82039.203339] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[82039.204091] RIP: 0033:0x7f8db8fbab37
(...)
[82039.206360] RSP: 002b:
00007ffdce35b468 EFLAGS:
00000246 ORIG_RAX:
00000000000000a6
[82039.207132] RAX:
0000000000000000 RBX:
0000560d20b00060 RCX:
00007f8db8fbab37
[82039.207906] RDX:
0000000000000001 RSI:
0000000000000000 RDI:
0000560d20b00240
[82039.208621] RBP:
0000560d20b00240 R08:
0000560d20b00270 R09:
0000000000000015
[82039.209285] R10:
00000000000006b4 R11:
0000000000000246 R12:
00007f8db94bce64
[82039.209984] R13:
0000000000000000 R14:
0000000000000000 R15:
00007ffdce35b6f0
[82039.210642] irq event stamp: 0
[82039.211306] hardirqs last enabled at (0): [<
0000000000000000>] 0x0
[82039.211971] hardirqs last disabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.212643] softirqs last enabled at (0): [<
ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
[82039.213304] softirqs last disabled at (0): [<
0000000000000000>] 0x0
[82039.213875] ---[ end trace
f2521afa616ddccf ]---
Fix this by releasing the reserved metadata on failure to allocate data
extent(s) for the inode cache.
Fixes: 69fe2d75dd91d0 ("btrfs: make the delalloc block rsv per inode")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 4 Jul 2019 15:24:09 +0000 (16:24 +0100)]
Btrfs: fix hang when loading existing inode cache off disk
If we are able to load an existing inode cache off disk, we set the state
of the cache to BTRFS_CACHE_FINISHED, but we don't wake up any one waiting
for the cache to be available. This means that anyone waiting for the
cache to be available, waiting on the condition that either its state is
BTRFS_CACHE_FINISHED or its available free space is greather than zero,
can hang forever.
This could be observed running fstests with MOUNT_OPTIONS="-o inode_cache",
in particular test case generic/161 triggered it very frequently for me,
producing a trace like the following:
[63795.739712] BTRFS info (device sdc): enabling inode map caching
[63795.739714] BTRFS info (device sdc): disk space caching is enabled
[63795.739716] BTRFS info (device sdc): has skinny extents
[64036.653886] INFO: task btrfs-transacti:3917 blocked for more than 120 seconds.
[64036.654079] Not tainted 5.2.0-rc4-btrfs-next-50 #1
[64036.654143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64036.654232] btrfs-transacti D 0 3917 2 0x80004000
[64036.654239] Call Trace:
[64036.654258] ? __schedule+0x3ae/0x7b0
[64036.654271] schedule+0x3a/0xb0
[64036.654325] btrfs_commit_transaction+0x978/0xae0 [btrfs]
[64036.654339] ? remove_wait_queue+0x60/0x60
[64036.654395] transaction_kthread+0x146/0x180 [btrfs]
[64036.654450] ? btrfs_cleanup_transaction+0x620/0x620 [btrfs]
[64036.654456] kthread+0x103/0x140
[64036.654464] ? kthread_create_worker_on_cpu+0x70/0x70
[64036.654476] ret_from_fork+0x3a/0x50
[64036.654504] INFO: task xfs_io:3919 blocked for more than 120 seconds.
[64036.654568] Not tainted 5.2.0-rc4-btrfs-next-50 #1
[64036.654617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[64036.654685] xfs_io D 0 3919 3633 0x00000000
[64036.654691] Call Trace:
[64036.654703] ? __schedule+0x3ae/0x7b0
[64036.654716] schedule+0x3a/0xb0
[64036.654756] btrfs_find_free_ino+0xa9/0x120 [btrfs]
[64036.654764] ? remove_wait_queue+0x60/0x60
[64036.654809] btrfs_create+0x72/0x1f0 [btrfs]
[64036.654822] lookup_open+0x6bc/0x790
[64036.654849] path_openat+0x3bc/0xc00
[64036.654854] ? __lock_acquire+0x331/0x1cb0
[64036.654869] do_filp_open+0x99/0x110
[64036.654884] ? __alloc_fd+0xee/0x200
[64036.654895] ? do_raw_spin_unlock+0x49/0xc0
[64036.654909] ? do_sys_open+0x132/0x220
[64036.654913] do_sys_open+0x132/0x220
[64036.654926] do_syscall_64+0x60/0x1d0
[64036.654933] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this by adding a wake_up() call right after setting the cache state to
BTRFS_CACHE_FINISHED, at start_caching(), when we are able to load the
cache from disk.
Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 16 Jul 2019 09:00:34 +0000 (17:00 +0800)]
btrfs: tree-checker: Add ROOT_ITEM check
This patch will introduce ROOT_ITEM check, which includes:
- Key->objectid and key->offset check
Currently only some easy check, e.g. 0 as rootid is invalid.
- Item size check
Root item size is fixed.
- Generation checks
Generation, generation_v2 and last_snapshot should not be greater than
super generation + 1
- Level and alignment check
Level should be in [0, 7], and bytenr must be aligned to sector size.
- Flags check
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203261
Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 16 Jul 2019 09:00:33 +0000 (17:00 +0800)]
btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type
[BUG]
With fuzzed image and MIXED_GROUPS super flag, we can hit the following
BUG_ON():
kernel BUG at fs/btrfs/delayed-ref.c:491!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 1849 Comm: sync Tainted: G O 5.2.0-custom #27
RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
Call Trace:
add_delayed_ref_head+0x20c/0x2d0 [btrfs]
btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
btrfs_free_tree_block+0x123/0x380 [btrfs]
__btrfs_cow_block+0x435/0x500 [btrfs]
btrfs_cow_block+0x110/0x240 [btrfs]
btrfs_search_slot+0x230/0xa00 [btrfs]
? __lock_acquire+0x105e/0x1e20
btrfs_insert_empty_items+0x67/0xc0 [btrfs]
alloc_reserved_file_extent+0x9e/0x340 [btrfs]
__btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
? kvm_clock_read+0x18/0x30
? __sched_clock_gtod_offset+0x21/0x50
btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
btrfs_run_delayed_refs+0x23/0x30 [btrfs]
btrfs_commit_transaction+0x53/0x9f0 [btrfs]
btrfs_sync_fs+0x7c/0x1c0 [btrfs]
? __ia32_sys_fdatasync+0x20/0x20
sync_fs_one_sb+0x23/0x30
iterate_supers+0x95/0x100
ksys_sync+0x62/0xb0
__ia32_sys_sync+0xe/0x20
do_syscall_64+0x65/0x240
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
This situation is caused by several factors:
- Fuzzed image
The extent tree of this fs missed one backref for extent tree root.
So we can allocated space from that slot.
- MIXED_BG feature
Super block has MIXED_BG flag.
- No mixed block groups exists
All block groups are just regular ones.
This makes data space_info->block_groups[] contains metadata block
groups. And when we reserve space for data, we can use space in
metadata block group.
Then we hit the following file operations:
- fallocate
We need to allocate data extents.
find_free_extent() choose to use the metadata block to allocate space
from, and choose the space of extent tree root, since its backref is
missing.
This generate one delayed ref head with is_data = 1.
- extent tree update
We need to update extent tree at run_delayed_ref time.
This generate one delayed ref head with is_data = 0, for the same
bytenr of old extent tree root.
Then we trigger the BUG_ON().
[FIX]
The quick fix here is to check block_group->flags before using it.
The problem can only happen for MIXED_GROUPS fs. Regular filesystems
won't have space_info with DATA|METADATA flag, and no way to hit the
bug.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255
Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 16 Jul 2019 09:00:32 +0000 (17:00 +0800)]
btrfs: delayed-inode: Kill the BUG_ON() in btrfs_delete_delayed_dir_index()
There is one report of fuzzed image which leads to BUG_ON() in
btrfs_delete_delayed_dir_index().
Although that fuzzed image can already be addressed by enhanced
extent-tree error handler, it's still better to hunt down more BUG_ON().
This patch will hunt down two BUG_ON()s in
btrfs_delete_delayed_dir_index():
- One for error from btrfs_delayed_item_reserve_metadata()
Instead of BUG_ON(), we output an error message and free the item.
And return the error.
All callers of this function handles the error by aborting current
trasaction.
- One for possible EEXIST from __btrfs_add_delayed_deletion_item()
That function can return -EEXIST.
We already have a good enough error message for that, only need to
clean up the reserved metadata space and allocated item.
To help above cleanup, also modifiy __btrfs_remove_delayed_item() called
in btrfs_release_delayed_item(), to skip unassociated item.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 19 Jul 2019 06:51:44 +0000 (14:51 +0800)]
btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate()
[BUG]
Test case btrfs/156 fails since commit
302167c50b32 ("btrfs: don't end
the transaction for delayed refs in throttle") with ENOSPC.
[CAUSE]
The ENOSPC is reported from btrfs_can_relocate().
This function will check:
- If this block group is empty, we can relocate
- If we can enough free space, we can relocate
Above checks are valid but the following check is vague due to its
implementation:
- If and only if we can allocated a new block group to contain all the
used space, we can relocate
This design itself is OK, but the way to determine if we can allocate a
new block group is problematic.
btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
device.
However find_free_dev_extent() only searches commit root and excludes
dev extents allocated in current trans, this makes it unable to use dev
extent just freed in current transaction.
So for the following example, btrfs_can_relocate() will report ENOSPC:
The example block group layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
// = Used bg, consider all bg is 100% used for easy calculation.
And all block groups are SINGLE, on-disk bytenr is the same as the
logical bytenr.
1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
1M 129M 257M 385M 513M 550M
|///////| |//////////|/////////|
In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
However transid 100 is not committed yet, so in dev commit tree, we
still have the old dev extents layout:
1M 129M 257M 385M 513M 550M
|///////|///////////|//////////| | |
2) Try to relocate bg [257M, 385M)
We goes into btrfs_can_relocate(), no free space in current bgs, so we
check if we can find large enough free dev extents.
The first slot is [385M, 513M), but that is already used by new bg at
[385M, 513M), so we continue search.
The remaining slot is [512M, 550M), smaller than the bg's length 128M.
So btrfs_can_relocate report ENOSPC.
However this is over killed, in fact if we just skip btrfs_can_relocate()
check, and go into regular relocation routine, at extent reservation time,
if we can't find free extent, then we fallback to commit transaction,
which will free up the dev extents and allow new block group to be created.
[FIX]
The fix here is to remove btrfs_can_relocate() completely.
If we hit the false ENOSPC case just like btrfs/156, extent allocator
will push harder by committing transaction and we will have space for
new block group, avoiding the false ENOSPC.
If we really ran out of space, we will hit ENOSPC at
relocate_block_group(), and btrfs will just reports the ENOSPC error as
usual.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 19 Jul 2019 06:51:43 +0000 (14:51 +0800)]
btrfs: extent-tree: Add comment for inc_block_group_ro()
inc_block_group_ro() is only designed to mark one block group read-only,
it doesn't really care if other block groups have enough free space to
contain the used space in the block group.
However due to the close connection between this function and
relocation, sometimes we can be confused and think this function is
responsible for balance space reservation, which is not true.
Add some comment to make the functionality clear.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>