Josef Bacik [Mon, 24 Jul 2017 19:14:26 +0000 (15:14 -0400)]
btrfs: increase ctx->pos for delayed dir index
Our dir_context->pos is supposed to hold the next position we're
supposed to look. If we successfully insert a delayed dir index we
could end up with a duplicate entry because we don't increase ctx->pos
after doing the dir_emit.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Mon, 24 Jul 2017 19:14:25 +0000 (15:14 -0400)]
btrfs: fix readdir deadlock with pagefault
Readdir does dir_emit while under the btree lock. dir_emit can trigger
the page fault which means we can deadlock. Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.
Thread A
readdir <holding tree lock>
dir_emit
<page fault>
down_read(mmap_sem)
Thread B
mmap write
down_write(mmap_sem)
page_mkwrite
wait_ordered_extents
Process C
finish_ordered_extent
insert_reserved_file_extent
try to lock leaf <hang>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 22 Jun 2017 13:51:48 +0000 (09:51 -0400)]
btrfs: Simplify math in should_alloc chunk
Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to
signify the total amount of bytes in this space info. However, given
Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation
of num_allocated it becomes a lot more clear to just eliminate num_bytes
altogether and add the bytes_readonly to the amount of used space. That
way we don't change the results of the following statements. In the
process also start using btrfs_space_info_used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 22 Jun 2017 13:51:47 +0000 (09:51 -0400)]
btrfs: account for pinned bytes in should_alloc_chunk
In a heavy write scenario, we can end up with a large number of pinned bytes.
This can translate into (very) premature ENOSPC because pinned bytes
must be accounted for when allowing a reservation but aren't accounted for
when deciding whether to create a new chunk.
This patch adds the accounting to should_alloc_chunk so that we can
create the chunk.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 17 Jul 2017 16:11:10 +0000 (18:11 +0200)]
btrfs: prepare for extensions in compression options
This is a minimal patch intended to be backported to older kernels.
We're going to extend the string specifying the compression method and
this would fail on kernels before that change (the string is compared
exactly).
Relax the string matching only to the prefix, ie. ignoring anything that
goes after "zlib" or "lzo", regardless of th format extension we decide
to use. This applies to the mount options and properties.
That way, patched old kernels could be booted on systems already
utilizing the new compression spec.
Applicable since commit
63541927c8d11, v3.14.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 17 Jul 2017 18:42:03 +0000 (20:42 +0200)]
btrfs: allow defrag compress to override NOCOMPRESS attribute
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a
given file, except when the mount is force-compress. As users have
reported on IRC, this will also prevent compression when requested by
defrag (btrfs fi defrag -c file).
The nocompress flag is set automatically by filesystem when the ratios
are bad and the user would have to manually drop the bit in order to
make defrag -c work. This is not good from the usability perspective.
This patch will raise priority for the defrag -c over nocompress, ie.
any file with NOCOMPRESS bit set will get defragmented. The bit will
remain untouched.
Alternate option was to also drop the nocompress bit and keep the
decision logic as is, but I think this is not the right solution.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 17 Jul 2017 18:01:59 +0000 (20:01 +0200)]
btrfs: defrag: cleanup checking for compression status
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 17 Jul 2017 17:41:31 +0000 (19:41 +0200)]
btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and
property. Previously, a single variable was used and this caused clashes
when the per-file 'compression' was set and a defrag -c was called.
The property-compression is loaded when the file is open, defrag will
overwrite the same variable and reset to 0 (ie. NONE) at when the file
defragmentaion is finished. That's considered a usability bug.
Now we won't touch the property value, use the defrag-compression. The
precedence of defrag is higher than for property (and whole-filesystem).
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Mon, 17 Jul 2017 17:17:20 +0000 (19:17 +0200)]
btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag
and set via properties. This will fix a usability bug when defrag will
reset compression type to NONE. If the file has compression set via
property, it will not apply anymore (until next mount or reset through
command line).
We're going to fix that by adding another variable just for the defrag
call and won't touch the property. The defrag will have higher priority
when deciding whether to compress the data.
Signed-off-by: David Sterba <dsterba@suse.com>
Timofey Titovets [Mon, 17 Jul 2017 13:52:58 +0000 (16:52 +0300)]
Btrfs: add skeleton code for compression heuristic
Add skeleton code for compresison heuristics. Now it iterates over all
the pages, but in the end always says "yes, compress please", ie it does
not change the current behaviour.
In the future we're going to add various heuristics to analyze the data.
This patch can be used as a baseline for measuring if the effectivness
and performance.
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhanced changelog, modified comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 19 Jul 2017 17:30:41 +0000 (19:30 +0200)]
btrfs: account that we're waiting for IO in scrub_submit_raid56_bio_wait
Correctly account for IO when waiting for a submitted bio in scrub. This
only for the accounting purposes and should not change other behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 19 Jul 2017 17:26:45 +0000 (19:26 +0200)]
btrfs: account that we're waiting for DIO read
Correctly account for IO when waiting for a submitted DIO read, the case
when we're retrying. This only for the accounting purposes and should
not change other behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 01:59:40 +0000 (03:59 +0200)]
btrfs: drop chunk locks at the end of close_ctree
The pinned chunks might be left over so we clean them but at this point
of close_ctree, there's noone to race with, the locking can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 01:35:28 +0000 (03:35 +0200)]
btrfs: remove trivial wrapper btrfs_force_ra
It's a simple call page_cache_sync_readahead, same arguments in the same
order.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 01:28:55 +0000 (03:28 +0200)]
btrfs: drop ancient page flag mappings
There's no PageFsMisc. Added by patch
4881ee5a2e995 in 2008, the flag is
not present in current kernels.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 00:19:11 +0000 (02:19 +0200)]
btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 25 Jul 2017 14:48:28 +0000 (17:48 +0300)]
btrfs: Make flush_space return void
The return value of flush_space was used to have significance in the
early days when the code was first introduced and before the ticketed
enospc rework. Since the latter got introduced the return value lost any
significance whatsoever to its callers. So let's remove it. While at it
also remove the unused ticket variable in
btrfs_async_reclaim_metadata_space. It was used in the initial version
of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem
with this and fixed it in
ce129655c9d9 ("btrfs: introduce tickets_id to
determine whether asynchronous metadata reclaim work makes progress").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 26 Jul 2017 08:26:28 +0000 (11:26 +0300)]
btrfs: Deprecate userspace transaction ioctls
Userspace transactions were introduced in commit
6bf13c0cc833 ("Btrfs:
transaction ioctls") to provide semantics that Ceph's object store
required. However, things have changed significantly since then, to the
point where btrfs is no longer suitable as a backend for ceph and in
fact it's actively advised against such usages. Considering this, there
doesn't seem to be a widespread, legit use case of userspace
transaction. They also clutter the file->private pointer.
So to end the agony let's nuke the userspace transaction ioctls. As a
first step let's give time for people to voice their objection by just
WARN()ining when the userspace transaction is used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move the warning past perm checks, keep the has-been-printed state;
we're ok with just one warning over all filesystems ]
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 15 Jun 2017 23:48:05 +0000 (01:48 +0200)]
btrfs: use named constant for bdev blocksize
Superblock is read and written using buffer heads, we need to set the
bdev blocksize. The magic constant has been hardcoded in several places,
so replace it with a named constant.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 15 Jun 2017 22:50:33 +0000 (00:50 +0200)]
btrfs: split write_dev_supers to two functions
There are two independent parts, one that writes the superblocks and
another that waits for completion. No functional changes, but cleanups,
reformatting and comment updates.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 15 Jun 2017 17:51:51 +0000 (19:51 +0200)]
btrfs: refactor find_device helper
Polish the helper:
* drop underscores, no special meaning here
* pass fs_devices, as this is what the API implements
* drop noinline, no apparent reason for such simple helper
* constify uuid
* add comment
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 14 Jun 2017 00:48:07 +0000 (02:48 +0200)]
btrfs: merge alloc_device helpers
There are two helpers called in chain from one location, we can merge the
functionaliy.
Originally, alloc_fs_devices could fill the device uuid randomly if we
we didn't give the uuid buffer. This happens for seed devices but the
fsid is generated in btrfs_prepare_sprout, so we can remove it.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jun 2017 17:14:26 +0000 (19:14 +0200)]
btrfs: merge REQ_OP and REQ_ flags to one parameter in submit_extent_page
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jun 2017 17:03:49 +0000 (19:03 +0200)]
btrfs: cleanup types storing REQ_*
Unify types of local variables and parameters that store various
REQ_* values to unsigned int.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 29 Jun 2017 16:37:49 +0000 (18:37 +0200)]
btrfs: get fs_info from eb in btrfs_print_tree, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 29 Jun 2017 16:37:49 +0000 (18:37 +0200)]
btrfs: get fs_info from eb in btrfs_print_leaf, remove argument
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 14 Jun 2017 14:28:42 +0000 (16:28 +0200)]
btrfs: simplify btrfs_dev_replace_kthread
This function prints an informative message and then continues
dev-replace. The message contains a progress percentage which is read
from the status. The status is allocated dynamically, about 2600 bytes,
just to read the single value. That's an overkill. We'll use the new
helper and drop the allocation.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 14 Jun 2017 14:24:56 +0000 (16:24 +0200)]
btrfs: factor reading progress out of btrfs_dev_replace_status
We'll want to read the percentage value from dev_replace elsewhere, move
the logic to a separate helper.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 01:22:58 +0000 (03:22 +0200)]
btrfs: defrag: make readahead state allocation failure non-fatal
All sorts of readahead errors are not considered fatal. We can continue
defragmentation without it, with some potential slow down, which will
last only for the current inode.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 01:13:02 +0000 (03:13 +0200)]
btrfs: use GFP_KERNEL in btrfs_defrag_file
We can safely use GFP_KERNEL, the function is called from two contexts:
- ioctl handler, called directly, no locks taken
- cleaner thread, running all queued defrag work, outside of any locks
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Jun 2017 00:26:54 +0000 (02:26 +0200)]
btrfs: use GFP_KERNEL in mount and remount
We don't need to restrict the allocation flags in btrfs_mount or
_remount. No big filesystem locks are held (possibly s_umount but that
does no count here).
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 13 Jul 2017 11:11:07 +0000 (14:11 +0300)]
btrfs: Remove never reached error handling code in __add_reloc_root
One of the error handling paths in __add_reloc_root contains btrfs_panic()
followed by some other code. As the name implies what it does is print
some error message and call BUG, naturally what follow afterwards is not
invoked. So remove this extra code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 19 Jul 2017 07:48:42 +0000 (10:48 +0300)]
btrfs: Remove unused parameters from volume.c functions
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.
btrfs_full_stripe_len's mapping_tree - introduced by
53b381b3abeb
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit
btrfs_is_parity_mirror's mirror_num - same as above
chunk_drange_filter's chunk_offset - introduced by
94e60d5a5c4b ("Btrfs:
devid subset filter") and never used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 19 Jul 2017 07:47:57 +0000 (10:47 +0300)]
btrfs: Remove unused variables
clear_super - usage was removed in commit
cea67ab92d3d ("btrfs: clean
the old superblocks before freeing the device") but that change forgot
to remove the actual variable.
max_key - commit
6174d3cb43aa ("Btrfs: remove unused max_key arg from
btrfs_search_forward") removed the max_key parameter but it forgot to
remove references from callers.
stripe_len - this one was added by
e06cd3dd7cea ("Btrfs: add validadtion
checks for chunk loading") but even then it wasn't used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 14 Jul 2017 06:55:41 +0000 (09:55 +0300)]
btrfs: Remove find_raid56_stripe_len
find_raid56_stripe_len statically returns SZ_64K which equals BTRFS_STRIPE_LEN.
It's sole caller is __btrfs_alloc_chunk and it assigns the return value to ai
variable which is already set to BTRFS_STRIPE_LEN. So remove the function
invocation altogether and remove the function itself. Also remove the variable
since it's only aliasing BTRFS_STRIPE_LEN and use the define directly. Use
the occassion to simplify the rounding down of stripe_size now that the value
we want it to align is a power of 2.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 18 Jul 2017 12:39:08 +0000 (15:39 +0300)]
btrfs: Use explicit round_down macro in btrfs resize ioctl handler
No functional changes, just make the code more self-explanatory.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Tue, 18 Jul 2017 09:37:05 +0000 (17:37 +0800)]
btrfs: btrfs_inherit_iflags() can be static
btrfs_new_inode() is the only consumer move it to inode.c,
from ioctl.c.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nick Terrell [Thu, 29 Jun 2017 17:57:26 +0000 (10:57 -0700)]
btrfs: Keep one more workspace around
find_workspace() allocates up to num_online_cpus() + 1 workspaces.
free_workspace() will only keep num_online_cpus() workspaces. When
(de)compressing we will allocate num_online_cpus() + 1 workspaces, then
free one, and repeat. Instead, we can just keep num_online_cpus() + 1
workspaces around, and never have to allocate/free another workspace in the
common case.
I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
a workspace was allocated of freed. Then I copied vmlinux (527 MB) to the
partition. Before the patch, during the copy it would allocate and free 5-6
workspaces. After, it only allocated the initial 3. This held true for lzo,
zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync
dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
for zstd compression.
Signed-off-by: Nick Terrell <terrelln@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 13 Jul 2017 13:32:18 +0000 (15:32 +0200)]
btrfs: drop newlines from strings when using btrfs_* helpers
The helpers append "\n" so we can keep the actual strings shorter. The
extra newline will print an empty line. Some messages have been
slightly modified to be more consistent with the rest (lowercase first
letter).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 12 Jul 2017 06:42:19 +0000 (09:42 +0300)]
btrfs: qgroups: Fix BUG_ON condition in tree level check
The current code was erroneously checking for
root_level > BTRFS_MAX_LEVEL. If we had a root_level of 8 then the check
won't trigger and we could potentially hit a buffer overflow. The
correct check should be root_level >= BTRFS_MAX_LEVEL .
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Mar 2017 01:34:42 +0000 (09:34 +0800)]
btrfs: Enhance message when a device is missing during mount
For a missing device, btrfs will just refuse to mount with almost
meaningless kernel message like:
BTRFS info (device vdb6): disk space caching is enabled
BTRFS info (device vdb6): has skinny extents
BTRFS error (device vdb6): failed to read the system array: -5
BTRFS error (device vdb6): open_ctree failed
This patch will print a new message about the missing device:
BTRFS info (device vdb6): disk space caching is enabled
BTRFS info (device vdb6): has skinny extents
BTRFS warning (device vdb6): devid 2 uuid
80470722-cad2-4b90-b7c3-
fee294552f1b is missing
BTRFS error (device vdb6): failed to read the system array: -5
BTRFS error (device vdb6): open_ctree failed
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Mar 2017 01:34:41 +0000 (09:34 +0800)]
btrfs: Cleanup num_tolerated_disk_barrier_failures
As we use per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.
We can now remove it.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 27 Jun 2017 09:28:40 +0000 (17:28 +0800)]
btrfs: Allow barrier_all_devices to do chunk level device check
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can be easily changed to the new per-chunk degradable check
framework.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Mar 2017 01:34:38 +0000 (09:34 +0800)]
btrfs: Do chunk level check for degraded remount
Just the same for mount time check, use btrfs_check_rw_degradable() to
check if we are OK to be remounted rw.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Mar 2017 01:34:37 +0000 (09:34 +0800)]
btrfs: Do chunk level check for degraded rw mount
Now use the btrfs_check_rw_degradable() to check if we can mount in the
degraded mode.
With this patch, we can mount in the following case:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
# wipefs -a /dev/sdc
# mount /dev/sdb /mnt/btrfs -o degraded
As the single data chunk is only on sdb, so it's OK to mount as
degraded, as missing one device is OK for RAID1.
But still fail in the following case as expected:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
# wipefs -a /dev/sdb
# mount /dev/sdc /mnt/btrfs -o degraded
As the data chunk is only in sdb, so it's not OK to mount it as
degraded.
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 9 Mar 2017 01:34:36 +0000 (09:34 +0800)]
btrfs: Introduce a function to check if all chunks a OK for degraded rw mount
Introduce a new function, btrfs_check_rw_degradable(), to check if all
chunks in btrfs is OK for degraded rw mount.
It provides the new basis for accurate btrfs mount/remount and even
runtime degraded mount check other than old one-size-fit-all method.
Btrfs currently uses num_tolerated_disk_barrier_failures to do global
check for tolerated missing device.
Although the one-size-fit-all solution is quite safe, it's too strict
if data and metadata has different duplication level.
For example, if one use Single data and RAID1 metadata for 2 disks, it
means any missing device will make the fs unable to be degraded
mounted.
But in fact, some times all single chunks may be in the existing
device and in that case, we should allow it to be rw degraded mounted.
Such case can be easily reproduced using the following script:
# mkfs.btrfs -f -m raid1 -d sing /dev/sdb /dev/sdc
# wipefs -f /dev/sdc
# mount /dev/sdb -o degraded,rw
If using btrfs-debug-tree to check /dev/sdb, one should find that the
data chunk is only in sdb, so in fact it should allow degraded mount.
This patchset will introduce a new per-chunk degradable check for
btrfs, allow above case to succeed, and it's quite small anyway.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied text from cover letter with more details about the problem being
solved ]
Signed-off-by: David Sterba <dsterba@suse.com>
Liu Bo [Tue, 11 Jul 2017 20:43:16 +0000 (14:43 -0600)]
Btrfs: report errors when checksum is not found
When btrfs fails the checksum check, it'll fill the whole page with
"1".
However, if %csum_expected is 0 (which means there is no checksum), then
for some unknown reason, we just pretend that the read is correct, so
userspace would be confused about the dilemma that read is successful but
getting a page with all content being "1".
This can happen due to a bug in btrfs-convert.
This fixes it by always returning errors if checksum doesn't match.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jul 2017 13:55:51 +0000 (16:55 +0300)]
btrfs: Prevent possible ERR_PTR() dereference
In btrfs_full_stripe_len/btrfs_is_parity_mirror we have similar code which
gets the chunk map for a particular range via get_chunk_map. However,
get_chunk_map can return an ERR_PTR value and while the 2 callers do catch
this with a WARN_ON they then proceed to indiscriminately dereference the
extent map. This of course leads to a crash. Fix the offenders by making the
dereference conditional on IS_ERR.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jul 2017 10:47:50 +0000 (13:47 +0300)]
btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemand
Many commits ago the data space_info in alloc_data_chunk_ondemand used to be
acquired from the inode. At that point commit
33b4d47f5e24 ("Btrfs: deal with NULL space info") got introduced to deal with
spurios cases where the space info could be null, following a rebalance.
Nowadays, however, the space info is referenced directly from the btrfs_fs_info
struct which is initialised at filesystem mount time. This makes the null
checks redundant, so remove them.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jul 2017 10:25:13 +0000 (13:25 +0300)]
btrfs: Remove redundant argument of flush_space
All callers of flush_space pass the same number for orig/num_bytes
arguments. Let's remove one of the numbers and also modify the trace
point to show only a single number - bytes requested.
Seems that last point where the two parameters were treated differently
is before the ticketed enospc rework.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Aleksa Sarai [Tue, 4 Jul 2017 11:49:06 +0000 (21:49 +1000)]
btrfs: resume qgroup rescan on rw remount
Several distributions mount the "proper root" as ro during initrd and
then remount it as rw before pivot_root(2). Thus, if a rescan had been
aborted by a previous shutdown, the rescan would never be resumed.
This issue would manifest itself as several btrfs ioctl(2)s causing the
entire machine to hang when btrfs_qgroup_wait_for_completion was hit
(due to the fs_info->qgroup_rescan_running flag being set but the rescan
itself not being resumed). Notably, Docker's btrfs storage driver makes
regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT
(causing this problem to be manifested on boot for some machines).
Cc: <stable@vger.kernel.org> # v3.11+
Cc: Jeff Mahoney <jeffm@suse.com>
Fixes: b382a324b60f ("Btrfs: fix qgroup rescan resume on mount")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Wed, 12 Jul 2017 22:20:11 +0000 (16:20 -0600)]
btrfs: clean up extraneous computations in add_delayed_refs
Repeating the same computation in multiple places is not
necessary.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Wed, 12 Jul 2017 22:20:10 +0000 (16:20 -0600)]
btrfs: allow backref search checks for shared extents
When called with a struct share_check, find_parent_nodes()
will detect a shared extent and immediately return with
BACKREF_SHARED_FOUND.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Wed, 12 Jul 2017 22:20:09 +0000 (16:20 -0600)]
btrfs: add cond_resched() calls when resolving backrefs
Since backref resolution is CPU-intensive, the cond_resched calls
should help alleviate soft lockup occurences.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 12 Jul 2017 22:20:08 +0000 (16:20 -0600)]
btrfs: backref, add tracepoints for prelim_ref insertion and merging
This patch adds a tracepoint event for prelim_ref insertion and
merging. For each, the ref being inserted or merged and the count
of tree nodes is issued.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Wed, 12 Jul 2017 22:20:07 +0000 (16:20 -0600)]
btrfs: add a node counter to each of the rbtrees
This patch adds counters to each of the rbtrees so that we can tell
how large they are growing for a given workload. These counters
will be exported by tracepoints in the next patch.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Wed, 12 Jul 2017 22:20:06 +0000 (16:20 -0600)]
btrfs: convert prelimary reference tracking to use rbtrees
It's been known for a while that the use of multiple lists
that are periodically merged was an algorithmic problem within
btrfs. There are several workloads that don't complete in any
reasonable amount of time (e.g. btrfs/130) and others that cause
soft lockups.
The solution is to use a set of rbtrees that do insertion merging
for both indirect and direct refs, with the former converting
refs into the latter. The result is a btrfs/130 workload that
used to take several hours now takes about half of that. This
runtime still isn't acceptable and a future patch will address that
by moving the rbtrees higher in the stack so the lookups can be
shared across multiple calls to find_parent_nodes.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Thu, 29 Jun 2017 03:56:59 +0000 (21:56 -0600)]
btrfs: remove ref_tree implementation from backref.c
Commit
afce772e87c3 ("btrfs: fix check_shared for fiemap ioctl") added
the ref_tree code in backref.c to reduce backref searching for
shared extents under the FIEMAP ioctl. This code will not be
compatible with the upcoming rbtree changes for improved backref
searching, so this patch removes the ref_tree code. The rbtree
changes will provide the equivalent functionality for FIEMAP.
The above commit also introduced transaction semantics around calls to
btrfs_check_shared() in order to accurately account for delayed refs.
This functionality needs to be retained, so a complete revert of the
above commit is not desirable. This patch therefore removes the
ref_tree portion of the commit as above, however it does not remove
the transaction portion.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Edmund Nadolski [Thu, 29 Jun 2017 03:56:58 +0000 (21:56 -0600)]
btrfs: btrfs_check_shared should manage its own transaction
Commit
afce772e87c3 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 29 Jun 2017 03:56:57 +0000 (21:56 -0600)]
btrfs: backref, cleanup __ namespace abuse
We typically use __ to indicate a helper routine that shouldn't be
called directly without understanding the proper context required
to do so. We use static functions to indicate that a function is
private to a particular C file. The backref code uses static
function and __ prefixes on nearly everything, which makes the code
difficult to read and establishes a pattern for future code that
shouldn't be followed. This patch drops all the unnecessary prefixes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 29 Jun 2017 03:56:56 +0000 (21:56 -0600)]
btrfs: backref, add unode_aux_to_inode_list helper
Replacing the double cast and ternary conditional with a helper makes
the code easier on the eyes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 29 Jun 2017 03:56:55 +0000 (21:56 -0600)]
btrfs: backref, constify some arguments
This constifies a few buffers used in the backref code.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 29 Jun 2017 03:56:54 +0000 (21:56 -0600)]
btrfs: constify tracepoint arguments
Tracepoint arguments are all read-only. If we mark the arguments
as const, we're able to keep or convert those arguments to const
where appropriate.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jeff Mahoney [Thu, 29 Jun 2017 03:56:53 +0000 (21:56 -0600)]
btrfs: struct-funcs, constify readers
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.
No impact on code, but serves as documentation that a buffer is intended
not to be modified.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Wed, 28 Jun 2017 08:05:22 +0000 (11:05 +0300)]
btrfs: remove unused sectorsize member
The sectorsize member of btrfs_block_group_cache is unused. So remove it, this
reduces the number of holes in the struct.
With patch:
/* size: 856, cachelines: 14, members: 40 */
/* sum members: 837, holes: 4, sum holes: 19 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 24 bytes */
Without patch:
/* size: 864, cachelines: 14, members: 41 */
/* sum members: 841, holes: 5, sum holes: 23 */
/* bit holes: 1, sum bit holes: 29 bits */
/* last cacheline: 32 bytes */
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 27 Jun 2017 07:02:26 +0000 (10:02 +0300)]
btrfs: Be explicit about usage of min()
__btrfs_alloc_chunk contains code which boils down to:
ndevs = min(ndevs, devs_max)
It's conditional upon devs_max not being 0. However, it cannot really be 0
since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or
BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use
min explicitly. This has no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 27 Jun 2017 07:02:25 +0000 (10:02 +0300)]
btrfs: Use explicit round_down call rather than open-coding it
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 27 Jun 2017 07:02:24 +0000 (10:02 +0300)]
btrfs: convert while loop to list_for_each_entry
No functional changes, just make the loop a bit more readable
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 13 Aug 2017 23:01:32 +0000 (16:01 -0700)]
Linux 4.13-rc5
Linus Torvalds [Sun, 13 Aug 2017 22:34:28 +0000 (15:34 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Another round of MIPS fixes:
- compressed boot: Ignore a generated .c file
- VDSO: Fix a register clobber list
- DECstation: Fix an int-handler.S CPU_DADDI_WORKAROUNDS regression
- Octeon: Fix recent cleanups that cleaned away a bit too much thus
breaking the arch side of the EDAC and USB drivers.
- uasm: Fix duplicate const in "const struct foo const bar[]" which
GCC 7.1 no longer accepts.
- Fix race on setting and getting cpu_online_mask
- Fix preemption issue. To do so cleanly introduce macro to get the
size of L3 cache line.
- Revert include cleanup that sometimes results in build error
- MicroMIPS uses bit 0 of the PC to indicate microMIPS mode. Make
sure this bit is set for kernel entry as well.
- Prevent configuring the kernel for both microMIPS and MT. There are
no such CPUs currently and thus the combination is unsupported and
results in build errors.
This has been sitting in linux-next for a few days and has survived
automated testing by Imagination's test farm. No known regressions
pending except a number of issues that crept up due to lots of people
switching to GCC 7.1"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Set ISA bit in entry-y for microMIPS kernels
MIPS: Prevent building MT support for microMIPS kernels
MIPS: PCI: Fix smp_processor_id() in preemptible
MIPS: Introduce cpu_tcache_line_size
MIPS: DEC: Fix an int-handler.S CPU_DADDI_WORKAROUNDS regression
MIPS: VDSO: Fix clobber lists in fallback code paths
Revert "MIPS: Don't unnecessarily include kmalloc.h into <asm/cache.h>."
MIPS: OCTEON: Fix USB platform code breakage.
MIPS: Octeon: Fix broken EDAC driver.
MIPS: gitignore: ignore generated .c files
MIPS: Fix race on setting and getting cpu_online_mask
MIPS: mm: remove duplicate "const" qualifier on insn_table
Linus Torvalds [Sun, 13 Aug 2017 19:44:18 +0000 (12:44 -0700)]
Merge tag 'driver-core-4.13-rc5' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are three firmware core fixes for 4.13-rc5.
All three of these fix reported issues and have been floating around
for a few weeks. They have been in linux-next with no reported
problems"
* tag 'driver-core-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
firmware: avoid invalid fallback aborts by using killable wait
firmware: fix batched requests - send wake up on failure on direct lookups
firmware: fix batched requests - wake all waiters
Linus Torvalds [Sun, 13 Aug 2017 19:41:58 +0000 (12:41 -0700)]
Merge tag 'char-misc-4.13-rc5' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are two patches for 4.13-rc5.
One is a fix for a reported thunderbolt issue, and the other a fix for
an MEI driver issue. Both have been in linux-next with no reported
issues"
* tag 'char-misc-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
thunderbolt: Do not enumerate more ports from DROM than the controller has
mei: exclude device from suspend direct complete optimization
Linus Torvalds [Sun, 13 Aug 2017 19:33:35 +0000 (12:33 -0700)]
Merge tag 'tty-4.13-rc5' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are two tty serial driver fixes for 4.13-rc5. One is a revert of
a -rc1 patch that turned out to not be a good idea, and the other is a
fix for the pl011 serial driver.
Both have been in linux-next with no reported issues"
* tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
Revert "serial: Delete dead code for CIR serial ports"
tty: pl011: fix initialization order of QDF2400 E44
Linus Torvalds [Sun, 13 Aug 2017 19:30:17 +0000 (12:30 -0700)]
Merge tag 'staging-4.13-rc5' of git://git./linux/kernel/git/gregkh/staging
Pull staging/iio fixes from Greg KH:
"Here are some Staging and IIO driver fixes for 4.13-rc5.
Nothing major, just a number of small fixes for reported issues. All
of these have been in linux-next for a while now with no reported
issues. Full details are in the shortlog"
* tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: comedi: comedi_fops: do not call blocking ops when !TASK_RUNNING
iio: aspeed-adc: wait for initial sequence.
iio: accel: bmc150: Always restore device to normal mode after suspend-resume
staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read
iio: adc: axp288: Fix the GPADC pin reading often wrongly returning 0
iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits
iio: accel: st_accel: add SPI-3wire support
iio: adc: Revert "axp288: Drop bogus AXP288_ADC_TS_PIN_CTRL register modifications"
iio: adc: sun4i-gpadc-iio: fix unbalanced irq enable/disable
iio: pressure: st_pressure_core: disable multiread by default for LPS22HB
iio: light: tsl2563: use correct event code
Linus Torvalds [Sun, 13 Aug 2017 19:27:42 +0000 (12:27 -0700)]
Merge tag 'usb-4.13-rc5' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of small USB driver fixes and new device ids for
4.13-rc5. There is the usual gadget driver fixes, some new quirks for
"messy" hardware, and some new device ids.
All have been in linux-next with no reported issues"
* tag 'usb-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: serial: pl2303: add new ATEN device id
usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter
USB: Check for dropped connection before switching to full speed
usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume
usb: renesas_usbhs: gadget: fix unused-but-set-variable warning
usb: renesas_usbhs: Fix UGCTRL2 value for R-Car Gen3
usb: phy: phy-msm-usb: Fix usage of devm_regulator_bulk_get()
usb: gadget: udc: renesas_usb3: Fix usb_gadget_giveback_request() calling
usb: dwc3: gadget: Correct ISOC DATA PIDs for short packets
USB: serial: option: add D-Link DWM-222 device ID
usb: musb: fix tx fifo flush handling again
usb: core: unlink urbs from the tail of the endpoint's urb_list
usb-storage: fix deadlock involving host lock and scsi_done
uas: Add US_FL_IGNORE_RESIDUE for Initio Corporation INIC-3069
USB: hcd: Mark secondary HCD as dead if the primary one died
USB: serial: cp210x: add support for Qivicon USB ZigBee dongle
Linus Torvalds [Sat, 12 Aug 2017 23:19:43 +0000 (16:19 -0700)]
Merge tag 'for-linus-
20170812' of git://git.infradead.org/linux-mtd
Pull another MTD fix from Brian Norris:
"An mtdblock regression occurred in -rc1 (all writes were broken!), in
the process of some block subsystem refactoring. Noticed and fixed
last week, but I'm a little slow on the uptake"
* tag 'for-linus-
20170812' of git://git.infradead.org/linux-mtd:
mtd: blkdevs: Fix mtd block write failure
Abhishek Sahu [Wed, 2 Aug 2017 12:33:05 +0000 (18:03 +0530)]
mtd: blkdevs: Fix mtd block write failure
All the MTD block write requests are failing with
following error messages
mkfs.ext4 /dev/mtdblock0
print_req_error: I/O error, dev mtdblock0, sector 0
Buffer I/O error on dev mtdblock0, logical block 0,
lost async page write
The control is going to default case after block write request
because of missing return.
Fixes: commit 2a842acab109 ("block: introduce new block status code type")
Signed-off-by: Abhishek Sahu <absahu@codeaurora.org>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Linus Torvalds [Sat, 12 Aug 2017 19:08:59 +0000 (12:08 -0700)]
Merge git://git./linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
"The highlights include:
- Fix iscsi-target payload memory leak during
ISCSI_FLAG_TEXT_CONTINUE (Varun Prakash)
- Fix tcm_qla2xxx incorrect use of tcm_qla2xxx_free_cmd during ABORT
(Pascal de Bruijn + Himanshu Madhani + nab)
- Fix iscsi-target long-standing issue with parallel delete of a
single network portal across multiple target instances (Gary Guo +
nab)
- Fix target dynamic se_node GPF during uncached shutdown regression
(Justin Maggard + nab)"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target: Fix node_acl demo-mode + uncached dynamic shutdown regression
iscsi-target: Fix iscsi_np reset hung task during parallel delete
qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
cxgbit: fix sg_nents calculation
iscsi-target: fix invalid flags in text response
iscsi-target: fix memory leak in iscsit_setup_text_cmd()
cxgbit: add missing __kfree_skb()
tcmu: free old string on reconfig
tcmu: Fix possible to/from address overflow when doing the memcpy
Linus Torvalds [Sat, 12 Aug 2017 16:01:36 +0000 (09:01 -0700)]
Merge tag 'for-linus-4.13b-rc5-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Some fixes for Xen:
- a fix for a regression introduced in 4.13 for a Xen HVM-guest
configured with KASLR
- a fix for a possible deadlock in the xenbus driver when booting the
system
- a fix for lost interrupts in Xen guests"
* tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: Fix interrupt lost during irq_disable and irq_enable
xen: avoid deadlock in xenbus
xen: fix hvm guest with kaslr enabled
xen: split up xen_hvm_init_shared_info()
x86: provide an init_mem_mapping hypervisor hook
Linus Torvalds [Fri, 11 Aug 2017 20:54:09 +0000 (13:54 -0700)]
Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"A few more NFS client bugfixes from me for rc5.
Dros has a stable fix for flexfiles to prevent leaking the
nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a
potential recovery loop situation with the TEST_STATEID operation, and
Christoph fixed up the pNFS blocklayout Kconfig options to prevent
unsafe use with kernels that don't have large block device support.
Summary:
Stable fix:
- fix leaking nfs4_ff_ds_version array
Other fixes:
- improve TEST_STATEID OLD_STATEID handling to prevent recovery loop
- require 64-bit sector_t for pNFS blocklayout to prevent 32-bit
compile errors"
* tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pnfs/blocklayout: require 64-bit sector_t
NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()
nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
Linus Torvalds [Fri, 11 Aug 2017 19:26:49 +0000 (12:26 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A set of fixes that should go into this series. This contains:
- Fix from Bart for blk-mq requeue queue running, preventing a
continued loop of run/restart.
- Fix for a bio/blk-integrity issue, in two parts. One from
Christoph, fixing where verification happens, and one from Milan,
for a NULL profile.
- NVMe pull request, most of the changes being for nvme-fc, but also
a few trivial core/pci fixes"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: fix directive command numd calculation
nvme: fix nvme reset command timeout handling
nvme-pci: fix CMB sysfs file removal in reset path
lpfc: support nvmet_fc defer_rcv callback
nvmet_fc: add defer_req callback for deferment of cmd buffer return
nvme: strip trailing 0-bytes in wwid_show
block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time
bio-integrity: only verify integrity on the lowest stacked driver
bio-integrity: Fix regression if profile verify_fn is NULL
Linus Torvalds [Fri, 11 Aug 2017 18:56:54 +0000 (11:56 -0700)]
Merge tag 'mmc-v4.13-rc4' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- fix lockdep splat when removing mmc_block module
- fix the logic for setting eMMC HS400ES signal voltage
MMC host:
- omap_hsmmc: add CMD23 capability to fix -EIO errors"
* tag 'mmc-v4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: block: fix lockdep splat when removing mmc_block module
mmc: mmc: correct the logic for setting HS400ES signal voltage
mmc: host: omap_hsmmc: Add CMD23 capability to omap_hsmmc driver
Linus Torvalds [Fri, 11 Aug 2017 18:44:18 +0000 (11:44 -0700)]
Merge tag 'fbdev-v4.13-rc5' of git://github.com/bzolnier/linux
Pull fbdev fixes from Bartlomiej Zolnierkiewicz:
- allow user to disable write combined mapping in efifb driver (Dave
Airlie)
- fix use after free bugs on driver removal in imxfb driver (Dan
Carpenter)
- fix unused variable warning in omapfb driver (Arnd Bergmann)
* tag 'fbdev-v4.13-rc5' of git://github.com/bzolnier/linux:
efifb: allow user to disable write combined mapping.
fbdev: omapfb: remove unused variable
video: fbdev: imxfb: use after free in imxfb_remove()
Linus Torvalds [Fri, 11 Aug 2017 18:20:48 +0000 (11:20 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Fix a few bugs in fuse"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: set mapping error in writepage_locked when it fails
fuse: Dont call set_page_dirty_lock() for ITER_BVEC pages for async_dio
fuse: initialize the flock flag in fuse_file on allocation
Linus Torvalds [Fri, 11 Aug 2017 18:15:51 +0000 (11:15 -0700)]
Merge tag 'iommu-fixes-v4.13-rc4' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fix from Joerg Roedel:
"Fix a NULL-pointer dereference in arm_smmu_add_device"
* tag 'iommu-fixes-v4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/arm-smmu: fix null-pointer dereference in arm_smmu_add_device
Christoph Hellwig [Sat, 5 Aug 2017 08:59:14 +0000 (10:59 +0200)]
pnfs/blocklayout: require 64-bit sector_t
The blocklayout code does not compile cleanly for a 32-bit sector_t,
and also has no reliable checks for devices sizes, which makes it
unsafe to use with a kernel that doesn't support large block devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Linus Torvalds [Fri, 11 Aug 2017 15:56:01 +0000 (08:56 -0700)]
Merge tag 'powerpc-4.13-6' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"All fixes for code that went in this cycle.
- a revert of an optimisation to the syscall exit path, which could
lead to an oops on either older machines or machines with > 1TB of
memory
- disable some deep idle states if the firmware configuration for
them fails
- re-enable HARD/SOFT lockup detectors in defconfigs after a Kconfig
change
- six fairly small patches fixing bugs in our new watchdog code
Thanks to: Gautham R Shenoy, Nicholas Piggin"
* tag 'powerpc-4.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/watchdog: add locking around init/exit functions
powerpc/watchdog: Fix marking of stuck CPUs
powerpc/watchdog: Fix final-check recovered case
powerpc/watchdog: Moderate touch_nmi_watchdog overhead
powerpc/watchdog: Improve watchdog lock primitive
powerpc: NMI IPI improve lock primitive
powerpc/configs: Re-enable HARD/SOFT lockup detectors
powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails
Revert "powerpc/64: Avoid restore_math call if possible in syscall exit"
Artem Savkov [Tue, 8 Aug 2017 10:26:02 +0000 (12:26 +0200)]
iommu/arm-smmu: fix null-pointer dereference in arm_smmu_add_device
Commit
c54451a "iommu/arm-smmu: Fix the error path in arm_smmu_add_device"
removed fwspec assignment in legacy_binding path as redundant which is
wrong. It needs to be updated after fwspec initialisation in
arm_smmu_register_legacy_master() as it is dereferenced later. Without
this there is a NULL-pointer dereference panic during boot on some hosts.
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Liu Shuo [Sat, 29 Jul 2017 16:59:57 +0000 (00:59 +0800)]
xen/events: Fix interrupt lost during irq_disable and irq_enable
Here is a device has xen-pirq-MSI interrupt. Dom0 might lost interrupt
during driver irq_disable/irq_enable. Here is the scenario,
1. irq_disable -> disable_dynirq -> mask_evtchn(irq channel)
2. dev interrupt raised by HW and Xen mark its evtchn as pending
3. irq_enable -> startup_pirq -> eoi_pirq ->
clear_evtchn(channel of irq) -> clear pending status
4. consume_one_event process the irq event without pending bit assert
which result in interrupt lost once
5. No HW interrupt raising anymore.
Now use enable_dynirq for enable_pirq of xen_pirq_chip to remove
eoi_pirq when irq_enable.
Signed-off-by: Liu Shuo <shuo.a.liu@intel.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Fri, 28 Jul 2017 14:53:55 +0000 (16:53 +0200)]
xen: avoid deadlock in xenbus
When starting the xenwatch thread a theoretical deadlock situation is
possible:
xs_init() contains:
task = kthread_run(xenwatch_thread, NULL, "xenwatch");
if (IS_ERR(task))
return PTR_ERR(task);
xenwatch_pid = task->pid;
And xenwatch_thread() does:
mutex_lock(&xenwatch_mutex);
...
event->handle->callback();
...
mutex_unlock(&xenwatch_mutex);
The callback could call unregister_xenbus_watch() which does:
...
if (current->pid != xenwatch_pid)
mutex_lock(&xenwatch_mutex);
...
In case a watch is firing before xenwatch_pid could be set and the
callback of that watch unregisters a watch, then a self-deadlock would
occur.
Avoid this by setting xenwatch_pid in xenwatch_thread().
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Jens Axboe [Fri, 11 Aug 2017 14:07:19 +0000 (08:07 -0600)]
Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph:
"A few more small fixes - the fc/lpfc update is the biggest by far."
Juergen Gross [Fri, 28 Jul 2017 10:23:14 +0000 (12:23 +0200)]
xen: fix hvm guest with kaslr enabled
A Xen HVM guest running with KASLR enabled will die rather soon today
because the shared info page mapping is using va() too early. This was
introduced by commit
a5d5f328b0e2baa5ee7c119fd66324eb79eeeb66 ("xen:
allocate page for shared info page from low memory").
In order to fix this use early_memremap() to get a temporary virtual
address for shared info until va() can be used safely.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Fri, 28 Jul 2017 10:23:13 +0000 (12:23 +0200)]
xen: split up xen_hvm_init_shared_info()
Instead of calling xen_hvm_init_shared_info() on boot and resume split
it up into a boot time function searching for the pfn to use and a
mapping function doing the hypervisor mapping call.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Fri, 28 Jul 2017 10:23:12 +0000 (12:23 +0200)]
x86: provide an init_mem_mapping hypervisor hook
Provide a hook in hypervisor_x86 called after setting up initial
memory mapping.
This is needed e.g. by Xen HVM guests to map the hypervisor shared
info page.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Jeff Layton [Thu, 25 May 2017 10:57:50 +0000 (06:57 -0400)]
fuse: set mapping error in writepage_locked when it fails
This ensures that we see errors on fsync when writeback fails.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Linus Torvalds [Fri, 11 Aug 2017 05:33:47 +0000 (22:33 -0700)]
Merge tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Nothing too earth shattering here, it just seems like lots of little
things all over the place.
msm has probably the larger amount of changes, but they all seem fine,
otherwise, some rockchip, i915, etnaviv and exynos fixes, along with
one nouveau regression fix for some older GPUs"
* tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux: (35 commits)
drm/nouveau/disp/nv04: avoid creation of output paths
drm: make DRM_STM default n
drm/exynos: forbid creating framebuffers from too small GEM buffers
drm/etnaviv: Fix off-by-one error in reloc checking
drm/i915: fix backlight invert for non-zero minimum brightness
drm/i915/shrinker: Wrap need_resched() inside preempt-disable
drm/i915/perf: fix flex eu registers programming
drm/i915: Fix out-of-bounds array access in bdw_load_gamma_lut
drm/i915/gvt: Change the max length of mmio_reg_rw from 4 to 8
drm/i915/gvt: Initialize MMIO Block with HW state
drm/rockchip: vop: report error when check resource error
drm/rockchip: vop: round_up pitches to word align
drm/rockchip: vop: fix NV12 video display error
drm/rockchip: vop: fix iommu page fault when resume
drm/i915/gvt: clean workload queue if error happened
drm/i915/gvt: change resetting to resetting_eng
drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations
drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM
drm/msm/adreno: Prevent unclocked access when retrieving timestamps
drm/msm: Remove __user from __u64 data types
...
Linus Torvalds [Thu, 10 Aug 2017 23:20:52 +0000 (16:20 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"21 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
zram: rework copy of compressor name in comp_algorithm_store()
rmap: do not call mmu_notifier_invalidate_page() under ptl
mm: fix list corruptions on shmem shrinklist
mm/balloon_compaction.c: don't zero ballooned pages
MAINTAINERS: copy virtio on balloon_compaction.c
mm: fix KSM data corruption
mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
mm: make tlb_flush_pending global
mm: refactor TLB gathering API
Revert "mm: numa: defer TLB flush for THP migration as long as possible"
mm: migrate: fix barriers around tlb_flush_pending
mm: migrate: prevent racy access to tlb_flush_pending
fault-inject: fix wrong should_fail() decision in task context
test_kmod: fix small memory leak on filesystem tests
test_kmod: fix the lock in register_test_dev_kmod()
test_kmod: fix bug which allows negative values on two config options
test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
mm: ratelimit PFNs busy info message
...
Mike Rapoport [Thu, 10 Aug 2017 22:24:32 +0000 (15:24 -0700)]
userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
When the process exit races with outstanding mcopy_atomic, it would be
better to return ESRCH error. When such race occurs the process and
it's mm are going away and returning "no such process" to the uffd
monitor seems better fit than ENOSPC.
Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthias Kaehlcke [Thu, 10 Aug 2017 22:24:29 +0000 (15:24 -0700)]
zram: rework copy of compressor name in comp_algorithm_store()
comp_algorithm_store() passes the size of the source buffer to strlcpy()
instead of the destination buffer size. Make it explicit that the two
buffers have the same size and use strcpy() instead of strlcpy(). The
latter can be done safely since the function ensures that the string in
the source buffer is terminated.
Link: http://lkml.kernel.org/r/20170803163350.45245-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill A. Shutemov [Thu, 10 Aug 2017 22:24:27 +0000 (15:24 -0700)]
rmap: do not call mmu_notifier_invalidate_page() under ptl
MMU notifiers can sleep, but in page_mkclean_one() we call
mmu_notifier_invalidate_page() under page table lock.
Let's instead use mmu_notifier_invalidate_range() outside
page_vma_mapped_walk() loop.
[jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reported-by: axie <axie@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Writer, Tim" <Tim.Writer@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>