John Pittman [Tue, 30 Jan 2018 21:39:00 +0000 (16:39 -0500)]
dm cache: Documentation: update default migration_throttling value
In commit
f8350daf7af0 ("dm cache: tune migration throttling") the
value for DEFAULT_MIGRATION_THRESHOLD was decreased from 204800 to
2048. Edit device-mapper/cache.txt to reflect the correct default
value for migration_threshold.
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Khazhismel Kumykov [Fri, 19 Jan 2018 23:07:37 +0000 (15:07 -0800)]
dm mpath selector: more evenly distribute ties
Move the last used path to the end of the list (least preferred) so that
ties are more evenly distributed.
For example, in case with three paths with one that is slower than
others, the remaining two would be unevenly used if they tie. This is
due to the rotation not being a truely fair distribution.
Illustrated: paths a, b, c, 'c' has 1 outstanding IO, a and b are 'tied'
Three possible rotations:
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
...
So 'a' is used 2x more than 'b', although they should be used evenly.
With this change, the most recently used path is always the least
preferred, removing this bias resulting in even distribution.
(a, b, c) -> best path 'a'
(b, c, a) -> best path 'b'
(c, a, b) -> best path 'a'
(c, b, a) -> best path 'b'
...
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Scott Bauer [Tue, 23 Jan 2018 17:55:18 +0000 (10:55 -0700)]
dm unstripe: fix target length versus number of stripes size check
Since the unstripe target takes a target length which is the
size of *one* striped member we're trying to expose, not the
total size of *all* the striped members, the check does not
make sense and fails for some striped setups.
For example, say we have a 4TB striped device:
or
3907018496 sectors per underlying device:
if (sector_div(width, uc->stripes)) :
3907018496 / 2(num stripes) ==
1953509248
tmp_len = width;
if (sector_div(tmp_len, uc->chunk_size)) :
1953509248 / 256(chunk size) ==
7630895.5
(fails)
Fix this by removing the first check which isn't valid for unstriping.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Luis de Bethencourt [Wed, 17 Jan 2018 15:09:25 +0000 (15:09 +0000)]
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
The trailing semicolon is an empty statement that does no operation.
Removing it since it doesn't do anything.
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sat, 13 Jan 2018 19:33:30 +0000 (14:33 -0500)]
dm table: fix NVMe bio-based dm_table_determine_type() validation
The 'verify_rq_based:' code in dm_table_determine_type() was checking
all devices in the DM table rather than only checking the data devices.
Fix this by using the immutable target's iterate_devices method.
Also, tweak the block of dm_table_determine_type() code that decides
whether to upgrade from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED so
that it makes sure the immutable_target doesn't support require
splitting IOs.
These changes have been verified to allow a "thin-pool" target whose
data device is an NVMe device to be upgraded to DM_TYPE_NVME_BIO_BASED.
Using the thin-pool in NVMe bio-based mode was verified to pass all the
device-mapper-test-suite's "thin-provisioning" tests.
Also verified that request-based DM multipath (with queue_mode "rq" and
"mq") works as expected using the 'mptest' harness.
Fixes: 22c11858e ("dm: introduce DM_TYPE_NVME_BIO_BASED")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Fri, 12 Jan 2018 14:32:21 +0000 (09:32 -0500)]
dm: various cleanups to md->queue initialization code
Also, add dm_sysfs_init() error handling to dm_create().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sat, 13 Jan 2018 00:53:40 +0000 (19:53 -0500)]
dm mpath: delay the retry of a request if the target responded as busy
Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
This is beneficial to do if BLK_STS_RESOURCE is returned from the target
(because target is busy).
Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.
For old .request_fn: use blk_delay_queue().
bio-based multipath doesn't have feature parity with request-based for
retryable error requeues; that is something that'll need fixing in the
future.
Suggested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
[as interpreted from Bart's "... patch looks fine to me."]
Ming Lei [Thu, 11 Jan 2018 06:01:55 +0000 (14:01 +0800)]
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
Avoid using DM_MAPIO_REQUEUE unless absolutely necessary because it
results in dm-rq.c:dm_mq_queue_rq() returning BLK_STS_RESOURCE to
blk-mq -- doing so should only ever be done if the underlying queue is
out of resources. So switch to returning DM_MAPIO_DELAY_REQUEUE from
multipath_clone_and_map() if either MPATHF_QUEUE_IO or
MPATHF_PG_INIT_REQUIRED are set.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ming Lei [Thu, 11 Jan 2018 06:01:56 +0000 (14:01 +0800)]
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
blk-mq will rerun queue via RESTART or dispatch wake after one request
is completed, so not necessary to wait random time for requeuing, we
should trust blk-mq to do it.
More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, this results in
improved I/O merging.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ma Shimiao [Tue, 12 Dec 2017 09:39:10 +0000 (17:39 +0800)]
dm log writes: fix max length used for kstrndup
If source string is longer than max, kstrndup will allocate max+1
space. So make sure the result will not exceed max.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sat, 6 Jan 2018 02:17:20 +0000 (21:17 -0500)]
dm: backfill missing calls to mutex_destroy()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Thu, 23 Nov 2017 21:15:43 +0000 (16:15 -0500)]
dm snapshot: use mutex instead of rw_semaphore
The rw_semaphore is acquired for read only in two places, neither is
performance-critical. So replace it with a mutex -- which is more
efficient.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Goldwyn Rodrigues [Mon, 4 Dec 2017 03:14:12 +0000 (21:14 -0600)]
dm flakey: check for null arg_name in parse_features()
One can crash dm-flakey by specifying more feature arguments than the
number of features supplied. Checking for null in arg_name avoids
this.
dmsetup create flakey-test --table "0
66076080 flakey /dev/sdb9 0 0 180 2 drop_writes"
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:45 +0000 (10:02 -0500)]
dm thin: extend thinpool status format string with omitted fields
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:44 +0000 (10:02 -0500)]
dm thin: fixes in thin-provisioning.txt
Make the format string for thinpool status more correct.
Swap the order of two items to correspond with reality.
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:43 +0000 (10:02 -0500)]
dm thin: document representation of <highest mapped sector> when there is none
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:39 +0000 (10:02 -0500)]
dm thin: fix documentation relative to low water mark threshold
Fixes:
1. The use of "exceeds" when the opposite of exceeds, falls below,
was meant.
2. Properly speaking, a table can not exceed a threshold.
It emphasizes the important point, which is that it is the userspace
daemon's responsibility to check for low free space when a device
is resumed, since it won't get a special event indicating low free
space in that situation.
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:42 +0000 (10:02 -0500)]
dm cache: be consistent in specifying sectors and SI units in cache.txt
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:41 +0000 (10:02 -0500)]
dm cache: delete obsoleted paragraph in cache.txt
The 'mq' policy is no longer the default policy, and the default policy,
'smq', does not store hit counts.
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
mulhern [Mon, 27 Nov 2017 15:02:40 +0000 (10:02 -0500)]
dm cache: fix grammar in cache-policies.txt
Use possessive pronoun where appropriate, instead of contraction.
Signed-off-by: mulhern <amulhern@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Wed, 2 Dec 2015 17:32:49 +0000 (12:32 -0500)]
dm snapshot: improve documentation relative to origin suspend requirements
Add a note to snapshot.txt that the origin target must be suspended when
loading or unloading the snapshot target.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Brian Norris [Tue, 28 Mar 2017 18:31:02 +0000 (11:31 -0700)]
dm: move dm_table_destroy() to same header as dm_table_create()
If anyone is going to use dm_table_create(), they probably should be
able to use dm_table_destroy() too. Move the dm_table_destroy()
definition outside the private header, near dm_table_create()
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Wei Yongjun [Tue, 2 Jan 2018 06:18:10 +0000 (06:18 +0000)]
dm raid: make raid_sets symbol static
Fixes the following sparse warning:
drivers/md/dm-raid.c:33:1: warning:
symbol 'raid_sets' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Thu, 4 Jan 2018 17:14:57 +0000 (12:14 -0500)]
dm bufio: eliminate unnecessary labels in dm_bufio_client_create()
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Aliaksei Karaliou [Sat, 23 Dec 2017 10:27:04 +0000 (13:27 +0300)]
dm bufio: check result of register_shrinker()
dm_bufio_client_create() does not check result of register_shrinker()
which was tagged as __must_check recently, reported by sparse.
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Aliaksei Karaliou [Sat, 23 Dec 2017 10:27:03 +0000 (13:27 +0300)]
dm bufio: add missed destroys of client mutex
The client's mutex needs to be destroyed in dm_bufio_client_destroy() as
well as the dm_bufio_client_create() error path.
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Sat, 2 Dec 2017 05:33:39 +0000 (00:33 -0500)]
dm bufio: use REQ_OP_READ and REQ_OP_WRITE
Use REQ_OP_READ and REQ_OP_WRITE macros instead of READ and WRITE. They
have the same value, but the block layer uses REQ_OP so bufio should
too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Scott Bauer [Mon, 18 Dec 2017 17:28:08 +0000 (10:28 -0700)]
dm: add unstriped target
This device mapper "unstriped" target remaps and unstripes I/O so it
is issued solely on a single drive in a HW RAID0 or dm-striped target.
In a 4 drive HW RAID0 the striped target exposes 1/4th of the LBA range
as a virtual drive. Each I/O to that virtual drive will only be issued
to the 1 drive that was selected of the 4 drives in the HW RAID0.
This unstriped target is most useful for Intel NVMe drives that have
multiple cores but that do not have firmware control to pin separate LBA
ranges to each discrete cpu core.
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 11 Dec 2017 16:02:29 +0000 (11:02 -0500)]
dm mpath: factor out SCSI vs NVMe path selection
Trying to do both SCSI and NVMe bio-based handling with branching in the
same common code has proven too tedious on a code maintenance level. In
addition it slightly hurts IO performance.
Fix this by factoring out __map_bio() and __map_bio_nvme().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sun, 10 Dec 2017 20:37:21 +0000 (15:37 -0500)]
dm mpath: optimize NVMe bio-based support
All code that deals with pg_init is not used with bio-based NVMe mode.
This includes skipping initialization of pg_init related variables.
Also, pg_init related members on 'struct multipath' have been grouped
together.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 5 Dec 2017 21:02:21 +0000 (16:02 -0500)]
dm mpath: implement NVMe bio-based support
This DM multipath NVMe bio-based support requires CONFIG_NVME_MULTIPATH
to not be set. In the future hopefully NVMe multipath and DM multipath
can co-exist more seemlessly. But as is, if CONFIG_NVME_MULTIPATH=Y
then all the individal NVMe paths will remain hidden to upper layers and
as such DM multipath will not be able to manage them.
Though NVMe's native multipathing doesn't multipath namespaces across
subsystems; so technically a user _could_ use CONFIG_NVME_MULTIPATH=Y
and also use DM multipath to multipath across subsystems.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Wed, 6 Dec 2017 21:08:14 +0000 (16:08 -0500)]
dm mpath: move dm_bio_restore out of endio method
Moving the dm_bio_restore() to process_queued_bios() avoids doing that
work in multipath_end_io_bio().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 11 Dec 2017 21:08:54 +0000 (16:08 -0500)]
dm mpath: optimize retrieval of bio_details from per-bio-data
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 11 Dec 2017 20:58:41 +0000 (15:58 -0500)]
dm mpath: remove unnecessary memset() calls for per-io-data
All underlying members are initialized directly so the memset() calls
are not needed. Also, initialize mpio->nr_bytes from the start since it
never changes.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 5 Dec 2017 19:10:33 +0000 (14:10 -0500)]
dm mpath: remove unused param from multipath_init_per_bio_data()
'struct dm_bio_details *' isn't ever needed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sat, 9 Dec 2017 20:16:42 +0000 (15:16 -0500)]
dm: optimize bio-based NVMe IO submission
Upper level bio-based drivers that stack immediately ontop of NVMe can
leverage direct_make_request(). In addition DM's NVMe bio-based
will initially only ever have one NVMe device that it submits IO to at a
time. There is no splitting needed. Enhance DM core so that
DM_TYPE_NVME_BIO_BASED's IO submission takes advantage of both of these
characteristics.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 5 Dec 2017 02:07:37 +0000 (21:07 -0500)]
dm: introduce DM_TYPE_NVME_BIO_BASED
If dm_table_determine_type() establishes DM_TYPE_NVME_BIO_BASED then
all devices in the DM table do not support partial completions. Also,
the table has a single immutable target that doesn't require DM core to
split bios.
This will enable adding NVMe optimizations to bio-based DM.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sun, 17 Dec 2017 16:56:48 +0000 (11:56 -0500)]
dm: simplify start of block stats accounting for bio-based
No apparent need to generic_start_io_acct() until before the IO is ready
for submission. start_io_acct() is the proper place to do this
accounting -- it is also where DM accounts for pending IO and, if
enabled, starts dm-stats accounting.
Replace start_io_acct()'s part_round_stats() with generic_start_io_acct().
This eliminates needing to take part_stat_lock() multiple times when
starting an IO on bio-based devices.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Thu, 14 Dec 2017 21:30:42 +0000 (16:30 -0500)]
dm: remove redundant mapped_device member from clone_info structure
'struct dm_io' already has the same pointer. So update all accesses
from ci->md to ci->io->md.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 12 Dec 2017 04:28:13 +0000 (23:28 -0500)]
dm: remove now unused bio-based io_pool and _io_cache
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 12 Dec 2017 04:17:47 +0000 (23:17 -0500)]
dm: improve performance by moving dm_io structure to per-bio-data
Eliminates need for a separate mempool to allocate 'struct dm_io'
objects from. As such, it saves an extra mempool allocation for each
original bio that DM core is issued.
This complicates the per-bio-data accessor functions by needing to
conditonally add extra padding to get to a target's per-bio-data. But
in the end this provides a decent performance improvement for all
bio-based DM devices.
On an NVMe-loop based testbed to a ramdisk (~3100 MB/s): bio-based
DM linear performance improved by 2% (went from 2665 to 2777 MB/s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 12 Dec 2017 01:51:50 +0000 (20:51 -0500)]
dm: rename 'bio' member of dm_io structure to 'orig_bio'
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sun, 10 Dec 2017 01:38:16 +0000 (20:38 -0500)]
dm: remove stale comment blocks
These CRUD comments have worn out their welcome. The code is what it
is, over time it'll hopefully get better. But these comments serve no
purpose whatsoever.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Tue, 5 Dec 2017 04:28:32 +0000 (23:28 -0500)]
dm: set QUEUE_FLAG_DAX accordingly in dm_table_set_restrictions()
Rather than having DAX support be unique by setting it based on table
type in dm_setup_md_queue().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Fri, 8 Dec 2017 20:02:11 +0000 (15:02 -0500)]
dm: fix __send_changing_extent_only() to send first bio and chain remainder
__send_changing_extent_only() must follow the same pattern that was
established with commit "dm: ensure bio submission follows a depth-first
tree walk". That is: submit first bio up to split boundary and then
split the remainder to further submissions.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Fri, 8 Dec 2017 19:40:52 +0000 (14:40 -0500)]
dm: ensure bio-based DM's bioset and io_pool support targets' maximum IOs
alloc_multiple_bios() assumes it can allocate the requested number of
bios but until now there was no gaurantee that the mempools would be
accomodating.
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Wed, 22 Nov 2017 20:37:43 +0000 (15:37 -0500)]
dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure
Now that all of DM has been revised and/or verified to no longer require
the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Wed, 22 Nov 2017 19:56:12 +0000 (14:56 -0500)]
dm: safely allocate multiple bioset bios
DM targets can request multiple bios be sent to them by DM core (see:
num_{flush,discard,write_same,write_zeroes}_bios). But until now these
bios were allocated in an unsafe manner than could potentially exhaust
the DM device's bioset -- in the face of multiple threads each trying to
do multiple allocations from the same DM device's bioset.
Fix __send_duplicate_bios() by using the new alloc_multiple_bios(). The
allocation strategy used by alloc_multiple_bios() models that used by
dm-crypt.c:crypt_alloc_buffer().
Neil Brown initially proposed this fix but the implementation has been
revised enough that it inappropriate to attribute the entirety of it to
him.
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Wed, 22 Nov 2017 03:25:18 +0000 (14:25 +1100)]
dm: remove unused 'num_write_bios' target interface
No DM target provides num_write_bios and none has since dm-cache's
brief use in 2013.
Having the possibility of num_write_bios > 1 complicates bio
allocation. So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable bioset and
allocate itself based on its particular needs.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Tue, 5 Sep 2017 23:43:28 +0000 (09:43 +1000)]
dm: ensure bio submission follows a depth-first tree walk
A dm device can, in general, represent a tree of targets, each of which
handles a sub-range of the range of blocks handled by the parent.
The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.
In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.
This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the
canonical approach.
dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bios gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload_{start,end}.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This
delays the handling until the first part is completely processed, so the
deadlock problems do not occur.
__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request(), current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.
When we use bio_clone_bioset() to split off the front part of a bio
and chain the two together and submit the remainder to
generic_make_request(), it is important that the newly allocated
bio is used as the head to be processed immediately, and the original
bio gets "bio_advance()"d and sent to generic_make_request() as the
remainder. Otherwise, if the newly allocated bio is used as the
remainder, and if it then needs to be split again, then the next
bio_clone_bioset() call will be made while holding a reference a bio
(result of the first clone) from the same bioset. This can potentially
exhaust the bioset mempool and result in a memory allocation deadlock.
Note that there is no race caused by reassigning cio.io->bio after already
calling __map_bio(). This bio will only be dereferenced again after
dec_pending() has found io->io_count to be zero, and this cannot happen
before the dec_pending() call at the end of __split_and_process_bio().
To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed by bio-based dm to avoid having excess
rescuer threads. As bio_split bio sets no longer create rescuer
threads, there is little cost and much gain from restoring the
q->bio_split bio set.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Tue, 29 Aug 2017 22:10:18 +0000 (08:10 +1000)]
dm io: remove BIOSET_NEED_RESCUER flag from bios bioset
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.
dm_io() is called from make_request_fn() context. The closest it comes
to multiple allocations is in chunk_io() in dm-snap-persistent. But
there the code uses a separate thread to avoid problems.
So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Tue, 29 Aug 2017 22:10:18 +0000 (08:10 +1000)]
dm crypt: remove BIOSET_NEED_RESCUER flag
The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
do two allocations from the one bioset, and the second one could block
until the first bio completes.
dm-crypt does allocate from this bioset inside the dm make_request_fn,
but does so using GFP_NOWAIT so that the allocation will not block.
So BIOSET_NEED_RESCUER is not needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
NeilBrown [Tue, 21 Nov 2017 13:44:35 +0000 (08:44 -0500)]
dm: fix comment above dm_accept_partial_bio
Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
bios.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Wed, 13 Dec 2017 16:13:21 +0000 (17:13 +0100)]
dm raid: use rs_is_raid*()
Cleanup, no functional change.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Wed, 13 Dec 2017 16:13:20 +0000 (17:13 +0100)]
dm raid: simplify rs_get_progress()
No need to calculate the reshaping progress because
mddev->curr_resync_completed holds it.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Wed, 13 Dec 2017 16:13:19 +0000 (17:13 +0100)]
dm raid: ensure 'a' chars during reshape
During reshape, 'A' chars were reported in status rather than 'a'.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Wed, 13 Dec 2017 16:13:18 +0000 (17:13 +0100)]
dm raid: stop keeping raid set frozen altogether
In order to avoid redoing synchronization/recovery/reshape partially,
the raid set got frozen until after all passed in table line flags had
been cleared. The related table reload sequence had to be precisely
followed, or reshaping may lead to data corruption caused by the active
mapping carrying on with a reshape when the inactive mapping already
had retrieved a stale reshape position.
Harden by retrieving the actual resync/recovery/reshape position
during resume whilst the active table is suspended thus avoiding
to keep the raid set frozen altogether. This prevents superfluous
redoing of an already resynchronized or recovered segment and,
most importantly, potential for redoing of an already reshaped
segment causing data corruption.
Fixes: d39f0010e ("dm raid: fix raid_resume() to keep raid set frozen as needed")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Wed, 13 Dec 2017 16:13:17 +0000 (17:13 +0100)]
dm raid: validate current raid sets redundancy
Verifying the current raid sets redundancy based on retrieved
superblock content has to use the superblock's raid level (e.g. raid0),
not the constructor requested one (e.g. raid10).
Using the requested raid level of raid10 lead to a "divide error"
on raid0 which defines data copies divided by to be zero.
Also check for bogus data copies.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Mon, 4 Dec 2017 15:26:21 +0000 (10:26 -0500)]
dm raid: bump target version to reflect numerous fixes
Also update Documentation accordingly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:56 +0000 (01:03 +0100)]
dm raid: small cleanup and remove unsed "struct raid_set" member
Move raid_resume()'s setting of 'rw' and 'in_sync' to just prior to
mddev_resume().
Also, remove unused 'bitmap_loaded' member from "struct raid_set".
No functional changes.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:55 +0000 (01:03 +0100)]
dm raid: fix rs_get_progress() synchronization state/ratio
Fix various sync state issues causing racy/bogus sync ratio,
sync_action ad health chars in dm_status() info output.
Sync ratio could be N/N (i.e. 100%) shortly after raid set
creation, i.e. creating a new RaidLV or upconverting a linear LV to
raid1 thus:
"0
2097152 raid raid1 2 Aa
2097162/
2097152 recover 0 0 -"
instead of:
"0
2097152 raid raid1 2 Aa 0/
2097152 idle 0 0 -"
Sync action could be non-idle, when the MD thread was done with io.
Health chars could be 'A' when they should be 'a' for a short time
before a resynchonization started.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:54 +0000 (01:03 +0100)]
dm raid: avoid passing array_in_sync variable to raid_status() callees
The raid_status() function passes the bool array_in_sync variable around
providing synchronization state of the MD array. Replace it with a
runtime flag. This will avoid a pattern of having to pass discrete
variables to various functions.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:53 +0000 (01:03 +0100)]
dm raid: display a consistent copy of the MD status via raid_status()
The MD sync thread updates recovery flags providing state of any
running, idle, frozen, recovering, reshaping, ... activity it performs
and updates respective flags asynchronously versus dm processing
raid_status(). To close that race window, take a single copy of the
flags and pass it into its callees.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:52 +0000 (01:03 +0100)]
dm raid: fix raid_resume() to keep raid set frozen as needed
During a reshape request: if userspace reloads a "raid" table multiple
times, resulting in multiple superblock reads, the raid set needs to
stay frozen until all config changes (chunk size, layout data_offset,
delta_disks) have been stored in the superblocks and respective flags
cleared.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:59 +0000 (01:03 +0100)]
dm raid: add component device size checks to avoid runtime failure
Check all component data device sizes versus calculated size.
Reject if device(s) are too small. Otherwise, MD will fail the
operation by accessing beyond the end of the data device.
An example use-case is that growing bitmap won't fit any more and the MD
runtime will report an error when DM raid should catch this earlier.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:51 +0000 (01:03 +0100)]
dm raid: fix raid set size revalidation
The raid set size is being revalidated unconditionally before a
reshaping conversion is started. MD requires the size to only be
reduced in case of a stripe removing (i.e. shrinking) reshape but not
when growing because the raid array has to stay small until after the
growing reshape finishes.
Fix by avoiding the size revalidation in preresume unless a shrinking
reshape is requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:50 +0000 (01:03 +0100)]
dm raid: correct resizing state relative to reshape space in ctr
Pay attention to existing reshape space to define if a raid set needs
resizing. Otherwise we can hit "Can't resize a reshaping raid set"
when a reshape is being requested.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:49 +0000 (01:03 +0100)]
dm raid: consume sizes after md_finish_reshape() completes changing them
The md raid personalities call md_finish_reshape() at the end of a
reshape conversion which adjusts rdev->sectors.
Correct/check rdev->sectors before initiating a reshape and raise the
recovery pointer accordingly.
Otherwise, the DM raid coordinated reshape will fail.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Heinz Mauelshagen [Sat, 2 Dec 2017 00:03:48 +0000 (01:03 +0100)]
dm raid: fix deadlock caused by premature md_stop_writes()
md_stop_writes() is called in raid_presuspend() causing deadlocks on
bios submitted afterwards -- which happens on loaded raid sets with
conversion requests.
Fix by moving md_stop_writes() to raid_postsuspend(). NOTE: when the
recovery's frozen (MD_RECOVERY_FROZEN), writes haven't been started (or
are already stopped) so don't stop them again.
Also remove superfluous readonly setting.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Suren Baghdasaryan [Wed, 6 Dec 2017 17:27:30 +0000 (09:27 -0800)]
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
When system is under memory pressure it is observed that dm bufio
shrinker often reclaims only one buffer per scan. This change fixes
the following two issues in dm bufio shrinker that cause this behavior:
1. ((nr_to_scan - freed) <= retain_target) condition is used to
terminate slab scan process. This assumes that nr_to_scan is equal
to the LRU size, which might not be correct because do_shrink_slab()
in vmscan.c calculates nr_to_scan using multiple inputs.
As a result when nr_to_scan is less than retain_target (64) the scan
will terminate after the first iteration, effectively reclaiming one
buffer per scan and making scans very inefficient. This hurts vmscan
performance especially because mutex is acquired/released every time
dm_bufio_shrink_scan() is called.
New implementation uses ((LRU size - freed) <= retain_target)
condition for scan termination. LRU size can be safely determined
inside __scan() because this function is called after dm_bufio_lock().
2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to
determine number of freeable objects in the slab. However dm_bufio
always retains retain_target buffers in its LRU and will terminate
a scan when this mark is reached. Therefore returning the entire LRU size
from dm_bufio_shrink_count() is misleading because that does not
represent the number of freeable objects that slab will reclaim during
a scan. Returning (LRU size - retain_target) better represents the
number of freeable objects in the slab. This way do_shrink_slab()
returns 0 when (LRU size < retain_target) and vmscan will not try to
scan this shrinker avoiding scans that will not reclaim any memory.
Test: tested using Android device running
<AOSP>/system/extras/alloc-stress that generates memory pressure
and causes intensive shrinker scans
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Fri, 8 Dec 2017 03:42:27 +0000 (22:42 -0500)]
dm mpath: fix bio-based multipath queue_if_no_path handling
Commit
ca5beb76 ("dm mpath: micro-optimize the hot path relative to
MPATHF_QUEUE_IF_NO_PATH") caused bio-based DM-multipath to fail mptest's
"test_02_sdev_delete".
Restoring the logic that existed prior to commit
ca5beb76 fixes this
bio-based DM-multipath regression. Also verified all mptest tests pass
with request-based DM-multipath.
This commit effectively reverts commit
ca5beb76 -- but it does so
without reintroducing the need to take the m->lock spinlock in
must_push_back_{rq,bio}.
Fixes: ca5beb76 ("dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
monty_pavel@sina.com [Fri, 24 Nov 2017 17:43:50 +0000 (01:43 +0800)]
dm: fix various targets to dm_register_target after module __init resources created
A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:
PID: 25992 TASK:
ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
#0 [
ffff883cd743d600] machine_kexec at
ffffffff81038fa9
0000001 [
ffff883cd743d660] crash_kexec at
ffffffff810c5992
0000002 [
ffff883cd743d730] oops_end at
ffffffff81515c90
0000003 [
ffff883cd743d760] no_context at
ffffffff81049f1b
0000004 [
ffff883cd743d7b0] __bad_area_nosemaphore at
ffffffff8104a1a5
0000005 [
ffff883cd743d800] bad_area at
ffffffff8104a2ce
0000006 [
ffff883cd743d830] __do_page_fault at
ffffffff8104aa6f
0000007 [
ffff883cd743d950] do_page_fault at
ffffffff81517bae
0000008 [
ffff883cd743d980] page_fault at
ffffffff81514f95
[exception RIP: kmem_cache_alloc+108]
RIP:
ffffffff8116ef3c RSP:
ffff883cd743da38 RFLAGS:
00010046
RAX:
0000000000000004 RBX:
ffffffff81121b90 RCX:
ffff881bf1e78cc0
RDX:
0000000000000000 RSI:
00000000000000d0 RDI:
0000000000000000
RBP:
ffff883cd743da68 R8:
ffff881bf1a4eb00 R9:
0000000080042000
R10:
0000000000002000 R11:
0000000000000000 R12:
00000000000000d0
R13:
0000000000000000 R14:
00000000000000d0 R15:
0000000000000246
ORIG_RAX:
ffffffffffffffff CS: 0010 SS: 0018
0000009 [
ffff883cd743da70] mempool_alloc_slab at
ffffffff81121ba5
0000010 [
ffff883cd743da80] mempool_create_node at
ffffffff81122083
0000011 [
ffff883cd743dad0] mempool_create at
ffffffff811220f4
0000012 [
ffff883cd743dae0] pool_ctr at
ffffffffa08de049 [dm_thin_pool]
0000013 [
ffff883cd743dbd0] dm_table_add_target at
ffffffffa0005f2f [dm_mod]
0000014 [
ffff883cd743dc30] table_load at
ffffffffa0008ba9 [dm_mod]
0000015 [
ffff883cd743dc90] ctl_ioctl at
ffffffffa0009dc4 [dm_mod]
The race results in a NULL pointer because:
Process A (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target not registered;
c. modprobe dm_thin_pool and wait until end.
Process B (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target registered;
c. table_load->dm_table_add_target->pool_ctr;
d. _new_mapping_cache is NULL and panic.
Note:
1. process A and process B are two concurrent processes.
2. pool_target can be detected by process B but
_new_mapping_cache initialization has not ended.
To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.
Cc: stable@vger.kernel.org
Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mike Snitzer [Sat, 25 Nov 2017 05:27:26 +0000 (00:27 -0500)]
dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
Multiple refcounts are needed if the device was already added. The
micro-optimization of setting the refcount to 1 on first added (rather
than fall thru to a common refcount_inc) lost sight of the fact that the
refcount_inc is also needed for the case when the device already exists
and the mode need not be upgraded.
Fixes: 2a0b4682e0 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Linus Torvalds [Sun, 3 Dec 2017 16:01:47 +0000 (11:01 -0500)]
Linux 4.15-rc2
Linus Torvalds [Sun, 3 Dec 2017 15:51:08 +0000 (10:51 -0500)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"Just one fix this time around, for the late commit in the merge window
that triggered a problem with qemu. Qemu is apparently also going to
receive a fix for the discovered issue"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: avoid faulting on qemu
Linus Torvalds [Sun, 3 Dec 2017 15:48:24 +0000 (10:48 -0500)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Here are two bugfixes for I2C, fixing a memleak in the core and irq
allocation for i801.
Also three bugfixes for the at24 eeprom driver which Bartosz collected
while taking over maintainership for this driver"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
eeprom: at24: check at24_read/write arguments
eeprom: at24: fix reading from 24MAC402/24MAC602
eeprom: at24: correctly set the size for at24mac402
i2c: i2c-boardinfo: fix memory leaks on devinfo
i2c: i801: Fix Failed to allocate irq -
2147483648 error
Linus Torvalds [Sun, 3 Dec 2017 15:46:16 +0000 (10:46 -0500)]
Merge tag 'hwmon-for-linus-v4.15-rc2' of git://git./linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Fixes:
- Drop reference to obsolete maintainer tree
- Fix overflow bug in pmbus driver
- Fix SMBUS timeout problem in jc42 driver
For the SMBUS timeout handling, we had a brief discussion if this
should be considered a bug fix or a feature. Peter says "it fixes real
problems where the application misbehave due to faulty content when
reading from an eeprom", and he needs the patch in his company's v4.14
images. This is good enough for me and warrants backport to stable
kernels"
* tag 'hwmon-for-linus-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (jc42) optionally try to disable the SMBUS timeout
hwmon: (pmbus) Use 64bit math for DIRECT format values
hwmon: Drop reference to Jean's tree
Wolfram Sang [Sat, 2 Dec 2017 22:32:13 +0000 (23:32 +0100)]
Merge tag 'at24-4.15-fixes-for-wolfram' of git://git./linux/kernel/git/brgl/linux into i2c/for-current
Please consider pulling the following fixes for v4.15. While it doesn't
fix any regression introduced in the v4.15 merge window, we have a
feature in at24 since linux v4.8 - reading the mac address block from
at24mac series - which turned out to be not working.
This pull request contains changes that fix it together with a patch
that hardens the read and write argument sanitization with
out-of-bounds checks that were missing.
Linus Torvalds [Sat, 2 Dec 2017 01:04:20 +0000 (20:04 -0500)]
Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These patches fix a problem with compiling using an old version of
gcc, and also fix up error handling in the SUNRPC layer.
- NFSv4: Ensure gcc 4.4.4 can compile initialiser for
"invalid_stateid"
- SUNRPC: Allow connect to return EHOSTUNREACH
- SUNRPC: Handle ENETDOWN errors"
* tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Handle ENETDOWN errors
SUNRPC: Allow connect to return EHOSTUNREACH
NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
Linus Torvalds [Sat, 2 Dec 2017 01:00:19 +0000 (20:00 -0500)]
Merge tag 'xfs-4.15-fixes-4' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are some bug fixes for 4.15-rc2.
- fix memory leaks that appeared after removing ifork inline data
buffer
- recover deferred rmap update log items in correct order
- fix memory leaks when buffer construction fails
- fix memory leaks when bmbt is corrupt
- fix some uninitialized variables and math problems in the quota
scrubber
- add some omitted attribution tags on the log replay commit
- fix some UBSAN complaints about integer overflows with large sparse
files
- implement an effective inode mode check in online fsck
- fix log's inability to retry quota item writeout due to transient
errors"
* tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Properly retry failed dquot items in case of error during buffer writeback
xfs: scrub inode mode properly
xfs: remove unused parameter from xfs_writepage_map
xfs: ubsan fixes
xfs: calculate correct offset in xfs_scrub_quota_item
xfs: fix uninitialized variable in xfs_scrub_quota
xfs: fix leaks on corruption errors in xfs_bmap.c
xfs: fortify xfs_alloc_buftarg error handling
xfs: log recovery should replay deferred ops in order
xfs: always free inline data before resetting inode fork during ifree
Linus Torvalds [Sat, 2 Dec 2017 00:39:12 +0000 (19:39 -0500)]
Merge tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git./linux/kernel/git/palmer/linux
Pull RISC-V cleanups and ABI fixes from Palmer Dabbelt:
"This contains a handful of small cleanups that are a result of
feedback that didn't make it into our original patch set, either
because the feedback hadn't been given yet, I missed the original
emails, or we weren't ready to submit the changes yet.
I've been maintaining the various cleanup patch sets I have as their
own branches, which I then merged together and signed. Each merge
commit has a short summary of the changes, and each branch is based on
your latest tag (4.15-rc1, in this case). If this isn't the right way
to do this then feel free to suggest something else, but it seems sane
to me.
Here's a short summary of the changes, roughly in order of how
interesting they are.
- libgcc.h has been moved from include/lib, where it's the only
member, to include/linux. This is meant to avoid tab completion
conflicts.
- VDSO entries for clock_get/gettimeofday/getcpu have been added.
These are simple syscalls now, but we want to let glibc use them
from the start so we can make them faster later.
- A VDSO entry for instruction cache flushing has been added so
userspace can flush the instruction cache.
- The VDSO symbol versions for __vdso_cmpxchg{32,64} have been
removed, as those VDSO entries don't actually exist.
- __io_writes has been corrected to respect the given type.
- A new READ_ONCE in arch_spin_is_locked().
- __test_and_op_bit_ord() is now actually ordered.
- Various small fixes throughout the tree to enable allmodconfig to
build cleanly.
- Removal of some dead code in our atomic support headers.
- Improvements to various comments in our atomic support headers"
* tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux: (23 commits)
RISC-V: __io_writes should respect the length argument
move libgcc.h to include/linux
RISC-V: Clean up an unused include
RISC-V: Allow userspace to flush the instruction cache
RISC-V: Flush I$ when making a dirty page executable
RISC-V: Add missing include
RISC-V: Use define for get_cycles like other architectures
RISC-V: Provide stub of setup_profiling_timer()
RISC-V: Export some expected symbols for modules
RISC-V: move empty_zero_page definition to C and export it
RISC-V: io.h: type fixes for warnings
RISC-V: use RISCV_{INT,SHORT} instead of {INT,SHORT} for asm macros
RISC-V: use generic serial.h
RISC-V: remove spin_unlock_wait()
RISC-V: `sfence.vma` orderes the instruction cache
RISC-V: Add READ_ONCE in arch_spin_is_locked()
RISC-V: __test_and_op_bit_ord should be strongly ordered
RISC-V: Remove smb_mb__{before,after}_spinlock()
RISC-V: Remove __smp_bp__{before,after}_atomic
RISC-V: Comment on why {,cmp}xchg is ordered how it is
...
Linus Torvalds [Sat, 2 Dec 2017 00:37:03 +0000 (19:37 -0500)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The critical one here is a fix for fpsimd register corruption across
signals which was introduced by the SVE support code (the register
files overlap), but the others are worth having as well.
Summary:
- Fix FP register corruption when SVE is not available or in use
- Fix out-of-tree module build failure when CONFIG_ARM64_MODULE_PLTS=y
- Missing 'const' generating errors with LTO builds
- Remove unsupported events from Cortex-A73 PMU description
- Removal of stale and incorrect comments"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: context: Fix comments and remove pointless smp_wmb()
arm64: cpu_ops: Add missing 'const' qualifiers
arm64: perf: remove unsupported events for Cortex-A73
arm64: fpsimd: Fix failure to restore FPSIMD state after signals
arm64: pgd: Mark pgd_cache as __ro_after_init
arm64: ftrace: emit ftrace-mod.o contents through code
arm64: module-plts: factor out PLT generation code for ftrace
arm64: mm: cleanup stale AIVIVT references
Palmer Dabbelt [Fri, 1 Dec 2017 21:31:31 +0000 (13:31 -0800)]
RISC-V: Fixes for clean allmodconfig build
Olaf said: Here's a short series of patches that produces a working
allmodconfig. Would be nice to see them go in so we can add build
coverage.
I've dropped patches 8 and 10 from the original set:
* [PATCH 08/10] (RISC-V: Set __ARCH_WANT_RENAMEAT to pick up generic
version) has a better fix that I've sent out for review, we don't want
renameat.
* [PATCH 10/10] (input: joystick: riscv has get_cycles) has already been
taken into Dmitry Torokhov's tree.
Palmer Dabbelt [Fri, 1 Dec 2017 21:16:15 +0000 (13:16 -0800)]
move libgcc.h to include/linux
Palmer Dabbelt [Fri, 1 Dec 2017 21:14:36 +0000 (13:14 -0800)]
RISC-V: __io_writes should respect the length argument
Palmer Dabbelt [Fri, 1 Dec 2017 21:12:10 +0000 (13:12 -0800)]
RISC-V: User-Visible Changes
This merge contains the user-visible, ABI-breaking changes that we want
to make sure we have in Linux before our first release. Highlights
include:
* VDSO entries for clock_get/gettimeofday/getcpu have been added. These
are simple syscalls now, but we want to let glibc use them from the
start so we can make them faster later.
* A VDSO entry for instruction cache flushing has been added so
userspace can flush the instruction cache.
* The VDSO symbol versions for __vdso_cmpxchg{32,64} have been removed,
as those VDSO entries don't actually exist.
Conflicts:
arch/riscv/include/asm/tlbflush.h
Palmer Dabbelt [Fri, 1 Dec 2017 21:10:42 +0000 (13:10 -0800)]
RISC-V Atomic Cleanups
This patch set is the result of some feedback that filtered through
after our original patch set was reviewed, some of which was the result
of me missing some email. It contains:
* A new READ_ONCE in arch_spin_is_locked()
* __test_and_op_bit_ord() is now actually ordered
* Improvements to various comments
* Removal of some dead code
Palmer Dabbelt [Tue, 28 Nov 2017 22:15:32 +0000 (14:15 -0800)]
RISC-V: __io_writes should respect the length argument
Whoops -- I must have just been being an idiot again. Thanks to Segher
for finding the bug :).
CC: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Christoph Hellwig [Wed, 22 Nov 2017 10:47:28 +0000 (11:47 +0100)]
move libgcc.h to include/linux
Introducing a new include/lib directory just for this file totally
messes up tab completion for include/linux, which is highly annoying.
Move it to include/linux where we have headers for all kinds of other
lib/ code as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Linus Torvalds [Fri, 1 Dec 2017 13:40:17 +0000 (08:40 -0500)]
Merge tag 'powerpc-4.15-3' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Two fixes for nasty kexec/kdump crashes in certain configurations.
A couple of minor fixes for the new TIDR code.
A fix for an oops in a CXL error handling path.
Thanks to: Andrew Donnellan, Christophe Lombard, David Gibson, Mahesh
Salgaonkar, Vaibhav Jain"
* tag 'powerpc-4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Do not assign thread.tidr if already assigned
powerpc: Avoid signed to unsigned conversion in set_thread_tidr()
powerpc/kexec: Fix kexec/kdump in P9 guest kernels
powerpc/powernv: Fix kexec crashes caused by tlbie tracing
cxl: Check if vphb exists before iterating over AFU devices
Linus Torvalds [Fri, 1 Dec 2017 13:36:27 +0000 (08:36 -0500)]
Merge tag 'afs-fixes-
20171201' of git://git./linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Two fix patches for the AFS filesystem:
- Fix the refcounting on permit caching.
- AFS inode (afs_vnode) fields need resetting after allocation
because they're only initialised when slab pages are obtained from
the page allocator"
* tag 'afs-fixes-
20171201' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Properly reset afs_vnode (inode) fields
afs: Fix permit refcounting
Linus Torvalds [Fri, 1 Dec 2017 13:14:22 +0000 (08:14 -0500)]
Merge tag 'mmc-v4.15-2' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Ensure that debugfs files are removed properly
- Fix missing blk_put_request()
- Deal with errors from blk_get_request()
- Rewind mmc bus suspend operations at failures
- Prepend '0x' to ocr and pre_eol_info in sysfs to identify as hex
MMC host:
- sdhci-msm: Make it optional to wait for signal level changes
- sdhci: Avoid swiotlb buffer being full"
* tag 'mmc-v4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: prepend 0x to OCR entry in sysfs
mmc: core: prepend 0x to pre_eol_info entry in sysfs
mmc: sdhci: Avoid swiotlb buffer being full
mmc: sdhci-msm: Optionally wait for signal level changes
mmc: block: Ensure that debugfs files are removed
mmc: core: Do not leave the block driver in a suspended state
mmc: block: Check return value of blk_get_request()
mmc: block: Fix missing blk_put_request()
Linus Torvalds [Fri, 1 Dec 2017 13:10:09 +0000 (08:10 -0500)]
Merge tag 'drm-fixes-for-v4.15-rc2' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes and cleanups from Dave Airlie:
"The main thing are a bunch of fixes for the new amd display code, a
bunch of smatch fixes.
core:
- Atomic helper regression fix.
- Deferred fbdev fallout regression fix.
amdgpu:
- New display code (dc) dpms, suspend/resume and smatch fixes, along
with some others
- Some regression fixes for amdkfd/radeon.
- Fix a ttm regression for swiotlb disabled
bridge:
- A bunch of fixes for the tc358767 bridge
mali-dp + hdlcd:
- some fixes and internal API catchups.
imx-drm:
-regression fix in atomic code.
omapdrm:
- platform detection regression fixes"
* tag 'drm-fixes-for-v4.15-rc2' of git://people.freedesktop.org/~airlied/linux: (76 commits)
drm/imx: always call wait_for_flip_done in commit_tail
omapdrm: hdmi4_cec: signedness bug in hdmi4_cec_init()
drm: omapdrm: Fix DPI on platforms using the DSI VDDS
omapdrm: hdmi4: Correct the SoC revision matching
drm/omap: displays: panel-dpi: add backlight dependency
drm/omap: Fix error handling path in 'omap_dmm_probe()'
drm/i915: Disable THP until we have a GPU read BW W/A
drm/bridge: tc358767: fix 1-lane behavior
drm/bridge: tc358767: fix AUXDATAn registers access
drm/bridge: tc358767: fix timing calculations
drm/bridge: tc358767: fix DP0_MISC register set
drm/bridge: tc358767: filter out too high modes
drm/bridge: tc358767: do no fail on hi-res displays
drm/bridge: Fix lvds-encoder since the panel_bridge rework.
drm/bridge: synopsys/dw-hdmi: Enable cec clock
drm/bridge: adv7511/33: Fix adv7511_cec_init() failure handling
drm/radeon: remove init of CIK VMIDs 8-16 for amdkfd
drm/ttm: fix populate_and_map() functions once more
drm/fb_helper: Disable all crtc's when initial setup fails.
drm/atomic: make drm_atomic_helper_wait_for_vblanks more agressive
...
Linus Torvalds [Fri, 1 Dec 2017 13:05:45 +0000 (08:05 -0500)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A selection of fixes/changes that should make it into this series.
This contains:
- NVMe, two merges, containing:
- pci-e, rdma, and fc fixes
- Device quirks
- Fix for a badblocks leak in null_blk
- bcache fix from Rui Hua for a race condition regression where
-EINTR was returned to upper layers that didn't expect it.
- Regression fix for blktrace for a bug introduced in this series.
- blktrace cleanup for cgroup id.
- bdi registration error handling.
- Small series with cleanups for blk-wbt.
- Various little fixes for typos and the like.
Nothing earth shattering, most important are the NVMe and bcache fixes"
* 'for-linus' of git://git.kernel.dk/linux-block: (34 commits)
nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()
nvme-rdma: fix memory leak during queue allocation
blktrace: fix trace mutex deadlock
nvme-rdma: Use mr pool
nvme-rdma: Check remotely invalidated rkey matches our expected rkey
nvme-rdma: wait for local invalidation before completing a request
nvme-rdma: don't complete requests before a send work request has completed
nvme-rdma: don't suppress send completions
bcache: check return value of register_shrinker
bcache: recover data from backing when data is clean
bcache: Fix building error on MIPS
bcache: add a comment in journal bucket reading
nvme-fc: don't use bit masks for set/test_bit() numbers
blk-wbt: fix comments typo
blk-wbt: move wbt_clear_stat to common place in wbt_done
blk-sysfs: remove NULL pointer checking in queue_wb_lat_store
blk-wbt: remove duplicated setting in wbt_init
nvme-pci: add quirk for delay before CHK RDY for WDC SN200
block: remove useless assignment in bio_split
null_blk: fix dev->badblocks leak
...
Will Deacon [Thu, 30 Nov 2017 18:25:17 +0000 (18:25 +0000)]
arm64: context: Fix comments and remove pointless smp_wmb()
The comments in the ASID allocator incorrectly hint at an MP-style idiom
using the asid_generation and the active_asids array. In fact, the
synchronisation is achieved using a combination of an xchg operation
and a spinlock, so update the comments and remove the pointless smp_wmb().
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Yury Norov [Wed, 29 Nov 2017 14:03:03 +0000 (17:03 +0300)]
arm64: cpu_ops: Add missing 'const' qualifiers
Building the kernel with an LTO-enabled GCC spits out the following "const"
warning for the cpu_ops code:
mm/percpu.c:2168:20: error: pcpu_fc_names causes a section type conflict
with dt_supported_cpu_ops
const char * const pcpu_fc_names[PCPU_FC_NR] __initconst = {
^
arch/arm64/kernel/cpu_ops.c:34:37: note: ‘dt_supported_cpu_ops’ was declared here
static const struct cpu_operations *dt_supported_cpu_ops[] __initconst = {
Fix it by adding missed const qualifiers.
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Xu YiPing [Wed, 15 Nov 2017 07:39:26 +0000 (15:39 +0800)]
arm64: perf: remove unsupported events for Cortex-A73
bus access read/write events are not supported in A73, based on the
Cortex-A73 TRM r0p2, section 11.9 Events (pages 11-457 to 11-460).
Fixes: 5561b6c5e981 "arm64: perf: add support for Cortex-A73"
Acked-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Xu YiPing <xuyiping@hisilicon.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Dave Martin [Thu, 30 Nov 2017 11:56:37 +0000 (11:56 +0000)]
arm64: fpsimd: Fix failure to restore FPSIMD state after signals
The fpsimd_update_current_state() function is responsible for
loading the FPSIMD state from the user signal frame into the
current task during sigreturn. When implementing support for SVE,
conditional code was added to this function in order to handle the
case where SVE state need to be loaded for the task and merged with
the FPSIMD data from the signal frame; however, the FPSIMD-only
case was unintentionally dropped.
As a result of this, sigreturn does not currently restore the
FPSIMD state of the task, except in the case where the system
supports SVE and the signal frame contains SVE state in addition to
FPSIMD state.
This patch fixes this bug by making the copy-in of the FPSIMD data
from the signal frame to thread_struct unconditional.
This remains a performance regression from v4.14, since the FPSIMD
state is now copied into thread_struct and then loaded back,
instead of _only_ being loaded into the CPU FPSIMD registers.
However, it is essential to call task_fpsimd_load() here anyway in
order to ensure that the SVE enable bit in CPACR_EL1 is set
correctly before returning to userspace. This could use some
refactoring, but since sigreturn is not a fast path I have kept
this patch as a pure fix and left the refactoring for later.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support")
Reported-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Jinbum Park [Wed, 22 Nov 2017 12:43:59 +0000 (21:43 +0900)]
arm64: pgd: Mark pgd_cache as __ro_after_init
pgd_cache is setup once while init stage and never changed after
that, so it is good candidate for __ro_after_init
Signed-off-by: Jinbum Park <jinb.park7@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Ard Biesheuvel [Mon, 20 Nov 2017 17:41:30 +0000 (17:41 +0000)]
arm64: ftrace: emit ftrace-mod.o contents through code
When building the arm64 kernel with both CONFIG_ARM64_MODULE_PLTS and
CONFIG_DYNAMIC_FTRACE enabled, the ftrace-mod.o object file is built
with the kernel and contains a trampoline that is linked into each
module, so that modules can be loaded far away from the kernel and
still reach the ftrace entry point in the core kernel with an ordinary
relative branch, as is emitted by the compiler instrumentation code
dynamic ftrace relies on.
In order to be able to build out of tree modules, this object file
needs to be included into the linux-headers or linux-devel packages,
which is undesirable, as it makes arm64 a special case (although a
precedent does exist for 32-bit PPC).
Given that the trampoline essentially consists of a PLT entry, let's
not bother with a source or object file for it, and simply patch it
in whenever the trampoline is being populated, using the existing
PLT support routines.
Cc: <stable@vger.kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>