Ming Lei [Tue, 30 Apr 2019 01:52:27 +0000 (09:52 +0800)]
blk-mq: always free hctx after request queue is freed
In normal queue cleanup path, hctx is released after request queue
is freed, see blk_mq_release().
However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This way is easy to cause use-after-free,
because: one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive; and one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
finally use-after-free is triggered.
Fixes this issue by always freeing hctx after releasing request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to resuse these hctxs if numa
node is matched.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Apr 2019 01:52:26 +0000 (09:52 +0800)]
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Apr 2019 01:52:25 +0000 (09:52 +0800)]
blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit
45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.
However, that commit introduces another issue. Before
45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
We have invented ways for addressing this kind of issue before, such as:
8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")
But still can't cover all cases, recently James reports another such
kind of issue:
https://marc.info/?l=linux-scsi&m=
155389088124782&w=2
This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.
Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.
This approach follows typical design pattern wrt. kobject's release handler.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Apr 2019 01:52:24 +0000 (09:52 +0800)]
blk-mq: move cancel of requeue_work into blk_mq_release
With holding queue's kobject refcount, it is safe for driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities, then requeue work may not be completed when
freeing queue, and kernel oops is triggered.
So moving the cancel of requeue_work into blk_mq_release() for
avoiding race between requeue and freeing queue.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Apr 2019 01:52:23 +0000 (09:52 +0800)]
blk-mq: grab .q_usage_counter when queuing request from plug code path
Just like aio/io_uring, we need to grab 2 refcount for queuing one
request, one is for submission, another is for completion.
If the request isn't queued from plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theroy, this
refcount should have been released after the sumission(async run queue)
is done. blk_freeze_queue() works with blk_sync_queue() together
for avoiding race between cleanup queue and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.
However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:
blk_mq_flush_plug_list():
blk_mq_sched_insert_requests()
insert requests to sw queue or scheduler queue
blk_mq_run_hw_queue
Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().
Fixes the issue by grab .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
safe because the queue is absolutely alive before inserting request.
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen <martin.petersen@oracle.com>,
Cc: Christoph Hellwig <hch@lst.de>,
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 2 May 2019 22:07:05 +0000 (16:07 -0600)]
Merge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block
Pull NVMe updates from Christoph.
* 'nvme-5.2' of git://git.infradead.org/nvme:
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
nvme-tcp: fix possible null deref on a timed out io queue connect
Raul E Rangel [Thu, 2 May 2019 19:48:11 +0000 (13:48 -0600)]
block: fix function name in comment
The comment was out of date.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Raul E Rangel <rrangel@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sagi Grimberg [Mon, 29 Apr 2019 23:28:19 +0000 (16:28 -0700)]
nvmet: protect discovery change log event list iteration
When we iterate on the discovery subsystem controllers
we need to protect against concurrent mutations to it.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 30 Apr 2019 15:37:43 +0000 (11:37 -0400)]
nvme: mark nvme_core_init and nvme_core_exit static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Christoph Hellwig [Tue, 30 Apr 2019 15:36:52 +0000 (11:36 -0400)]
nvme: move command size checks to the core
Most command aren't PCIe specific, so move the size checking for them
to core.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Minwoo Im [Thu, 11 Apr 2019 15:18:31 +0000 (00:18 +0900)]
nvme-fabrics: check more command sizes
struct common_command provides a common structure for NVMe-oF command
format. It also needs to be checked for unintended size growth.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Minwoo Im [Thu, 11 Apr 2019 15:18:32 +0000 (00:18 +0900)]
nvme-pci: check more command sizes
All the NVMe command has 64bytes fixed size so that it has been assured
with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h
also need to be checked here.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Minwoo Im [Thu, 11 Apr 2019 15:52:39 +0000 (00:52 +0900)]
nvme-pci: remove an unneeded variable initialization
Variable "n" will be assigned once kstrtoint() succeeds, otherwise it
will not be referred because kstrtoint() will return an error which
means go out from this function.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 30 Apr 2019 15:33:41 +0000 (09:33 -0600)]
nvme-pci: unquiesce admin queue on shutdown
Just like IO queues, the admin queue also will not be restarted after a
controller shutdown. Unquiesce this queue so that we do not block
request dispatch on a permanently disabled controller.
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Tue, 30 Apr 2019 15:33:40 +0000 (09:33 -0600)]
nvme-pci: shutdown on timeout during deletion
We do not restart a controller in a deleting state for timeout errors.
When in this state, unblock potential request dispatchers with failed
completions by shutting down the controller on timeout detection.
Reported-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Klaus Birkelund Jensen [Tue, 30 Apr 2019 16:53:29 +0000 (18:53 +0200)]
nvme-pci: fix psdt field for single segment sgls
The shortcut for single segment SGL requests did not set the PSDT field
to mark the request as using SGLs.
Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs")
Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 29 Apr 2019 03:24:42 +0000 (20:24 -0700)]
nvme-multipath: don't print ANA group state by default
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Tue, 30 Apr 2019 16:57:09 +0000 (18:57 +0200)]
nvme-multipath: split bios with the ns_head bio_set before submitting
If the bio is moved to a different queue via blk_steal_bios() and
the original queue is destroyed in nvme_remove_ns() we'll be ending
with a crash in bio_endio() as the mempool for the split bio bvecs
had already been destroyed.
So split the bio using the original queue (which will remain during the
lifetime of the bio) before sending it down to the underlying device.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Mon, 29 Apr 2019 23:25:48 +0000 (16:25 -0700)]
nvme-tcp: fix possible null deref on a timed out io queue connect
If I/O queue connect times out, we might have freed the queue socket
already, so check for that on the error path in nvme_tcp_start_queue.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jens Axboe [Wed, 1 May 2019 12:34:09 +0000 (06:34 -0600)]
bcache: make is_discard_enabled() static
It's not used outside this file.
Fixes: 631207314d88 ("bcache: fix failure in journal relplay")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 17:56:16 +0000 (13:56 -0400)]
block: remove the unused blk_queue_dma_pad function
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 18:42:43 +0000 (14:42 -0400)]
block: add SPDX tags to block layer files missing licensing information
Various block layer files do not have any licensing information at all.
Add SPDX tags for the default kernel GPLv2 license to those.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 18:42:42 +0000 (14:42 -0400)]
block: add a SPDX tag to blk-mq-rdma.h
This file has no copyright notice, but was added as part of a commit
adding another file using the default kernel GPLv2 license. Add
a matching SPDX tag.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 18:42:41 +0000 (14:42 -0400)]
sed-opal.h: remove redundant licence boilerplate
The file already has the correct SPDX header.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 18:42:40 +0000 (14:42 -0400)]
block: switch all files cleared marked as GPLv2 or later to SPDX tags
All these files have some form of the usual GPLv2 or later boilerplate.
Switch them to use SPDX tags instead.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 30 Apr 2019 18:42:39 +0000 (14:42 -0400)]
block: switch all files cleared marked as GPLv2 to SPDX tags
All these files have some form of the usual GPLv2 boilerplate. Switch
them to use SPDX tags instead.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 25 Apr 2019 07:04:35 +0000 (09:04 +0200)]
block: clean up __bio_add_pc_page a bit
Share the bi_size update by moving the done label up, and duplicate
the bv_len update in the two callers to get rid of the bvec_merge
label.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 25 Apr 2019 07:04:34 +0000 (09:04 +0200)]
block: remove bogus comments in __bio_add_pc_page
We are never called with file system pages by defintions for the
passthrough interface, and we also never undo any addition later
these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 25 Apr 2019 07:04:33 +0000 (09:04 +0200)]
block: remove the __bio_add_pc_page export
The same page optimization is a rather odd corner case, which is not
used outside bio.c and which really should not be used outside of bio.c
either - we have better highlevel helpers like the rq/bio mapping
helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 25 Apr 2019 07:03:00 +0000 (09:03 +0200)]
block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 25 Apr 2019 07:02:59 +0000 (09:02 +0200)]
bcache: clean up do_btree_node_write a bit
Use a variable containing the buffer address instead of the to be
removed integer iterator from bio_for_each_segment_all.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Tue, 30 Apr 2019 14:02:25 +0000 (22:02 +0800)]
bcache: remove redundant LIST_HEAD(journal) from run_cache_set()
Commit
95f18c9d1310 ("bcache: avoid potential memleak of list of
journal_replay(s) in the CACHE_SYNC branch of run_cache_set") forgets
to remove the original define of LIST_HEAD(journal), which makes
the change no take effect. This patch removes redundant variable
LIST_HEAD(journal) from run_cache_set(), to make Shenghui's fix
working.
Fixes: 95f18c9d1310 ("bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set")
Reported-by: Juha Aatrokoski <juha.aatrokoski@aalto.fi>
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 26 Apr 2019 16:25:19 +0000 (10:25 -0600)]
Merge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block
Pull NVMe changes from Christoph.
* 'nvme-5.2' of git://git.infradead.org/nvme:
nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
nvme-rdma: fix typo in struct comment
nvme-loop: kill timeout handler
nvme-tcp: rename function to have nvme_tcp prefix
nvme-rdma: fix a NULL deref when an admin connect times out
nvme-tcp: fix a NULL deref when an admin connect times out
nvmet-tcp: don't fail maxr2t greater than 1
nvmet-file: clamp-down file namespace lba_shift
nvmet: include <linux/scatterlist.h>
nvmet: return a specified error it subsys_alloc fails
nvmet: rename nvme_completion instances from rsp to cqe
nvmet-rdma: remove p2p_client initialization from fast-path
Christoph Hellwig [Thu, 25 Apr 2019 06:12:11 +0000 (08:12 +0200)]
mtip32xx: remove trim support
The trim support in mtip32xx has been "temporarily" disabled for 6
years, which is 3/4 of the time the driver even exists in the tree.
Remove it as it obviously is dead code now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sagi Grimberg [Mon, 11 Mar 2019 22:02:25 +0000 (15:02 -0700)]
nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
If our target exposed a namespace with a block size that is greater
than PAGE_SIZE, set 0 capacity on the namespace as we do not support it.
This issue encountered when the nvmet namespace was backed by a tempfile.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Minwoo Im [Wed, 10 Apr 2019 14:48:59 +0000 (23:48 +0900)]
nvme-rdma: fix typo in struct comment
struct nvme_rdma_cm_rej has two different attributes: recfmt and sts.
And sts will have value what this comment wanted to show.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Mon, 15 Apr 2019 01:51:46 +0000 (09:51 +0800)]
nvme-loop: kill timeout handler
Firstly it doesn't make sense to handle timeout for loop: 1) for admin
queue, the request is always completed in code path of queuing IO. 2)
for normal IO request, the timeout on these IOs have been handled by
underlying queue already.
Secondly nvme-loop's timeout handler is simply broken, and easy to
cause issue: 1) no any sync/protection between timeout and normal
completion, and now it is driver's responsibility to deal with
that; 2) bad reset implementation, blk_mq_update_nr_hw_queues()
is called after all NSs's queue is stopped(quiesced), and easy
to trigger deadlock.
So kill the timeout handler.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewd-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 24 Apr 2019 18:53:19 +0000 (11:53 -0700)]
nvme-tcp: rename function to have nvme_tcp prefix
usually nvme_ prefix is for core functions.
While we're cleaning up, remove redundant empty lines
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 24 Apr 2019 18:53:18 +0000 (11:53 -0700)]
nvme-rdma: fix a NULL deref when an admin connect times out
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.
Fixes: 4c174e636674 ("nvme-rdma: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 24 Apr 2019 18:53:17 +0000 (11:53 -0700)]
nvme-tcp: fix a NULL deref when an admin connect times out
If we timeout the admin startup sequence we might not yet have
an I/O tagset allocated which causes the teardown sequence to crash.
Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags
if the tagset wasn't allocated.
Fixes: 39d57757467b ("nvme-tcp: fix timeout handler")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 24 Apr 2019 18:53:16 +0000 (11:53 -0700)]
nvmet-tcp: don't fail maxr2t greater than 1
The host may support it, but nothing prevents us from
sending a single r2t at a time like we do anyways.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 24 Apr 2019 18:43:23 +0000 (11:43 -0700)]
nvmet-file: clamp-down file namespace lba_shift
When the backing file is a tempfile for example, the inode i_blkbits
can be 1M in size which causes problems for hosts to support as the
disk block size. Instead, expose the minimum between i_blkbits and
12 (4K sector size).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by:- Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Enrico Weigelt, metux IT consult [Wed, 24 Apr 2019 10:34:39 +0000 (12:34 +0200)]
nvmet: include <linux/scatterlist.h>
Build breaks:
drivers/nvme/target/core.c: In function 'nvmet_req_alloc_sgl':
drivers/nvme/target/core.c:939:12: error: implicit declaration of \
function 'sgl_alloc'; did you mean 'bio_alloc'? \
[-Werror=implicit-function-declaration]
req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
^~~~~~~~~
bio_alloc
drivers/nvme/target/core.c:939:10: warning: assignment makes pointer \
from integer without a cast [-Wint-conversion]
req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
^
drivers/nvme/target/core.c: In function 'nvmet_req_free_sgl':
drivers/nvme/target/core.c:952:3: error: implicit declaration of \
function 'sgl_free'; did you mean 'ida_free'? [-Werror=implicit-function-declaration]
sgl_free(req->sg);
^~~~~~~~
ida_free
Cause:
1. missing include to <linux/scatterlist.h>
2. SGL_ALLOC needs to be enabled
Therefore adding the missing include, as well as Kconfig dependency.
Signed-off-by: Enrico Weigelt, metux IT consult <info@metux.net>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Minwoo Im [Sun, 7 Apr 2019 06:28:06 +0000 (15:28 +0900)]
nvmet: return a specified error it subsys_alloc fails
nvmet_subsys_alloc() returns its pointer or NULL if it fails. We can
see three different steps in this function:
1. memory allocation
2. argument check
3. memory allocation for string
But now the callers of this function do not seem to handle case 2 by
returning -ENOMEM only even if it fails with an invalid parameter.
This patch specifies error codes so that caller can pass it to its own
caller.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Mon, 8 Apr 2019 15:39:59 +0000 (18:39 +0300)]
nvmet: rename nvme_completion instances from rsp to cqe
Use NVMe namings for improving code readability.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Mon, 8 Apr 2019 15:39:58 +0000 (18:39 +0300)]
nvmet-rdma: remove p2p_client initialization from fast-path
Initialize it during command allocation.
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Shenghui Wang [Wed, 24 Apr 2019 16:48:43 +0000 (00:48 +0800)]
bcache: avoid potential memleak of list of journal_replay(s) in the CACHE_SYNC branch of run_cache_set
In the CACHE_SYNC branch of run_cache_set(), LIST_HEAD(journal) is used
to collect journal_replay(s) and filled by bch_journal_read().
If all goes well, bch_journal_replay() will release the list of
jounal_replay(s) at the end of the branch.
If something goes wrong, code flow will jump to the label "err:" and leave
the list unreleased.
This patch will release the list of journal_replay(s) in the case of
error detected.
v1 -> v2:
* Move the release code to the location after label 'err:' to
simply the change.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shenghui Wang [Wed, 24 Apr 2019 16:48:42 +0000 (00:48 +0800)]
bcache: fix wrong usage use-after-freed on keylist in out_nocoalesce branch of btree_gc_coalesce
Elements of keylist should be accessed before the list is freed.
Move bch_keylist_free() calling after the while loop to avoid wrong
content accessed.
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tang Junhui [Wed, 24 Apr 2019 16:48:41 +0000 (00:48 +0800)]
bcache: fix failure in journal relplay
journal replay failed with messages:
Sep 10 19:10:43 ceph kernel: bcache: error on
bb379a64-e44e-4812-b91d-
a5599871a3b1: bcache: journal entries
2057493-
2057567 missing! (replaying
2057493-
2076601), disabling
caching
The reason is in journal_reclaim(), when discard is enabled, we send
discard command and reclaim those journal buckets whose seq is old
than the last_seq_now, but before we write a journal with last_seq_now,
the machine is restarted, so the journal with the last_seq_now is not
written to the journal bucket, and the last_seq_wrote in the newest
journal is old than last_seq_now which we expect to be, so when we doing
replay, journals from last_seq_wrote to last_seq_now are missing.
It's hard to write a journal immediately after journal_reclaim(),
and it harmless if those missed journal are caused by discarding
since those journals are already wrote to btree node. So, if miss
seqs are started from the beginning journal, we treat it as normal,
and only print a message to show the miss journal, and point out
it maybe caused by discarding.
Patch v2 add a judgement condition to ignore the missed journal
only when discard enabled as Coly suggested.
(Coly Li: rebase the patch with other changes in bch_journal_replay())
Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com>
Tested-by: Dennis Schridde <devurandom@gmx.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:40 +0000 (00:48 +0800)]
bcache: improve bcache_reboot()
This patch tries to release mutex bch_register_lock early, to give
chance to stop cache set and bcache device early.
This patch also expends time out of stopping all bcache device from
2 seconds to 10 seconds, because stopping writeback rate update worker
may delay for 5 seconds, 2 seconds is not enough.
After this patch applied, stopping bcache devices during system reboot
or shutdown is very hard to be observed any more.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:39 +0000 (00:48 +0800)]
bcache: add comments for closure_fn to be called in closure_queue()
Add code comments to explain which call back function might be called
for the closure_queue(). This is an effort to make code to be more
understandable for readers.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:38 +0000 (00:48 +0800)]
bcache: Add comments for blkdev_put() in registration code path
Add comments to explain why in register_bcache() blkdev_put() won't
be called in two location. Add comments to explain why blkdev_put()
must be called in register_cache() when cache_alloc() failed.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:37 +0000 (00:48 +0800)]
bcache: add error check for calling register_bdev()
This patch adds return value to register_bdev(). Then if failure happens
inside register_bdev(), its caller register_bcache() may detect and
handle the failure more properly.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:36 +0000 (00:48 +0800)]
bcache: return error immediately in bch_journal_replay()
When failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in async way is not a good
idea. Because after bch_journal_replay() returns, registering code will
continue to execute following steps, and unregistering code triggered
by cache_set_err_on() is running in same time. First it is unnecessary
to handle failure and unregister cache set in an async way, second there
might be potential race condition to run register and unregister code
for same cache set.
So in this patch, if failure happens in bch_journal_replay(), we don't
call cache_set_err_on(), and just print out the same error message to
kernel message buffer, then return -EIO immediately caller. Then caller
can detect such failure and handle it in synchrnozied way.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:35 +0000 (00:48 +0800)]
bcache: add comments for kobj release callback routine
Bcache has several routines to release resources in implicit way, they
are called when the associated kobj released. This patch adds code
comments to notice when and which release callback will be called,
- When dc->disk.kobj released:
void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released
void bch_cache_release(struct kobject *kobj)
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:34 +0000 (00:48 +0800)]
bcache: add failure check to run_cache_set() for journal replay
Currently run_cache_set() has no return value, if there is failure in
bch_journal_replay(), the caller of run_cache_set() has no idea about
such failure and just continue to execute following code after
run_cache_set(). The internal failure is triggered inside
bch_journal_replay() and being handled in async way. This behavior is
inefficient, while failure handling inside bch_journal_replay(), cache
register code is still running to start the cache set. Registering and
unregistering code running as same time may introduce some rare race
condition, and make the code to be more hard to be understood.
This patch adds return value to run_cache_set(), and returns -EIO if
bch_journal_rreplay() fails. Then caller of run_cache_set() may detect
such failure and stop registering code flow immedidately inside
register_cache_set().
If journal replay fails, run_cache_set() can report error immediately
to register_cache_set(). This patch makes the failure handling for
bch_journal_replay() be in synchronized way, easier to understand and
debug, and avoid poetential race condition for register-and-unregister
in same time.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:33 +0000 (00:48 +0800)]
bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
In journal_reclaim() ja->cur_idx of each cache will be update to
reclaim available journal buckets. Variable 'int n' is used to count how
many cache is successfully reclaimed, then n is set to c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop will write the jset data onto each cache.
The problem is, if all jouranl buckets on each cache is full, the
following code in journal_reclaim(),
529 for_each_cache(ca, c, iter) {
530 struct journal_device *ja = &ca->journal;
531 unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533 /* No space available on this device */
534 if (next == ja->discard_idx)
535 continue;
536
537 ja->cur_idx = next;
538 k->ptr[n++] = MAKE_PTR(0,
539 bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540 ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);
If there is no available bucket to reclaim, the if() condition at line
534 will always true, and n remains 0. Then at line 544, SET_KEY_PTRS()
will set KEY_PTRS field of c->journal.key to 0.
Setting KEY_PTRS field of c->journal.key to 0 is wrong. Because in
journal_write_unlocked() the journal data is written in following loop,
649 for (i = 0; i < KEY_PTRS(k); i++) {
650-671 submit journal data to cache device
672 }
If KEY_PTRS field is set to 0 in jouranl_reclaim(), the journal data
won't be written to cache device here. If system crahed or rebooted
before bkeys of the lost journal entries written into btree nodes, data
corruption will be reported during bcache reload after rebooting the
system.
Indeed there is only one cache in a cache set, there is no need to set
KEY_PTRS field in journal_reclaim() at all. But in order to keep the
for_each_cache() logic consistent for now, this patch fixes the above
problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
available to reclaim.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Coly Li [Wed, 24 Apr 2019 16:48:32 +0000 (00:48 +0800)]
bcache: move definition of 'int ret' out of macro read_bucket()
'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use a 'int ret' variable in bch_journal_read(), this patch moves
definition of 'int ret' from macro read_bucket() to range of function
bch_journal_read().
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liang Chen [Wed, 24 Apr 2019 16:48:31 +0000 (00:48 +0800)]
bcache: fix a race between cache register and cacheset unregister
There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.
This issue can be reproduced as follows,
while true; do echo /dev/XXX> /sys/fs/bcache/register ; done&
while true; do echo 1> /sys/block/XXX/bcache/set/unregister ; done &
and results in the following oops,
[ +0.000053] BUG: unable to handle kernel NULL pointer dereference at
0000000000000998
[ +0.000457] #PF error: [normal kernel read fault]
[ +0.000464] PGD
800000003ca9d067 P4D
800000003ca9d067 PUD
3ca9c067 PMD 0
[ +0.000388] Oops: 0000 [#1] SMP PTI
[ +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[ +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[ +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[ +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[ +0.000839] RSP: 0018:
ffff92ee804cbd88 EFLAGS:
00010202
[ +0.000328] RAX:
ffffffffc010e190 RBX:
ffff918b5c6b5000 RCX:
ffff918b7d8e0000
[ +0.000399] RDX:
ffff918b7d8e0000 RSI:
ffffffffc010e190 RDI:
0000000000000001
[ +0.000398] RBP:
ffff918b7d318340 R08:
0000000000000000 R09:
ffffffffb9bd2d7a
[ +0.000385] R10:
ffff918b7eb253c0 R11:
ffffb95980f51200 R12:
ffffffffc010e1a0
[ +0.000411] R13:
fffffffffffffff2 R14:
000000000000000b R15:
ffff918b7e232620
[ +0.000384] FS:
00007f955bec2740(0000) GS:
ffff918b7eb00000(0000) knlGS:
0000000000000000
[ +0.000420] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ +0.000801] CR2:
0000000000000998 CR3:
000000003cad6000 CR4:
00000000001406e0
[ +0.000837] Call Trace:
[ +0.000682] ? _cond_resched+0x10/0x20
[ +0.000691] ? __kmalloc+0x131/0x1b0
[ +0.000710] kernfs_fop_write+0xfa/0x170
[ +0.000733] __vfs_write+0x2e/0x190
[ +0.000688] ? inode_security+0x10/0x30
[ +0.000698] ? selinux_file_permission+0xd2/0x120
[ +0.000752] ? security_file_permission+0x2b/0x100
[ +0.000753] vfs_write+0xa8/0x1a0
[ +0.000676] ksys_write+0x4d/0xb0
[ +0.000699] do_syscall_64+0x3a/0xf0
[ +0.000692] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
George Spelvin [Wed, 24 Apr 2019 16:48:30 +0000 (00:48 +0800)]
bcache: Clean up bch_get_congested()
There are a few nits in this function. They could in theory all
be separate patches, but that's probably taking small commits
too far.
1) I added a brief comment saying what it does.
2) I like to declare pointer parameters "const" where possible
for documentation reasons.
3) It uses bitmap_weight(&rand, BITS_PER_LONG) to compute the Hamming
weight of a 32-bit random number (giving a random integer with
mean 16 and variance 8). Passing by reference in a 64-bit variable
is silly; just use hweight32().
4) Its helper function fract_exp_two is unnecessarily tangled.
Gcc can optimize the multiply by (1 << x) to a shift, but it can
be written in a much more straightforward way at the cost of one
more bit of internal precision. Some analysis reveals that this
bit is always available.
This shrinks the object code for fract_exp_two(x, 6) from 23 bytes:
0000000000000000 <foo1>:
0: 89 f9 mov %edi,%ecx
2: c1 e9 06 shr $0x6,%ecx
5: b8 01 00 00 00 mov $0x1,%eax
a: d3 e0 shl %cl,%eax
c: 83 e7 3f and $0x3f,%edi
f: d3 e7 shl %cl,%edi
11: c1 ef 06 shr $0x6,%edi
14: 01 f8 add %edi,%eax
16: c3 retq
To 19:
0000000000000017 <foo2>:
17: 89 f8 mov %edi,%eax
19: 83 e0 3f and $0x3f,%eax
1c: 83 c0 40 add $0x40,%eax
1f: 89 f9 mov %edi,%ecx
21: c1 e9 06 shr $0x6,%ecx
24: d3 e0 shl %cl,%eax
26: c1 e8 06 shr $0x6,%eax
29: c3 retq
(Verified with 0 <= frac_bits <= 8, 0 <= x < 16<<frac_bits;
both versions produce the same output.)
5) And finally, the call to bch_get_congested() in check_should_bypass()
is separated from the use of the value by multiple tests which
could moot the need to compute it. Move the computation down to
where it's needed. This also saves a local register to hold the
computed value.
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Geliang Tang [Wed, 24 Apr 2019 16:48:29 +0000 (00:48 +0800)]
bcache: use kmemdup_nul for CACHED_LABEL buffer
This patch uses kmemdup_nul to create a NUL-terminated string from
dc->sb.label. This is better than open coding it.
With this, we can move env[2] initialization into env[] array to make
code more elegant.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Wed, 24 Apr 2019 16:48:28 +0000 (00:48 +0800)]
bcache: avoid clang -Wunintialized warning
clang has identified a code path in which it thinks a
variable may be unused:
drivers/md/bcache/alloc.c:333:4: error: variable 'bucket' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
fifo_pop(&ca->free_inc, bucket);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/md/bcache/util.h:189:6: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^~
drivers/md/bcache/alloc.c:343:46: note: uninitialized use occurs here
allocator_wait(ca, bch_allocator_push(ca, bucket));
^~~~~~
drivers/md/bcache/alloc.c:287:7: note: expanded from macro 'allocator_wait'
if (cond) \
^~~~
drivers/md/bcache/alloc.c:333:4: note: remove the 'if' if its condition is always true
fifo_pop(&ca->free_inc, bucket);
^
drivers/md/bcache/util.h:219:27: note: expanded from macro 'fifo_pop'
#define fifo_pop(fifo, i) fifo_pop_front(fifo, (i))
^
drivers/md/bcache/util.h:189:2: note: expanded from macro 'fifo_pop_front'
if (_r) { \
^
drivers/md/bcache/alloc.c:331:15: note: initialize the variable 'bucket' to silence this warning
long bucket;
^
This cannot happen in practice because we only enter the loop
if there is at least one element in the list.
Slightly rearranging the code makes this clearer to both the
reader and the compiler, which avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoju Fang [Wed, 24 Apr 2019 16:48:27 +0000 (00:48 +0800)]
bcache: fix inaccurate result of unused buckets
To get the amount of unused buckets in sysfs_priority_stats, the code
count the buckets which GC_SECTORS_USED is zero. It's correct and should
not be overwritten by the count of buckets which prio is zero.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoju Fang [Wed, 24 Apr 2019 16:48:26 +0000 (00:48 +0800)]
bcache: fix crashes stopping bcache device before read miss done
The bio from upper layer is considered completed when bio_complete()
returns. In most scenarios bio_complete() is called in search_free(),
but when read miss happens, the bio_compete() is called when backing
device reading completed, while the struct search is still in use until
cache inserting finished.
If someone stops the bcache device just then, the device may be closed
and released, but after cache inserting finished the struct search will
access a freed struct cached_dev.
This patch add the reference of bcache device before bio_complete() when
read miss happens, and put it after the search is not used.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 24 Apr 2019 11:01:46 +0000 (19:01 +0800)]
block: don't run get_page() on pages from non-bvec iov iter
The refcount has been increased for pages retrieved from non-bvec iov iter
via __bio_iov_iter_get_pages(), so don't need to do that again.
Otherwise, IO pages are leaked easily.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Fixes: 7321ecbfc7cf ("block: change how we get page references in bio_iov_iter_get_pages")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 23 Apr 2019 02:51:04 +0000 (10:51 +0800)]
block: clarify that bio_add_page() and related helpers can add multi pages
bio_add_page() and __bio_add_page() are capable of adding pages into
bio, and now we have at least two such usages alreay:
- __bio_iov_bvec_add_pages()
- nvmet_bdev_execute_rw().
So update comments on these two helpers.
The thing is a bit special for __bio_try_merge_page(), given the caller
needs to know if the new added page is same with the last added page,
then it isn't safe to pass multi-page in case that 'same_page' is true,
so adds warning on potential misuse, and updates comment on
__bio_try_merge_page().
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 22 Apr 2019 19:57:24 +0000 (13:57 -0600)]
Merge branch 'md-next' of https://github.com/liu-song-6/linux into for-5.2/block
Pull MD fixes from Song.
* 'md-next' of https://github.com/liu-song-6/linux:
md/raid: raid5 preserve the writeback action after the parity check
Revert "Don't jump to compute_result state from check_result state"
md: return -ENODEV if rdev has no mddev assigned
block: fix use-after-free on gendisk
Weiping Zhang [Tue, 2 Apr 2019 13:14:30 +0000 (21:14 +0800)]
block: don't show io_timeout if driver has no timeout handler
If the low level driver has no timeout handler, the
/sys/block/<disk>/queue/io_timeout will not be displayed.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 19 Apr 2019 06:56:24 +0000 (08:56 +0200)]
block: avoid scatterlist offsets > PAGE_SIZE
While we generally allow scatterlists to have offsets larger than page
size for an entry, and other subsystems like the crypto code make use of
that, the block layer isn't quite ready for that. Flip the switch back
to avoid them for now, and revisit that decision early in a merge window
once the known offenders are fixed.
Fixes: 8a96a0e40810 ("block: rewrite blk_bvec_map_sg to avoid a nth_page call")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hou Tao [Mon, 22 Apr 2019 13:23:21 +0000 (21:23 +0800)]
brd: re-enable __GFP_HIGHMEM in brd_insert_page()
__GFP_HIGHMEM is disabled if dax is enabled on brd, however
dax support for brd has been removed since commit (
7a862fbbdec6
"brd: remove dax support"), so restore __GFP_HIGHMEM in
brd_insert_page().
Also remove the no longer applicable comments about DAX and highmem.
Cc: stable@vger.kernel.org
Fixes: 7a862fbbdec6 ("brd: remove dax support")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yufen Yu [Tue, 2 Apr 2019 12:06:34 +0000 (20:06 +0800)]
block: fix use-after-free on gendisk
commit
2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.
However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:
Process1 Worker Process2
md_free
blkdev_open
del_gendisk
add delete_partition_work_fn() to wq
__blkdev_get
get_gendisk
put_disk
disk_release
kfree(disk)
find part from ext_devt_idr
get_disk_and_module(disk)
cause use after free
delete_partition_work_fn
put_device(part)
part_release
remove part from ext_devt_idr
Before <devt, hd_struct pointer> is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().
We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.
Thanks to Jan Kara for providing the solution and more clear comments
for the code.
Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 22 Apr 2019 15:47:36 +0000 (09:47 -0600)]
Merge tag 'v5.1-rc6' into for-5.2/block
Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
comment, and is trivial. The other one is a conflict due to a later fix
in the bio multi-page work, and needs a bit more care.
* tag 'v5.1-rc6': (770 commits)
Linux 5.1-rc6
block: make sure that bvec length can't be overflow
block: kill all_q_node in request_queue
x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
mm/kmemleak.c: fix unused-function warning
init: initialize jump labels before command line option parsing
kernel/watchdog_hld.c: hard lockup message should end with a newline
kcov: improve CONFIG_ARCH_HAS_KCOV help text
mm: fix inactive list balancing between NUMA nodes and cgroups
mm/hotplug: treat CMA pages as unmovable
proc: fixup proc-pid-vm test
proc: fix map_files test on F29
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
mm: swapoff: shmem_unuse() stop eviction without igrab()
mm: swapoff: take notice of completion sooner
mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
mm: swapoff: shmem_find_swap_entries() filter out other types
slab: store tagged freelist for off-slab slabmgmt
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 21 Apr 2019 17:45:57 +0000 (10:45 -0700)]
Linux 5.1-rc6
Linus Torvalds [Sat, 20 Apr 2019 19:55:23 +0000 (12:55 -0700)]
Merge tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfix from Trond Myklebust:
"Fix a regression in which an RPC call can be tagged with an error
despite the transmission being successful"
* tag 'nfs-for-5.1-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Ignore queue transmission errors on successful transmission
Linus Torvalds [Sat, 20 Apr 2019 19:52:23 +0000 (12:52 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three minor fixes: two obvious ones in drivers and a fix to the SG_IO
path to correctly return status on error"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: aic7xxx: fix EISA support
Revert "scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO"
scsi: core: set result when the command cannot be dispatched
Linus Torvalds [Sat, 20 Apr 2019 19:20:58 +0000 (12:20 -0700)]
Merge tag 'for-linus-
20190420' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A set of small fixes that should go into this series. This contains:
- Removal of unused queue member (Hou)
- Overflow bvec fix (Ming)
- Various little io_uring tweaks (me)
- kthread parking
- Only call cpu_possible() for verified CPU
- Drop unused 'file' argument to io_file_put()
- io_uring_enter vs io_uring_register deadlock fix
- CQ overflow fix
- BFQ internal depth update fix (me)"
* tag 'for-linus-
20190420' of git://git.kernel.dk/linux-block:
block: make sure that bvec length can't be overflow
block: kill all_q_node in request_queue
io_uring: fix CQ overflow condition
io_uring: fix possible deadlock between io_uring_{enter,register}
io_uring: drop io_file_put() 'file' argument
bfq: update internal depth state when queue depth changes
io_uring: only test SQPOLL cpu after we've verified it
io_uring: park SQPOLL thread if it's percpu
Linus Torvalds [Sat, 20 Apr 2019 17:43:37 +0000 (10:43 -0700)]
Merge tag 'i3c/fixes-for-5.1-rc6' of git://git./linux/kernel/git/i3c/linux
Pill i3c fixes from Boris Brezillon:
- fix the random PID check
- fix the disable controller logic in the designware driver
- fix I3C entry in MAINTAINERS
* tag 'i3c/fixes-for-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
MAINTAINERS: Fix the I3C entry
i3c: dw: Fix dw_i3c_master_disable controller by using correct mask
i3c: Fix the verification of random PID
Linus Torvalds [Sat, 20 Apr 2019 17:19:30 +0000 (10:19 -0700)]
Merge tag 'sound-5.1-rc6' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Two core fixes for long-standing bugs for the races at concurrent
device creation and deletion that were (unsurprisingly) spotted by
syzkaller with usb-fuzzer.
The rest are usual small HD-audio fixes"
* tag 'sound-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - add two more pin configuration sets to quirk table
ALSA: core: Fix card races between register and disconnect
ALSA: info: Fix racy addition/deletion of nodes
ALSA: hda: Initialize power_state field properly
Linus Torvalds [Sat, 20 Apr 2019 17:10:49 +0000 (10:10 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Misc clocksource driver fixes, and a sched-clock wrapping fix"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()
clocksource/drivers/timer-ti-dm: Remove omap_dm_timer_set_load_start
clocksource/drivers/oxnas: Fix OX820 compatible
clocksource/drivers/arm_arch_timer: Remove unneeded pr_fmt macro
clocksource/drivers/npcm: select TIMER_OF
Linus Torvalds [Sat, 20 Apr 2019 17:05:02 +0000 (10:05 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes:
- various tooling fixes
- kretprobe fixes
- kprobes annotation fixes
- kprobes error checking fix
- fix the default events for AMD Family 17h CPUs
- PEBS fix
- AUX record fix
- address filtering fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Avoid kretprobe recursion bug
kprobes: Mark ftrace mcount handler functions nokprobe
x86/kprobes: Verify stack frame on kretprobe
perf/x86/amd: Add event map for AMD Family 17h
perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf()
perf tools: Fix map reference counting
perf evlist: Fix side band thread draining
perf tools: Check maps for bpf programs
perf bpf: Return NULL when RB tree lookup fails in perf_env__find_bpf_prog_info()
tools include uapi: Sync sound/asound.h copy
perf top: Always sample time to satisfy needs of use of ordered queuing
perf evsel: Use hweight64() instead of hweight_long(attr.sample_regs_user)
tools lib traceevent: Fix missing equality check for strcmp
perf stat: Disable DIR_FORMAT feature for 'perf stat record'
perf scripts python: export-to-sqlite.py: Fix use of parent_id in calls_view
perf header: Fix lock/unlock imbalances when processing BPF/BTF info
perf/x86: Fix incorrect PEBS_REGS
perf/ring_buffer: Fix AUX record suppression
perf/core: Fix the address filtering fix
kprobes: Fix error check when reusing optimized probes
Linus Torvalds [Sat, 20 Apr 2019 17:01:11 +0000 (10:01 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes all over the place: a console spam fix, section attributes
fixes, a KASLR fix, a TLB stack-variable alignment fix, a reboot
quirk, boot options related warnings fix, an LTO fix, a deadlock fix
and an RDT fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
x86/cpu/bugs: Use __initconst for 'const' init data
x86/mm/KASLR: Fix the size of the direct mapping section
x86/Kconfig: Fix spelling mistake "effectivness" -> "effectiveness"
x86/mm/tlb: Revert "x86/mm: Align TLB invalidation info"
x86/reboot, efi: Use EFI reboot for Acer TravelMate X514-51T
x86/mm: Prevent bogus warnings with "noexec=off"
x86/build/lto: Fix truncated .bss with -fdata-sections
x86/speculation: Prevent deadlock on ssb_state::lock
x86/resctrl: Do not repeat rdtgroup mode initialization
Linus Torvalds [Sat, 20 Apr 2019 16:53:36 +0000 (09:53 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"A deadline scheduler warning/race fix, and a cfs_period_us quota
calculation workaround where the real fix looks too involved to merge
immediately"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Correctly handle active 0-lag timers
sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
Linus Torvalds [Sat, 20 Apr 2019 16:38:01 +0000 (09:38 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"A lockdep warning fix and a script execution fix when atomics are
generated"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/atomics: Don't assume that scripts are executable
locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again
Linus Torvalds [Sat, 20 Apr 2019 01:03:55 +0000 (18:03 -0700)]
Merge branch 'for-5.1-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"A patch to fix a RCU imbalance error in the devices cgroup
configuration error path"
* 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
device_cgroup: fix RCU imbalance in error case
Linus Torvalds [Fri, 19 Apr 2019 22:37:22 +0000 (15:37 -0700)]
Merge branch 'for-5.1-fixes' of git://git./linux/kernel/git/dennis/percpu
Pull percpu fixlet from Dennis Zhou:
"This stops printing the base address of percpu memory on
initialization"
* 'for-5.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu: stop printing kernel addresses
Linus Torvalds [Fri, 19 Apr 2019 19:22:27 +0000 (12:22 -0700)]
Merge tag 'tty-5.1-rc6' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are five small fixes for some tty/serial/vt issues that have been
reported.
The vt one has been around for a while, it is good to finally get that
resolved. The others fix a build warning that showed up in 5.1-rc1,
and resolve a problem in the sh-sci driver.
Note, the second patch for build warning fix for the sc16is7xx driver
was just applied to the tree, as it resolves a problem with the
previous patch to try to solve the issue. It has not shown up in
linux-next yet, unlike all of the other patches, but it has passed
0-day testing and everyone seems to agree that it is correct"
* tag 'tty-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
sc16is7xx: put err_spi and err_i2c into correct #ifdef
vt: fix cursor when clearing the screen
sc16is7xx: move label 'err_spi' to correct section
serial: sh-sci: Fix HSCIF RX sampling point adjustment
serial: sh-sci: Fix HSCIF RX sampling point calculation
Linus Torvalds [Fri, 19 Apr 2019 18:46:51 +0000 (11:46 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
mm/kmemleak.c: fix unused-function warning
init: initialize jump labels before command line option parsing
kernel/watchdog_hld.c: hard lockup message should end with a newline
kcov: improve CONFIG_ARCH_HAS_KCOV help text
mm: fix inactive list balancing between NUMA nodes and cgroups
mm/hotplug: treat CMA pages as unmovable
proc: fixup proc-pid-vm test
proc: fix map_files test on F29
mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
mm: swapoff: shmem_unuse() stop eviction without igrab()
mm: swapoff: take notice of completion sooner
mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
mm: swapoff: shmem_find_swap_entries() filter out other types
slab: store tagged freelist for off-slab slabmgmt
Linus Torvalds [Fri, 19 Apr 2019 18:10:42 +0000 (11:10 -0700)]
Merge tag 'staging-5.1-rc6' of git://git./linux/kernel/git/gregkh/staging
Pull staging and IIO fixes from Greg KH:
"Here is a bunch of IIO driver fixes, and some smaller staging driver
fixes, for 5.1-rc6. The IIO fixes were delayed due to my vacation, but
all resolve a number of reported issues and have been in linux-next
for a few weeks with no reported issues.
The other staging driver fixes are all tiny, resolving some reported
issues in the comedi and most drivers, as well as some erofs fixes.
All of these patches have been in linux-next with no reported issues"
* tag 'staging-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (24 commits)
staging: comedi: ni_usb6501: Fix possible double-free of ->usb_rx_buf
staging: comedi: ni_usb6501: Fix use of uninitialized mutex
staging: erofs: fix unexpected out-of-bound data access
staging: comedi: vmk80xx: Fix possible double-free of ->usb_rx_buf
staging: comedi: vmk80xx: Fix use of uninitialized semaphore
staging: most: core: use device description as name
iio: core: fix a possible circular locking dependency
iio: ad_sigma_delta: select channel when reading register
iio: pms7003: select IIO_TRIGGERED_BUFFER
iio: cros_ec: Fix the maths for gyro scale calculation
iio: adc: xilinx: prevent touching unclocked h/w on remove
iio: adc: xilinx: fix potential use-after-free on probe
iio: adc: xilinx: fix potential use-after-free on remove
iio: dac: mcp4725: add missing powerdown bits in store eeprom
io: accel: kxcjk1013: restore the range after resume.
iio:chemical:bme680: Fix SPI read interface
iio:chemical:bme680: Fix, report temperature in millidegrees
iio: chemical: fix missing Kconfig block for sgp30
iio: adc: at91: disable adc channel interrupt in timeout case
iio: gyro: mpu3050: fix chip ID reading
...
Linus Torvalds [Fri, 19 Apr 2019 18:08:43 +0000 (11:08 -0700)]
Merge tag 'char-misc-5.1-rc6' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are four small misc driver fixes for 5.1-rc6.
Nothing major at all, they fix up a Kconfig issues, a SPDX invalid
license tag, and two tiny bugfixes.
All have been in linux-next for a while with no reported issues"
* tag 'char-misc-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
drivers: power: supply: goldfish_battery: Fix bogus SPDX identifier
extcon: ptn5150: fix COMPILE_TEST dependencies
misc: fastrpc: add checked value for dma_set_mask
habanalabs: remove low credit limit of DMA #0
Ming Lei [Wed, 17 Apr 2019 01:11:26 +0000 (09:11 +0800)]
block: make sure that bvec length can't be overflow
bvec->bv_offset may be bigger than PAGE_SIZE sometimes, such as,
when one bio is splitted in the middle of one bvec via bio_split(),
and bi_iter.bi_bvec_done is used to build offset of the 1st bvec of
remained bio. And the remained bio's bvec may be re-submitted to fs
layer via ITER_IBVEC, such as loop and nvme-loop.
So we have to make sure that every bvec's offset is less than
PAGE_SIZE from bio_for_each_segment_all() because some drivers(loop,
nvme-loop) passes the splitted bvec to fs layer via ITER_BVEC.
This patch fixes this issue reported by Zhang Yi When running nvme/011.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 6dc4f100c175 ("block: allow bio_for_each_segment_all() to iterate over multi-page bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hou Tao [Fri, 19 Apr 2019 02:31:27 +0000 (10:31 +0800)]
block: kill all_q_node in request_queue
all_q_node has not been used since commit
4b855ad37194 ("blk-mq: Create
hctx for each present CPU"), so remove it.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 19 Apr 2019 17:28:27 +0000 (10:28 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- several new key mappings for HID
- a host of new ACPI IDs used to identify Elan touchpads in Lenovo
laptops
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: snvs_pwrkey - initialize necessary driver data before enabling IRQ
HID: input: add mapping for "Toggle Display" key
HID: input: add mapping for "Full Screen" key
HID: input: add mapping for keyboard Brightness Up/Down/Toggle keys
HID: input: add mapping for Expose/Overview key
HID: input: fix mapping of aspect ratio key
[media] doc-rst: switch to new names for Full Screen/Aspect keys
Input: document meanings of KEY_SCREEN and KEY_ZOOM
Input: elan_i2c - add hardware ID for multiple Lenovo laptops
Hans de Goede [Sun, 30 Dec 2018 17:27:15 +0000 (18:27 +0100)]
x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
The "ENERGY_PERF_BIAS: Set to 'normal', was 'performance'" message triggers
on pretty much every Intel machine. The purpose of log messages with
a warning level is to notify the user of something which potentially is
a problem, or at least somewhat unexpected.
This message clearly does not match those criteria, so lower its log
priority from warning to info.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181230172715.17469-1-hdegoede@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Fri, 19 Apr 2019 17:10:47 +0000 (19:10 +0200)]
Merge tag 'perf-urgent-for-mingo-5.1-
20190419' of git://git./linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
perf top:
Jiri Olsa:
- Fix 'perf top --pid', it needs PERF_SAMPLE_TIME since we switched to using
a different thread to sort the events and then even for just a single
thread we now need timestamps.
BPF:
Jiri Olsa:
- Fix bpf_prog and btf lookup functions failure path to to properly return
NULL.
- Fix side band thread draining, used to process PERF_RECORD_BPF_EVENT
metadata records.
core:
Jiri Olsa:
- Fix map lookup by name to get a refcount when the name is already in
the tree. Found
Song Liu:
- Fix __map__is_kmodule() by taking into account recently added BPF
maps.
UAPI:
Arnaldo Carvalho de Melo:
- Sync sound/asound.h copy
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andrea Arcangeli [Fri, 19 Apr 2019 00:50:52 +0000 (17:50 -0700)]
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.
1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes:
86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 19 Apr 2019 00:50:48 +0000 (17:50 -0700)]
mm/kmemleak.c: fix unused-function warning
The only references outside of the #ifdef have been removed, so now we
get a warning in non-SMP configurations:
mm/kmemleak.c:1404:13: error: unused function 'scan_large_block' [-Werror,-Wunused-function]
Add a new #ifdef around it.
Link: http://lkml.kernel.org/r/20190416123148.3502045-1-arnd@arndb.de
Fixes: 298a32b13208 ("kmemleak: powerpc: skip scanning holes in the .bss section")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Fri, 19 Apr 2019 00:50:44 +0000 (17:50 -0700)]
init: initialize jump labels before command line option parsing
When a module option, or core kernel argument, toggles a static-key it
requires jump labels to be initialized early. While x86, PowerPC, and
ARM64 arrange for jump_label_init() to be called before parse_args(),
ARM does not.
Kernel command line: rdinit=/sbin/init page_alloc.shuffle=1 panic=-1 console=ttyAMA0,115200 page_alloc.shuffle=1
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at ./include/linux/jump_label.h:303
page_alloc_shuffle+0x12c/0x1ac
static_key_enable(): static key 'page_alloc_shuffle_key+0x0/0x4' used
before call to jump_label_init()
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted
5.1.0-rc4-next-20190410-00003-g3367c36ce744 #1
Hardware name: ARM Integrator/CP (Device Tree)
[<
c0011c68>] (unwind_backtrace) from [<
c000ec48>] (show_stack+0x10/0x18)
[<
c000ec48>] (show_stack) from [<
c07e9710>] (dump_stack+0x18/0x24)
[<
c07e9710>] (dump_stack) from [<
c001bb1c>] (__warn+0xe0/0x108)
[<
c001bb1c>] (__warn) from [<
c001bb88>] (warn_slowpath_fmt+0x44/0x6c)
[<
c001bb88>] (warn_slowpath_fmt) from [<
c0b0c4a8>]
(page_alloc_shuffle+0x12c/0x1ac)
[<
c0b0c4a8>] (page_alloc_shuffle) from [<
c0b0c550>] (shuffle_store+0x28/0x48)
[<
c0b0c550>] (shuffle_store) from [<
c003e6a0>] (parse_args+0x1f4/0x350)
[<
c003e6a0>] (parse_args) from [<
c0ac3c00>] (start_kernel+0x1c0/0x488)
Move the fallback call to jump_label_init() to occur before
parse_args().
The redundant calls to jump_label_init() in other archs are left intact
in case they have static key toggling use cases that are even earlier
than option parsing.
Link: http://lkml.kernel.org/r/155544804466.1032396.13418949511615676665.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Guenter Roeck <groeck@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Russell King <rmk@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergey Senozhatsky [Fri, 19 Apr 2019 00:50:41 +0000 (17:50 -0700)]
kernel/watchdog_hld.c: hard lockup message should end with a newline
Separate print_modules() and hard lockup error message.
Before the patch:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437
Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark Rutland [Fri, 19 Apr 2019 00:50:37 +0000 (17:50 -0700)]
kcov: improve CONFIG_ARCH_HAS_KCOV help text
The help text for CONFIG_ARCH_HAS_KCOV is stale, and describes the
feature as being enabled only for x86_64, when it is now enabled for
several architectures, including arm, arm64, powerpc, and s390.
Let's remove that stale help text, and update it along the lines of hat
for ARCH_HAS_FORTIFY_SOURCE, better describing when an architecture
should select CONFIG_ARCH_HAS_KCOV.
Link: http://lkml.kernel.org/r/20190412102733.5154-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 19 Apr 2019 00:50:34 +0000 (17:50 -0700)]
mm: fix inactive list balancing between NUMA nodes and cgroups
During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
thrashing on the node that is about to be reclaimed. But when cgroups
are enabled, we suddenly ignore the node scope and use the cgroup scope
only. The result is that pressure bleeds between NUMA nodes depending
on whether cgroups are merely compiled into Linux. This behavioral
difference is unexpected and undesirable.
When the refault adaptivity of the inactive list was first introduced,
there were no statistics at the lruvec level - the intersection of node
and memcg - so it was better than nothing.
But now that we have that infrastructure, use lruvec_page_state() to
make the list balancing decision always NUMA aware.
[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>