Jens Axboe [Thu, 9 Jan 2020 00:59:24 +0000 (17:59 -0700)]
io_uring: add support for IORING_OP_OPENAT2
Add support for the new openat2(2) system call. It's trivial to do, as
we can have openat(2) just be wrapped around it.
Suggested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 9 Jan 2020 00:47:02 +0000 (17:47 -0700)]
io_uring: remove 'fname' from io_open structure
We only use it internally in the prep functions for both statx and
openat, so we don't need it to be persistent across the request.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 9 Jan 2020 00:41:21 +0000 (17:41 -0700)]
io_uring: add 'struct open_how' to the openat request context
We'll need this for openat2(2) support, remove flags and mode from
the existing io_open struct.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 8 Jan 2020 18:04:00 +0000 (11:04 -0700)]
io_uring: enable option to only trigger eventfd for async completions
If an application is using eventfd notifications with poll to know when
new SQEs can be issued, it's expecting the following read/writes to
complete inline. And with that, it knows that there are events available,
and don't want spurious wakeups on the eventfd for those requests.
This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
IORING_REGISTER_EVENTFD, except it only triggers notifications for events
that happen from async completions (IRQ, or io-wq worker completions).
Any completions inline from the submission itself will not trigger
notifications.
Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 8 Jan 2020 18:01:46 +0000 (11:01 -0700)]
io_uring: change io_ring_ctx bool fields into bit fields
In preparation for adding another one, which would make us spill into
another long (and hence bump the size of the ctx), change them to
bit fields.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 8 Jan 2020 15:26:07 +0000 (08:26 -0700)]
io_uring: file set registration should use interruptible waits
If an application attempts to register a set with unbounded requests
pending, we can be stuck here forever if they don't complete. We can
make this wait interruptible, and just abort if we get signaled.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
YueHaibing [Tue, 7 Jan 2020 14:22:44 +0000 (22:22 +0800)]
io_uring: Remove unnecessary null check
Null check kfree is redundant, so remove it.
This is detected by coccinelle.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 5 Jan 2020 03:19:44 +0000 (20:19 -0700)]
io_uring: add support for send(2) and recv(2)
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
recv(2) support.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 30 Dec 2019 18:24:47 +0000 (21:24 +0300)]
io_uring: remove extra io_wq_current_is_worker()
io_wq workers use io_issue_sqe() to forward sqes and never
io_queue_sqe(). Remove extra check for io_wq_current_is_worker()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 30 Dec 2019 18:24:46 +0000 (21:24 +0300)]
io_uring: optimise commit_sqring() for common case
It should be pretty rare to not submitting anything when there is
something in the ring. No need to keep heuristics for this case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 30 Dec 2019 18:24:45 +0000 (21:24 +0300)]
io_uring: optimise head checks in io_get_sqring()
A user may ask to submit more than there is in the ring, and then
io_uring will submit as much as it can. However, in the last iteration
it will allocate an io_kiocb and immediately free it. It could do
better and adjust @to_submit to what is in the ring.
And since the ring's head is already checked here, there is no need to
do it in the loop, spamming with smp_load_acquire()'s barriers
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 30 Dec 2019 18:24:44 +0000 (21:24 +0300)]
io_uring: clamp to_submit in io_submit_sqes()
Make io_submit_sqes() to clamp @to_submit itself. It removes duplicated
code and prepares for following changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 28 Dec 2019 22:39:54 +0000 (15:39 -0700)]
io_uring: add support for IORING_SETUP_CLAMP
Some applications like to start small in terms of ring size, and then
ramp up as needed. This is a bit tricky to do currently, since we don't
advertise the max ring size.
This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
size exceed what we support, then clamp them at the max values instead
of returning -EINVAL. Since we return the chosen ring sizes after setup,
no further changes are needed on the application side. io_uring already
changes the ring sizes if the application doesn't ask for power-of-two
sizes, for example.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 28 Dec 2019 19:11:08 +0000 (12:11 -0700)]
io_uring: extend batch freeing to cover more cases
Currently we only batch free if fixed files are used, no links, no aux
data, etc. This extends the batch freeing to only exclude the linked
case and fallback case, and make io_free_req_many() handle the other
cases just fine.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 28 Dec 2019 17:48:22 +0000 (10:48 -0700)]
io_uring: wrap multi-req freeing in struct req_batch
This cleans up the code a bit, and it allows us to build on top of the
multi-req freeing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 28 Dec 2019 11:13:03 +0000 (14:13 +0300)]
io_uring: batch getting pcpu references
percpu_ref_tryget() has its own overhead. Instead getting a reference
for each request, grab a bunch once per io_submit_sqes().
~5% throughput boost for a "submit and wait 128 nops" benchmark.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
__io_req_free_empty() -> __io_req_do_free()
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 28 Dec 2019 11:13:02 +0000 (14:13 +0300)]
pcpu_ref: add percpu_ref_tryget_many()
Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs specified number of refs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 26 Dec 2019 05:18:28 +0000 (22:18 -0700)]
io_uring: add IORING_OP_MADVISE
This adds support for doing madvise(2) through io_uring. We assume that
any operation can block, and hence punt everything async. This could be
improved, but hard to make bullet proof. The async punt ensures it's
safe.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 26 Dec 2019 05:14:54 +0000 (22:14 -0700)]
mm: make do_madvise() available internally
This is in preparation for enabling this functionality through io_uring.
Add a helper that is just exporting what sys_madvise() does, and have the
system call use it.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 26 Dec 2019 05:03:45 +0000 (22:03 -0700)]
io_uring: add IORING_OP_FADVISE
This adds support for doing fadvise through io_uring. We assume that
WILLNEED doesn't block, but that DONTNEED may block.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 25 Dec 2019 23:33:42 +0000 (16:33 -0700)]
io_uring: allow use of offset == -1 to mean file position
This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
update) the current file position. This obviously comes with the caveat
that if the application has multiple read/writes in flight, then the
end result will not be as expected. This is similar to threads sharing
a file descriptor and doing IO using the current file position.
Since this feature isn't easily detectable by doing a read or write,
add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to
detect presence of this feature.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 22 Dec 2019 22:19:35 +0000 (15:19 -0700)]
io_uring: add non-vectored read/write commands
For uses cases that don't already naturally have an iovec, it's easier
(or more convenient) to just use a buffer address + length. This is
particular true if the use case is from languages that want to create
a memory safe abstraction on top of io_uring, and where introducing
the need for the iovec may impose an ownership issue. For those cases,
they currently need an indirection buffer, which means allocating data
just for this purpose.
Add basic read/write that don't require the iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 19 Dec 2019 19:06:02 +0000 (12:06 -0700)]
io_uring: improve poll completion performance
For busy IORING_OP_POLL_ADD workloads, we can have enough contention
on the completion lock that we fail the inline completion path quite
often as we fail the trylock on that lock. Add a list for deferred
completions that we can use in that case. This helps reduce the number
of async offloads we have to do, as if we get multiple completions in
a row, we'll piggy back on to the poll_llist instead of having to queue
our own offload.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 19 Dec 2019 00:12:20 +0000 (17:12 -0700)]
io_uring: split overflow state into SQ and CQ side
We currently check ->cq_overflow_list from both SQ and CQ context, which
causes some bouncing of that cache line. Add separate bits of state for
this instead, so that the SQ side can check using its own state, and
likewise for the CQ side.
This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
with the CQ state. If we hit an overflow condition, both of these bits
are set. Likewise for overflow flush clear, we clear both bits. For the
fast path of just checking if there's an overflow condition on either
the SQ or CQ side, we can use our own private bit for this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 18 Dec 2019 16:50:26 +0000 (09:50 -0700)]
io_uring: add lookup table for various opcode needs
We currently have various switch statements that check if an opcode needs
a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
a struct io_op_def that holds all of this information, so we have just
one spot to update when opcodes are added.
This also enables us to NOT allocate req->io if a deferred command
doesn't need it, and corrects some mistakes we had in terms of what
commands need mm context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 18 Dec 2019 15:54:21 +0000 (08:54 -0700)]
io_uring: remove two unnecessary function declarations
__io_free_req() and io_double_put_req() aren't used before they are
defined, so we can kill these two forwards.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 17 Dec 2019 19:26:58 +0000 (22:26 +0300)]
io_uring: move *queue_link_head() from common path
Move io_queue_link_head() to links handling code in io_submit_sqe(),
so it wouldn't need extra checks and would have better data locality.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 16 Dec 2019 23:22:07 +0000 (02:22 +0300)]
io_uring: rename prev to head
Calling "prev" a head of a link is a bit misleading. Rename it
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 17 Dec 2019 15:04:44 +0000 (08:04 -0700)]
io_uring: add IOSQE_ASYNC
io_uring defaults to always doing inline submissions, if at all
possible. But for larger copies, even if the data is fully cached, that
can take a long time. Add an IOSQE_ASYNC flag that the application can
set on the SQE - if set, it'll ensure that we always go async for those
kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
get the concurrency we desire for this case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 17 Dec 2019 15:46:33 +0000 (08:46 -0700)]
io-wq: support concurrent non-blocking work
io-wq assumes that work will complete fast (and not block), so it
doesn't create a new worker when work is enqueued, if we already have
at least one worker running. This is done on the assumption that if work
is running, then it will complete fast.
Add an option to force io-wq to fork a new worker for work queued. This
is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
case, io-wq will create a new worker, even though workers are already
running.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 14 Dec 2019 04:18:10 +0000 (21:18 -0700)]
io_uring: add support for IORING_OP_STATX
This provides support for async statx(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 14 Dec 2019 20:26:33 +0000 (13:26 -0700)]
fs: make two stat prep helpers available
To implement an async stat, we need to provide the flags mapping and
the statx user copy. Make them available internally, through
fs/internal.h.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 9 Dec 2019 18:22:50 +0000 (11:22 -0700)]
io_uring: avoid ring quiesce for fixed file set unregister and update
We currently fully quiesce the ring before an unregister or update of
the fixed fileset. This is very expensive, and we can be a bit smarter
about this.
Add a percpu refcount for the file tables as a whole. Grab a percpu ref
when we use a registered file, and put it on completion. This is cheap
to do. Upon removal of a file from a set, switch the ref count to atomic
mode. When we hit zero ref on the completion side, then we know we can
drop the previously registered files. When the old files have been
dropped, switch the ref back to percpu mode for normal operation.
Since there's a period between doing the update and the kernel being
done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
same action. The application knows the update has completed when it gets
the CQE for it. Between doing the update and receiving this completion,
the application must continue to use the unregistered fd if submitting
IO on this particular file.
This takes the runtime of test/file-register from liburing from 14s to
about 0.7s.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 11 Dec 2019 21:02:38 +0000 (14:02 -0700)]
io_uring: add support for IORING_OP_CLOSE
This works just like close(2), unsurprisingly. We remove the file
descriptor and post the completion inline, then offload the actual
(potential) last file put to async context.
Mark the async part of this work as uncancellable, as we really must
guarantee that the latter part of the close is run.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 12 Dec 2019 02:29:43 +0000 (19:29 -0700)]
io-wq: add support for uncancellable work
Not all work can be cancelled, some of it we may need to guarantee
that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
on work that must not be cancelled. Note that the caller work function
must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
IO_WQ_WORK_CANCEL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 11 Dec 2019 21:10:35 +0000 (14:10 -0700)]
fs: move filp_close() outside of __close_fd_get_file()
Just one caller of this, and just use filp_close() there manually.
This is important to allow async close/removal of the fd.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 11 Dec 2019 18:20:36 +0000 (11:20 -0700)]
io_uring: add support for IORING_OP_OPENAT
This works just like openat(2), except it can be performed async. For
the normal case of a non-blocking path lookup this will complete
inline. If we have to do IO to perform the open, it'll be done from
async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 13 Dec 2019 18:10:11 +0000 (11:10 -0700)]
fs: make build_open_flags() available internally
This is a prep patch for supporting non-blocking open from io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 10 Dec 2019 17:38:56 +0000 (10:38 -0700)]
io_uring: add support for fallocate()
This exposes fallocate(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 21 Jan 2020 00:01:17 +0000 (17:01 -0700)]
Merge branch 'io_uring-5.5' into for-5.6/io_uring-vfs
Pull in compatability fix for the files_update command.
* io_uring-5.5:
io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
Eugene Syromiatnikov [Wed, 15 Jan 2020 16:35:38 +0000 (17:35 +0100)]
io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
fds field of struct io_uring_files_update is problematic with regards
to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
and 64-bit user space. In order to avoid custom handling of compat in
the syscall implementation, make fds __u64 and use u64_to_user_ptr in
order to retrieve it. Also, align the field naturally and check that
no garbage is passed there.
Fixes: c3a31e605620c279 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 20 Jan 2020 02:47:04 +0000 (19:47 -0700)]
Merge branch 'work.openat2' of git://git./linux/kernel/git/viro/vfs into for-5.6/io_uring-vfs
Pull in Al's openat2 branch, since we'll need that for the openat2
support.
* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Documentation: path-lookup: include new LOOKUP flags
selftests: add openat2(2) selftests
open: introduce openat2(2) syscall
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: allow set_root() to produce errors
namei: allow nd_jump_link() to produce errors
nsfs: clean-up ns_get_path() signature to return int
namei: only return -ECHILD from follow_dotdot_rcu()
Linus Torvalds [Mon, 20 Jan 2020 00:02:49 +0000 (16:02 -0800)]
Linux 5.5-rc7
Linus Torvalds [Sun, 19 Jan 2020 20:10:28 +0000 (12:10 -0800)]
Merge tag 'riscv/for-v5.5-rc7' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Three fixes for RISC-V:
- Don't free and reuse memory containing the code that CPUs parked at
boot reside in.
- Fix rv64 build problems for ubsan and some modules by adding
logical and arithmetic shift helpers for 128-bit values. These are
from libgcc and are similar to what's present for ARM64.
- Fix vDSO builds to clean up their own temporary files"
* tag 'riscv/for-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Less inefficient gcc tishift helpers (and export their symbols)
riscv: delete temporary files
riscv: make sure the cores stay looping in .Lsecondary_park
Linus Torvalds [Sun, 19 Jan 2020 20:03:53 +0000 (12:03 -0800)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
1) Fix non-blocking connect() in x25, from Martin Schiller.
2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.
3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.
4) Limit size of TSO packets properly in lan78xx driver, from Eric
Dumazet.
5) r8152 probe needs an endpoint sanity check, from Johan Hovold.
6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from
John Fastabend.
7) hns3 needs short frames padded on transmit, from Yunsheng Lin.
8) Fix netfilter ICMP header corruption, from Eyal Birger.
9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.
10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.
11) Fix memory leak in act_ctinfo, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
cxgb4: reject overlapped queues in TC-MQPRIO offload
cxgb4: fix Tx multi channel port rate limit
net: sched: act_ctinfo: fix memory leak
bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
bnxt_en: Fix ipv6 RFS filter matching logic.
bnxt_en: Fix NTUPLE firmware command failures.
net: systemport: Fixed queue mapping in internal ring map
net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
net: hns: fix soft lockup when there is not enough memory
net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
net/sched: act_ife: initalize ife->metalist earlier
netfilter: nat: fix ICMP header corruption on ICMP errors
net: wan: lapbether.c: Use built-in RCU list checking
netfilter: nf_tables: fix flowtable list del corruption
netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
netfilter: nft_tunnel: ERSPAN_VERSION must not be null
netfilter: nft_tunnel: fix null-attribute check
...
Linus Torvalds [Sun, 19 Jan 2020 20:02:06 +0000 (12:02 -0800)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two runtime PM fixes and one leak fix"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: iop3xx: Fix memory leak in probe error path
i2c: tegra: Properly disable runtime PM on driver's probe error
i2c: tegra: Fix suspending in active runtime PM state
Rahul Lakkireddy [Fri, 17 Jan 2020 12:51:47 +0000 (18:21 +0530)]
cxgb4: reject overlapped queues in TC-MQPRIO offload
A queue can't belong to multiple traffic classes. So, reject
any such configuration that results in overlapped queues for a
traffic class.
Fixes: b1396c2bd675 ("cxgb4: parse and configure TC-MQPRIO offload")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 17 Jan 2020 12:53:55 +0000 (18:23 +0530)]
cxgb4: fix Tx multi channel port rate limit
T6 can support 2 egress traffic management channels per port to
double the total number of traffic classes that can be configured.
In this configuration, if the class belongs to the other channel,
then all the queues must be bound again explicitly to the new class,
for the rate limit parameters on the other channel to take effect.
So, always explicitly bind all queues to the port rate limit traffic
class, regardless of the traffic management channel that it belongs
to. Also, only bind queues to port rate limit traffic class, if all
the queues don't already belong to an existing different traffic
class.
Fixes: 4ec4762d8ec6 ("cxgb4: add TC-MATCHALL classifier egress offload")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 19 Jan 2020 04:45:06 +0000 (20:45 -0800)]
net: sched: act_ctinfo: fix memory leak
Implement a cleanup method to properly free ci->params
BUG: memory leak
unreferenced object 0xffff88811746e2c0 (size 64):
comm "syz-executor617", pid 7106, jiffies
4294943055 (age 14.250s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
c0 34 60 84 ff ff ff ff 00 00 00 00 00 00 00 00 .4`.............
backtrace:
[<
0000000015aa236f>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
[<
0000000015aa236f>] slab_post_alloc_hook mm/slab.h:586 [inline]
[<
0000000015aa236f>] slab_alloc mm/slab.c:3320 [inline]
[<
0000000015aa236f>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
[<
000000002c946bd1>] kmalloc include/linux/slab.h:556 [inline]
[<
000000002c946bd1>] kzalloc include/linux/slab.h:670 [inline]
[<
000000002c946bd1>] tcf_ctinfo_init+0x21a/0x530 net/sched/act_ctinfo.c:236
[<
0000000086952cca>] tcf_action_init_1+0x400/0x5b0 net/sched/act_api.c:944
[<
000000005ab29bf8>] tcf_action_init+0x135/0x1c0 net/sched/act_api.c:1000
[<
00000000392f56f9>] tcf_action_add+0x9a/0x200 net/sched/act_api.c:1410
[<
0000000088f3c5dd>] tc_ctl_action+0x14d/0x1bb net/sched/act_api.c:1465
[<
000000006b39d986>] rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
[<
00000000fd6ecace>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
[<
0000000047493d02>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
[<
00000000bdcf8286>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
[<
00000000bdcf8286>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
[<
00000000fc5b92d9>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
[<
00000000da84d076>] sock_sendmsg_nosec net/socket.c:639 [inline]
[<
00000000da84d076>] sock_sendmsg+0x54/0x70 net/socket.c:659
[<
0000000042fb2eee>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
[<
000000008f23f67e>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
[<
00000000d838e4f6>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
[<
00000000289a9cb1>] __do_sys_sendmsg net/socket.c:2426 [inline]
[<
00000000289a9cb1>] __se_sys_sendmsg net/socket.c:2424 [inline]
[<
00000000289a9cb1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424
Fixes: 24ec483cec98 ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Kevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Kevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Olof Johansson [Tue, 17 Dec 2019 04:06:31 +0000 (20:06 -0800)]
riscv: Less inefficient gcc tishift helpers (and export their symbols)
The existing __lshrti3 was really inefficient, and the other two helpers
are also needed to compile some modules.
Add the missing versions, and export all of the symbols like arm64
already does.
This code is based on the assembly generated by libgcc builds.
This fixes a build break triggered by ubsan:
riscv64-unknown-linux-gnu-ld: lib/ubsan.o: in function `.L2':
ubsan.c:(.text.unlikely+0x38): undefined reference to `__ashlti3'
riscv64-unknown-linux-gnu-ld: ubsan.c:(.text.unlikely+0x42): undefined reference to `__ashrti3'
Signed-off-by: Olof Johansson <olof@lixom.net>
[paul.walmsley@sifive.com: use SYM_FUNC_{START,END} instead of
ENTRY/ENDPROC; note libgcc origin]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Linus Torvalds [Sun, 19 Jan 2020 00:34:17 +0000 (16:34 -0800)]
Merge tag 'mtd/fixes-for-5.5-rc7' of git://git./linux/kernel/git/mtd/linux
Pull MTD fixes from Miquel Raynal:
"Raw NAND:
- GPMI: Fix the suspend/resume
SPI-NOR:
- Fix quad enable on Spansion like flashes
- Fix selection of 4-byte addressing opcodes on Spansion"
* tag 'mtd/fixes-for-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: gpmi: Restore nfc timing setup after suspend/resume
mtd: rawnand: gpmi: Fix suspend/resume problem
mtd: spi-nor: Fix quad enable for Spansion like flashes
mtd: spi-nor: Fix selection of 4-byte addressing opcodes on Spansion
Linus Torvalds [Sat, 18 Jan 2020 21:57:31 +0000 (13:57 -0800)]
Merge tag 'drm-fixes-2020-01-19' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Back from LCA2020, fixes wasn't too busy last week, seems to have
quieten down appropriately, some amdgpu, i915, then a core mst fix and
one fix for virtio-gpu and one for rockchip:
core mst:
- serialize down messages and clear timeslots are on unplug
amdgpu:
- Update golden settings for renoir
- eDP fix
i915:
- uAPI fix: Remove dash and colon from PMU names to comply with
tools/perf
- Fix for include file that was indirectly included
- Two fixes to make sure VMA are marked active for error capture
virtio:
- maintain obj reservation lock when submitting cmds
rockchip:
- increase link rate var size to accommodate rates"
* tag 'drm-fixes-2020-01-19' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Reorder detect_edp_sink_caps before link settings read.
drm/amdgpu: update goldensetting for renoir
drm/dp_mst: Have DP_Tx send one msg at a time
drm/dp_mst: clear time slots for ports invalid
drm/i915/pmu: Do not use colons or dashes in PMU names
drm/rockchip: fix integer type used for storing dp data rate
drm/i915/gt: Mark ring->vma as active while pinned
drm/i915/gt: Mark context->state vma as active while pinned
drm/i915/gt: Skip trying to unbind in restore_ggtt_mappings
drm/i915: Add missing include file <linux/math64.h>
drm/virtio: add missing virtio_gpu_array_lock_resv call
Ilie Halip [Wed, 15 Jan 2020 11:32:42 +0000 (13:32 +0200)]
riscv: delete temporary files
Temporary files used in the VDSO build process linger on even after make
mrproper: vdso-dummy.o.tmp, vdso.so.dbg.tmp.
Delete them once they're no longer needed.
Signed-off-by: Ilie Halip <ilie.halip@gmail.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Linus Torvalds [Sat, 18 Jan 2020 21:02:12 +0000 (13:02 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- a resctrl fix for uninitialized objects found by debugobjects
- a resctrl memory leak fix
- fix the unintended re-enabling of the of SME and SEV CPU flags if
memory encryption was disabled at bootup via the MSR space"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Ensure clearing of SME/SEV features is maintained
x86/resctrl: Fix potential memory leak
x86/resctrl: Fix an imbalance in domain_remove_cpu()
Linus Torvalds [Sat, 18 Jan 2020 21:00:59 +0000 (13:00 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Three fixes: fix link failure on Alpha, fix a Sparse warning and
annotate/robustify a lockless access in the NOHZ code"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/sched: Annotate lockless access to last_jiffies_update
lib/vdso: Make __cvdso_clock_getres() static
time/posix-stubs: Provide compat itimer supoprt for alpha
Linus Torvalds [Sat, 18 Jan 2020 20:57:41 +0000 (12:57 -0800)]
Merge branch 'smp-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull cpu/SMT fix from Ingo Molnar:
"Fix a build bug on CONFIG_HOTPLUG_SMT=y && !CONFIG_SYSFS kernels"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/SMT: Fix x86 link error without CONFIG_SYSFS
Linus Torvalds [Sat, 18 Jan 2020 20:56:36 +0000 (12:56 -0800)]
Merge branch 'ras-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 RAS fix from Ingo Molnar:
"Fix a thermal throttling race that can result in easy to trigger boot
crashes on certain Ice Lake platforms"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/therm_throt: Do not access uninitialized therm_work
Linus Torvalds [Sat, 18 Jan 2020 20:55:19 +0000 (12:55 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Tooling fixes, three Intel uncore driver fixes, plus an AUX events fix
uncovered by the perf fuzzer"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Remove PCIe3 unit for SNR
perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events
perf/x86/intel/uncore: Add PCI ID of IMC for Xeon E3 V5 Family
perf: Correctly handle failed perf_get_aux_event()
perf hists: Fix variable name's inconsistency in hists__for_each() macro
perf map: Set kmap->kmaps backpointer for main kernel map chunks
perf report: Fix incorrectly added dimensions as switch perf data file
tools lib traceevent: Fix memory leakage in filter_event
Linus Torvalds [Sat, 18 Jan 2020 20:53:28 +0000 (12:53 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"Three fixes:
- Fix an rwsem spin-on-owner crash, introduced in v5.4
- Fix a lockdep bug when running out of stack_trace entries,
introduced in v5.4
- Docbook fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN
futex: Fix kernel-doc notation warning
locking/lockdep: Fix buffer overrun problem in stack_trace[]
Linus Torvalds [Sat, 18 Jan 2020 20:52:18 +0000 (12:52 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Fix a recent regression in the Ingenic SoCs irqchip driver that floods
the syslog"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ingenic: Get rid of the legacy IRQ domain
Linus Torvalds [Sat, 18 Jan 2020 20:50:14 +0000 (12:50 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Three EFI fixes:
- Fix a slow-boot-scrolling regression but making sure we use WC for
EFI earlycon framebuffer mappings on x86
- Fix a mixed EFI mode boot crash
- Disable paging explicitly before entering startup_32() in mixed
mode bootup"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efistub: Disable paging at mixed mode entry
efi/libstub/random: Initialize pointer variables to zero for mixed mode
efi/earlycon: Fix write-combine mapping on x86
Linus Torvalds [Sat, 18 Jan 2020 20:29:13 +0000 (12:29 -0800)]
Merge branch 'core-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull rseq fixes from Ingo Molnar:
"Two rseq bugfixes:
- CLONE_VM !CLONE_THREAD didn't work properly, the kernel would end
up corrupting the TLS of the parent. Technically a change in the
ABI but the previous behavior couldn't resonably have been relied
on by applications so this looks like a valid exception to the ABI
rule.
- Make the RSEQ_FLAG_UNREGISTER ABI behavior consistent with the
handling of other flags. This is not thought to impact any
applications either"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Unregister rseq for clone CLONE_VM
rseq: Reject unknown flags on rseq unregister
Linus Torvalds [Sat, 18 Jan 2020 20:23:31 +0000 (12:23 -0800)]
Merge tag 'for-linus-2020-01-18' of git://git./linux/kernel/git/brauner/linux
Pull thread fixes from Christian Brauner:
"Here is an urgent fix for ptrace_may_access() permission checking.
Commit
69f594a38967 ("ptrace: do not audit capability check when
outputing /proc/pid/stat") introduced the ability to opt out of audit
messages for accesses to various proc files since they are not
violations of policy.
While doing so it switched the check from ns_capable() to
has_ns_capability{_noaudit}(). That means it switched from checking
the subjective credentials (ktask->cred) of the task to using the
objective credentials (ktask->real_cred). This is appears to be wrong.
ptrace_has_cap() is currently only used in ptrace_may_access() And is
used to check whether the calling task (subject) has the
CAP_SYS_PTRACE capability in the provided user namespace to operate on
the target task (object). According to the cred.h comments this means
the subjective credentials of the calling task need to be used.
With this fix we switch ptrace_has_cap() to use security_capable() and
thus back to using the subjective credentials.
As one example where this might be particularly problematic, Jann
pointed out that in combination with the upcoming IORING_OP_OPENAT{2}
feature, this bug might allow unprivileged users to bypass the
capability checks while asynchronously opening files like /proc/*/mem,
because the capability checks for this would be performed against
kernel credentials.
To illustrate on the former point about this being exploitable: When
io_uring creates a new context it records the subjective credentials
of the caller. Later on, when it starts to do work it creates a kernel
thread and registers a callback. The callback runs with kernel creds
for ktask->real_cred and ktask->cred.
To prevent this from becoming a full-blown 0-day io_uring will call
override_cred() and override ktask->cred with the subjective
credentials of the creator of the io_uring instance. With
ptrace_has_cap() currently looking at ktask->real_cred this override
will be ineffective and the caller will be able to open arbitray proc
files as mentioned above.
Luckily, this is currently not exploitable but would be so once
IORING_OP_OPENAT{2} land in v5.6. Let's fix it now.
To minimize potential regressions I successfully ran the criu
testsuite. criu makes heavy use of ptrace() and extensively hits
ptrace_may_access() codepaths and has a good change of detecting any
regressions.
Additionally, I succesfully ran the ptrace and seccomp kernel tests"
* tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
Linus Torvalds [Sat, 18 Jan 2020 20:18:55 +0000 (12:18 -0800)]
Merge tag 's390-5.5-5' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix printing misleading Secure-IPL enabled message when it is not.
- Fix a race condition between host ap bus and guest ap bus doing
device reset in crypto code.
- Fix sanity check in CCA cipher key function (CCA AES cipher key
support), which fails otherwise.
* tag 's390-5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/setup: Fix secure ipl message
s390/zcrypt: move ap device reset from bus to driver code
s390/zcrypt: Fix CCA cipher key gen with clear key value function
Linus Torvalds [Sat, 18 Jan 2020 20:12:36 +0000 (12:12 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes in drivers with no impact to core code.
The mptfusion fix is enormous because the driver API had to be
rethreaded to pass down the necessary iocp pointer, but once that's
done a significant chunk of code is deleted.
The other two patches are small"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mptfusion: Fix double fetch bug in ioctl
scsi: storvsc: Correctly set number of hardware queues for IDE disk
scsi: fnic: fix invalid stack access
Linus Torvalds [Sat, 18 Jan 2020 20:08:57 +0000 (12:08 -0800)]
Merge tag 'char-misc-5.5-rc7' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are some small fixes for 5.5-rc7
Included here are:
- two lkdtm fixes
- coresight build fix
- Documentation update for the hw process document
All of these have been in linux-next with no reported issues"
* tag 'char-misc-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Documentation/process: Add Amazon contact for embargoed hardware issues
lkdtm/bugs: fix build error in lkdtm_UNSET_SMEP
lkdtm/bugs: Make double-fault test always available
coresight: etm4x: Fix unused function warning
Linus Torvalds [Sat, 18 Jan 2020 20:06:09 +0000 (12:06 -0800)]
Merge tag 'staging-5.5-rc7' of git://git./linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some small staging and iio driver fixes for 5.5-rc7
All of them are for some small reported issues. Nothing major, full
details in the shortlog.
All have been in linux-next with no reported issues"
* tag 'staging-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: comedi: ni_routes: allow partial routing information
staging: comedi: ni_routes: fix null dereference in ni_find_route_source()
iio: light: vcnl4000: Fix scale for vcnl4040
iio: buffer: align the size of scan bytes to size of the largest element
iio: chemical: pms7003: fix unmet triggered buffer dependency
iio: imu: st_lsm6dsx: Fix selection of ST_LSM6DS3_ID
iio: adc: ad7124: Fix DT channel configuration
Linus Torvalds [Sat, 18 Jan 2020 20:02:33 +0000 (12:02 -0800)]
Merge tag 'usb-5.5-rc7' of git://git./linux/kernel/git/gregkh/usb
Pull USB driver fixes from Greg KH:
"Here are some small USB driver and core fixes for 5.5-rc7
There's one fix for hub wakeup issues and a number of small usb-serial
driver fixes and device id updates.
The hub fix has been in linux-next for a while with no reported
issues, and the usb-serial ones have all passed 0-day with no
problems"
* tag 'usb-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: serial: quatech2: handle unbound ports
USB: serial: keyspan: handle unbound ports
USB: serial: io_edgeport: add missing active-port sanity check
USB: serial: io_edgeport: handle unbound ports on URB completion
USB: serial: ch341: handle unbound port at reset_resume
USB: serial: suppress driver bind attributes
USB: serial: option: add support for Quectel RM500Q in QDL mode
usb: core: hub: Improved device recognition on remote wakeup
USB: serial: opticon: fix control-message timeouts
USB: serial: option: Add support for Quectel RM500Q
USB: serial: simple: Add Motorola Solutions TETRA MTP3xxx and MTP85xx
Aleksa Sarai [Fri, 6 Dec 2019 14:13:38 +0000 (01:13 +1100)]
Documentation: path-lookup: include new LOOKUP flags
Now that we have new LOOKUP flags, we should document them in the
relevant path-walking documentation. And now that we've settled on a
common name for nd_jump_link() style symlinks ("magic links"), use that
term where magic-link semantics are described.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Aleksa Sarai [Sat, 18 Jan 2020 12:08:00 +0000 (23:08 +1100)]
selftests: add openat2(2) selftests
Test all of the various openat2(2) flags. A small stress-test of a
symlink-rename attack is included to show that the protections against
".."-based attacks are sufficient.
The main things these self-tests are enforcing are:
* The struct+usize ABI for openat2(2) and copy_struct_from_user() to
ensure that upgrades will be handled gracefully (in addition,
ensuring that misaligned structures are also handled correctly).
* The -EINVAL checks for openat2(2) are all correctly handled to avoid
userspace passing unknown or conflicting flag sets (most
importantly, ensuring that invalid flag combinations are checked).
* All of the RESOLVE_* semantics (including errno values) are
correctly handled with various combinations of paths and flags.
* RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
attack that has been responsible for several CVEs (and likely will
be responsible for several more).
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Aleksa Sarai [Sat, 18 Jan 2020 12:07:59 +0000 (23:07 +1100)]
open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).
Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.
In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.
Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).
/* Syscall Prototype. */
/*
* open_how is an extensible structure (similar in interface to
* clone3(2) or sched_setattr(2)). The size parameter must be set to
* sizeof(struct open_how), to allow for future extensions. All future
* extensions will be appended to open_how, with their zero value
* acting as a no-op default.
*/
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname,
struct open_how *how, size_t size);
/* Description. */
The initial version of 'struct open_how' contains the following fields:
flags
Used to specify openat(2)-style flags. However, any unknown flag
bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
will result in -EINVAL. In addition, this field is 64-bits wide to
allow for more O_ flags than currently permitted with openat(2).
mode
The file mode for O_CREAT or O_TMPFILE.
Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve
Restrict path resolution (in contrast to O_* flags they affect all
path components). The current set of flags are as follows (at the
moment, all of the RESOLVE_ flags are implemented as just passing
the corresponding LOOKUP_ flag).
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.
Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).
After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.
/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.
In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).
/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).
Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.
Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit
629e014bb834 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/
20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
David S. Miller [Sat, 18 Jan 2020 13:38:30 +0000 (14:38 +0100)]
Merge branch 'bnxt_en-fixes'
Michael Chan says:
====================
bnxt_en: Bug fixes.
3 small bug fix patches. The 1st two are aRFS fixes and the last one
fixes a fatal driver load failure on some kernels without PCIe
extended config space support enabled.
Please also queue these for -stable. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 17 Jan 2020 05:32:47 +0000 (00:32 -0500)]
bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
DSN read can fail, for example on a kdump kernel without PCIe extended
config space support. If DSN read fails, don't set the
BNXT_FLAG_DSN_VALID flag and continue loading. Check the flag
to see if the stored DSN is valid before using it. Only VF reps
creation should fail without valid DSN.
Fixes: 03213a996531 ("bnxt: move bp->switch_id initialization to PF probe")
Reported-by: Marc Smith <msmith626@gmail.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 17 Jan 2020 05:32:46 +0000 (00:32 -0500)]
bnxt_en: Fix ipv6 RFS filter matching logic.
Fix bnxt_fltr_match() to match ipv6 source and destination addresses.
The function currently only checks ipv4 addresses and will not work
corrently on ipv6 filters.
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Fri, 17 Jan 2020 05:32:45 +0000 (00:32 -0500)]
bnxt_en: Fix NTUPLE firmware command failures.
The NTUPLE related firmware commands are sent to the wrong firmware
channel, causing all these commands to fail on new firmware that
supports the new firmware channel. Fix it by excluding the 3
NTUPLE firmware commands from the list for the new firmware channel.
Fixes: 760b6d33410c ("bnxt_en: Add support for 2nd firmware message channel.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christian Brauner [Wed, 15 Jan 2020 13:42:34 +0000 (14:42 +0100)]
ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
Commit
69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
introduced the ability to opt out of audit messages for accesses to various
proc files since they are not violations of policy. While doing so it
somehow switched the check from ns_capable() to
has_ns_capability{_noaudit}(). That means it switched from checking the
subjective credentials of the task to using the objective credentials. This
is wrong since. ptrace_has_cap() is currently only used in
ptrace_may_access() And is used to check whether the calling task (subject)
has the CAP_SYS_PTRACE capability in the provided user namespace to operate
on the target task (object). According to the cred.h comments this would
mean the subjective credentials of the calling task need to be used.
This switches ptrace_has_cap() to use security_capable(). Because we only
call ptrace_has_cap() in ptrace_may_access() and in there we already have a
stable reference to the calling task's creds under rcu_read_lock() there's
no need to go through another series of dereferences and rcu locking done
in ns_capable{_noaudit}().
As one example where this might be particularly problematic, Jann pointed
out that in combination with the upcoming IORING_OP_OPENAT feature, this
bug might allow unprivileged users to bypass the capability checks while
asynchronously opening files like /proc/*/mem, because the capability
checks for this would be performed against kernel credentials.
To illustrate on the former point about this being exploitable: When
io_uring creates a new context it records the subjective credentials of the
caller. Later on, when it starts to do work it creates a kernel thread and
registers a callback. The callback runs with kernel creds for
ktask->real_cred and ktask->cred. To prevent this from becoming a
full-blown 0-day io_uring will call override_cred() and override
ktask->cred with the subjective credentials of the creator of the io_uring
instance. With ptrace_has_cap() currently looking at ktask->real_cred this
override will be ineffective and the caller will be able to open arbitray
proc files as mentioned above.
Luckily, this is currently not exploitable but will turn into a 0-day once
IORING_OP_OPENAT{2} land in v5.6. Fix it now!
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Jann Horn <jannh@google.com>
Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Dave Airlie [Sat, 18 Jan 2020 02:54:10 +0000 (12:54 +1000)]
Merge tag 'drm-misc-fixes-2020-01-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
virtio: maintain obj reservation lock when submitting cmds (Gerd)
rockchip: increase link rate var size to accommodate rates (Tobias)
mst: serialize down messages and clear timeslots are on unplug (Wayne)
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Tobias Schramm <t.schramm@manjaro.org>
Cc: Wayne Lin <Wayne.Lin@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116162856.GA11524@art_vandelay
Dave Airlie [Sat, 18 Jan 2020 02:53:53 +0000 (12:53 +1000)]
Merge tag 'drm-intel-fixes-2020-01-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- uAPI fix: Remove dash and colon from PMU names to comply with tools/perf
- Fix for include file that was indirectly included
- Two fixes to make sure VMA are marked active for error capture
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116161419.GA13594@jlahtine-desk.ger.corp.intel.com
Esben Haabendal [Fri, 17 Jan 2020 20:05:37 +0000 (21:05 +0100)]
mtd: rawnand: gpmi: Restore nfc timing setup after suspend/resume
As we reset the GPMI block at resume, the timing parameters setup by a
previous exec_op is lost. Rewriting GPMI timing registers on first exec_op
after resume fixes the problem.
Fixes: ef347c0cfd61 ("mtd: rawnand: gpmi: Implement exec_op")
Cc: stable@vger.kernel.org
Signed-off-by: Esben Haabendal <esben@geanix.com>
Acked-by: Han Xu <han.xu@nxp.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Esben Haabendal [Fri, 17 Jan 2020 20:05:36 +0000 (21:05 +0100)]
mtd: rawnand: gpmi: Fix suspend/resume problem
On system resume, the gpmi clock must be enabled before accessing gpmi
block. Without this, resume causes something like
[ 661.348790] gpmi_reset_block(
5cbb0f7e): module reset timeout
[ 661.348889] gpmi-nand
1806000.gpmi-nand: Error setting GPMI : -110
[ 661.348928] PM: dpm_run_callback(): platform_pm_resume+0x0/0x44 returns -110
[ 661.348961] PM: Device
1806000.gpmi-nand failed to resume: error -110
Fixes: ef347c0cfd61 ("mtd: rawnand: gpmi: Implement exec_op")
Cc: stable@vger.kernel.org
Signed-off-by: Esben Haabendal <esben@geanix.com>
Acked-by: Han Xu <han.xu@nxp.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Michael Walle [Thu, 16 Jan 2020 09:37:00 +0000 (10:37 +0100)]
mtd: spi-nor: Fix quad enable for Spansion like flashes
The commit
7b678c69c0ca ("mtd: spi-nor: Merge spansion Quad Enable
methods") forgot to actually set the QE bit in some cases. Thus this
breaks quad mode accesses to flashes which support readback of the
status register-2. Fix it.
Fixes: 7b678c69c0ca ("mtd: spi-nor: Merge spansion Quad Enable methods")
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Vignesh Raghavendra [Wed, 8 Jan 2020 05:13:43 +0000 (10:43 +0530)]
mtd: spi-nor: Fix selection of 4-byte addressing opcodes on Spansion
mtd->size is still unassigned when running spansion_post_sfdp_fixups()
hook, therefore use nor->params.size to determine the size of flash device.
This makes sure that 4-byte addressing opcodes are used on Spansion
flashes that are larger than 16MiB and don't have SFDP 4BAIT table
populated.
Fixes: 92094ebc385e ("mtd: spi-nor: Add spansion_post_sfdp_fixups()")
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Linus Torvalds [Fri, 17 Jan 2020 19:25:45 +0000 (11:25 -0800)]
Merge tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block
Pull io_uring fixes form Jens Axboe:
- Ensure ->result is always set when IO is retried (Bijan)
- In conjunction with the above, fix a regression in polled IO issue
when retried (me/Bijan)
- Don't setup async context for read/write fixed, otherwise we may
wrongly map the iovec on retry (me)
- Cancel io-wq work if we fail getting mm reference (me)
- Ensure dependent work is always initialized correctly (me)
- Only allow original task to submit IO, don't allow it from a passed
ring fd (me)
* tag 'io_uring-5.5-2020-01-16' of git://git.kernel.dk/linux-block:
io_uring: only allow submit from owning task
io_uring: ensure workqueue offload grabs ring mutex for poll list
io_uring: clear req->result always before issuing a read/write request
io_uring: be consistent in assigning next work from handler
io-wq: cancel work if we fail getting a mm reference
io_uring: don't setup async context for read/write fixed
Linus Torvalds [Fri, 17 Jan 2020 19:21:05 +0000 (11:21 -0800)]
Merge tag 'for-5.5-rc6-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that have been in the works during last twp weeks.
All have a user visible effect and are stable material:
- scrub: properly update progress after calling cancel ioctl, calling
'resume' would start from the beginning otherwise
- fix subvolume reference removal, after moving out of the original
path the reference is not recognized and will lead to transaction
abort
- fix reloc root lifetime checks, could lead to crashes when there's
subvolume cleaning running in parallel
- fix memory leak when quotas get disabled in the middle of extent
accounting
- fix transaction abort in case of balance being started on degraded
mount on eg. RAID1"
* tag 'for-5.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check rw_devices, not num_devices for balance
Btrfs: always copy scrub arguments back to user space
btrfs: relocation: fix reloc_root lifespan and access
btrfs: fix memory leak in qgroup accounting
btrfs: do not delete mismatched root refs
btrfs: fix invalid removal of root ref
btrfs: rework arguments of btrfs_unlink_subvol
Greg Kroah-Hartman [Fri, 17 Jan 2020 18:40:06 +0000 (19:40 +0100)]
Merge tag 'usb-serial-5.5-rc7' of https://git./linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for 5.5-rc7
Here are a few fixes for issues related to unbound port devices which
could lead to NULL-pointer dereferences. Notably the bind attributes for
usb-serial (port) drivers are removed as almost none of the drivers can
handle individual ports going away once they've been bound.
Included are also some new device ids.
All but the unbound-port fixes have been in linux-next with no reported
issues.
Signed-off-by: Johan Hovold <johan@kernel.org>
* tag 'usb-serial-5.5-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: quatech2: handle unbound ports
USB: serial: keyspan: handle unbound ports
USB: serial: io_edgeport: add missing active-port sanity check
USB: serial: io_edgeport: handle unbound ports on URB completion
USB: serial: ch341: handle unbound port at reset_resume
USB: serial: suppress driver bind attributes
USB: serial: option: add support for Quectel RM500Q in QDL mode
USB: serial: opticon: fix control-message timeouts
USB: serial: option: Add support for Quectel RM500Q
USB: serial: simple: Add Motorola Solutions TETRA MTP3xxx and MTP85xx
Linus Torvalds [Fri, 17 Jan 2020 16:42:02 +0000 (08:42 -0800)]
Merge tag 'fuse-fixes-5.5-rc7' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix a regression in the last release affecting the ftp module of the
gvfs filesystem"
* tag 'fuse-fixes-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix fuse_send_readpages() in the syncronous read case
Linus Torvalds [Fri, 17 Jan 2020 16:38:35 +0000 (08:38 -0800)]
Merge tag 'sound-5.5-rc7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became bigger than I have hoped for rc7. But, the only large LOC
is for stm32 fixes that are simple rewriting of register access
helpers, while the rest are all nice and small fixes:
- A few ASoC fixes for the remaining probe error handling bugs
- ALSA sequencer core fix for racy proc file accesses
- Revert the option rename of snd-hda-intel to make compatible again
- Various device-specific fixes"
* tag 'sound-5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: seq: Fix racy access for queue timer in proc read
ALSA: usb-audio: fix sync-ep altsetting sanity check
ASoC: msm8916-wcd-digital: Reset RX interpolation path after use
ASoC: msm8916-wcd-analog: Fix MIC BIAS Internal1
ASoC: cros_ec_codec: Make the device acpi compatible
ASoC: sti: fix possible sleep-in-atomic
ASoC: msm8916-wcd-analog: Fix selected events for MIC BIAS External1
ASoC: hdac_hda: Fix error in driver removal after failed probe
ASoC: SOF: Intel: fix HDA codec driver probe with multiple controllers
ASoC: SOF: Intel: lower print level to dbg if we will reinit DSP
ALSA: dice: fix fallback from protocol extension into limited functionality
ALSA: firewire-tascam: fix corruption due to spin lock without restoration in SoftIRQ context
ALSA: hda: Rename back to dmic_detect option
ASoC: stm32: dfsdm: fix 16 bits record
ASoC: stm32: sai: fix possible circular locking
ASoC: Fix NULL dereference at freeing
ASoC: Intel: bytcht_es8316: Fix Irbis NB41 netbook quirk
ASoC: rt5640: Fix NULL dereference on module unload
Johan Hovold [Fri, 17 Jan 2020 14:35:26 +0000 (15:35 +0100)]
USB: serial: quatech2: handle unbound ports
Check for NULL port data in the modem- and line-status handlers to avoid
dereferencing a NULL pointer in the unlikely case where a port device
isn't bound to a driver (e.g. after an allocation failure on port
probe).
Note that the other (stubbed) event handlers qt2_process_xmit_empty()
and qt2_process_flush() would need similar sanity checks in case they
are ever implemented.
Fixes: f7a33e608d9a ("USB: serial: add quatech2 usb to serial driver")
Cc: stable <stable@vger.kernel.org> # 3.5
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Johan Hovold [Fri, 17 Jan 2020 09:50:25 +0000 (10:50 +0100)]
USB: serial: keyspan: handle unbound ports
Check for NULL port data in the control URB completion handlers to avoid
dereferencing a NULL pointer in the unlikely case where a port device
isn't bound to a driver (e.g. after an allocation failure on port
probe()).
Fixes: 0ca1268e109a ("USB Serial Keyspan: add support for USA-49WG & USA-28XG")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Johan Hovold [Fri, 17 Jan 2020 09:50:24 +0000 (10:50 +0100)]
USB: serial: io_edgeport: add missing active-port sanity check
The driver receives the active port number from the device, but never
made sure that the port number was valid. This could lead to a
NULL-pointer dereference or memory corruption in case a device sends
data for an invalid port.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Johan Hovold [Fri, 17 Jan 2020 09:50:23 +0000 (10:50 +0100)]
USB: serial: io_edgeport: handle unbound ports on URB completion
Check for NULL port data in the shared interrupt and bulk completion
callbacks to avoid dereferencing a NULL pointer in case a device sends
data for a port device which isn't bound to a driver (e.g. due to a
malicious device having unexpected endpoints or after an allocation
failure on port probe).
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Johan Hovold [Fri, 17 Jan 2020 09:50:22 +0000 (10:50 +0100)]
USB: serial: ch341: handle unbound port at reset_resume
Check for NULL port data in reset_resume() to avoid dereferencing a NULL
pointer in case the port device isn't bound to a driver (e.g. after a
failed control request at port probe).
Fixes: 1ded7ea47b88 ("USB: ch341 serial: fix port number changed after resume")
Cc: stable <stable@vger.kernel.org> # 2.6.30
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Josef Bacik [Fri, 10 Jan 2020 16:11:24 +0000 (11:11 -0500)]
btrfs: check rw_devices, not num_devices for balance
The fstest btrfs/154 reports
[ 8675.381709] BTRFS: Transaction aborted (error -28)
[ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935
[ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs]
[ 8675.402672] RSP: 0018:
ffffb2090888fb00 EFLAGS:
00010286
[ 8675.404413] RAX:
0000000000000000 RBX:
ffff92026dfa91c8 RCX:
0000000000000001
[ 8675.406609] RDX:
0000000000000000 RSI:
ffffffff8e100899 RDI:
ffffffff8e100971
[ 8675.408775] RBP:
ffff920247c61660 R08:
0000000000000000 R09:
0000000000000000
[ 8675.410978] R10:
0000000000000000 R11:
0000000000000000 R12:
00000000ffffffe4
[ 8675.412647] R13:
ffff92026db74000 R14:
ffff920247c616b8 R15:
ffff92026dfbc000
[ 8675.413994] FS:
00007fd5e57248c0(0000) GS:
ffff92027d800000(0000) knlGS:
0000000000000000
[ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 8675.417833] CR2:
0000564aa51682d8 CR3:
000000006dcbc004 CR4:
0000000000160ee0
[ 8675.419801] Call Trace:
[ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs]
[ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs]
[ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs]
[ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs]
[ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0
[ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs]
[ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs]
[ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0
[ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400
[ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0
[ 8675.436618] ? _raw_spin_unlock+0x1f/0x30
[ 8675.438093] ? __handle_mm_fault+0x499/0x740
[ 8675.439619] ? do_vfs_ioctl+0x56e/0x770
[ 8675.441034] do_vfs_ioctl+0x56e/0x770
[ 8675.442411] ksys_ioctl+0x3a/0x70
[ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8675.445333] __x64_sys_ioctl+0x16/0x20
[ 8675.446705] do_syscall_64+0x50/0x210
[ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left
We now use btrfs_can_overcommit() to see if we can flip a block group
read only. Before this would fail because we weren't taking into
account the usable un-allocated space for allocating chunks. With my
patches we were allowed to do the balance, which is technically correct.
The test is trying to start balance on degraded mount. So now we're
trying to allocate a chunk and cannot because we want to allocate a
RAID1 chunk, but there's only 1 device that's available for usage. This
results in an ENOSPC.
But we shouldn't even be making it this far, we don't have enough
devices to restripe. The problem is we're using btrfs_num_devices(),
that also includes missing devices. That's not actually what we want, we
need to use rw_devices.
The chunk_mutex is not needed here, rw_devices changes only in device
add, remove or replace, all are excluded by EXCL_OP mechanism.
Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add stacktrace, update changelog, drop chunk_mutex ]
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 16 Jan 2020 11:29:20 +0000 (11:29 +0000)]
Btrfs: always copy scrub arguments back to user space
If scrub returns an error we are not copying back the scrub arguments
structure to user space. This prevents user space to know how much
progress scrub has done if an error happened - this includes -ECANCELED
which is returned when users ask for scrub to stop. A particular use
case, which is used in btrfs-progs, is to resume scrub after it is
canceled, in that case it relies on checking the progress from the scrub
arguments structure and then use that progress in a call to resume
scrub.
So fix this by always copying the scrub arguments structure to user
space, overwriting the value returned to user space with -EFAULT only if
copying the structure failed to let user space know that either that
copying did not happen, and therefore the structure is stale, or it
happened partially and the structure is probably not valid and corrupt
due to the partial copy.
Reported-by: Graham Cobb <g.btrfs@cobb.uk.net>
Link: https://lore.kernel.org/linux-btrfs/d0a97688-78be-08de-ca7d-bcb4c7fb397e@cobb.uk.net/
Fixes: 06fe39ab15a6a4 ("Btrfs: do not overwrite scrub error with fault error in scrub ioctl")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Graham Cobb <g.btrfs@cobb.uk.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Fri, 17 Jan 2020 14:03:11 +0000 (06:03 -0800)]
Merge tag 'gpio-v5.5-4' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"This reverts the GPIOLIB_IRQCHIP in the ThunderX driver.
ThunderX is a piece of Arm-based server chip. I converted the driver
to hierarchical gpiochip without access to real silicon and failed
miserably since I didn't take MSI's into account.
Kevin Hao helpfully stepped in and fixed it properly, let's revert it
for v5.5 and put the proper conversion into v5.6"
* tag 'gpio-v5.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
Revert "gpio: thunderx: Switch to GPIOLIB_IRQCHIP"
Linus Torvalds [Fri, 17 Jan 2020 13:54:18 +0000 (05:54 -0800)]
Merge tag 'block-5.5-2020-01-16' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three fixes that should go into this release:
- The 32-bit segment size fix that I mentioned last week (Ming)
- Use uint for the block size (Mikulas)
- A null_blk zone write handling fix (Damien)"
* tag 'block-5.5-2020-01-16' of git://git.kernel.dk/linux-block:
block: fix an integer overflow in logical block size
null_blk: Fix zone write handling
block: fix get_max_segment_size() overflow on 32bit arch
Florian Fainelli [Thu, 16 Jan 2020 21:08:58 +0000 (13:08 -0800)]
net: systemport: Fixed queue mapping in internal ring map
We would not be transmitting using the correct SYSTEMPORT transmit queue
during ndo_select_queue() which looks up the internal TX ring map
because while establishing the mapping we would be off by 4, so for
instance, when we populate switch port mappings we would be doing:
switch port 0, queue 0 -> ring index #0
switch port 0, queue 1 -> ring index #1
...
switch port 0, queue 3 -> ring index #3
switch port 1, queue 0 -> ring index #8 (4 + 4 * 1)
...
instead of using ring index #4. This would cause our ndo_select_queue()
to use the fallback queue mechanism which would pick up an incorrect
ring for that switch port. Fix this by using the correct switch queue
number instead of SYSTEMPORT queue number.
Fixes: 25c440704661 ("net: systemport: Simplify queue mapping logic")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 16 Jan 2020 20:55:48 +0000 (12:55 -0800)]
net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
With the implementation of the system reset controller we lost a setting
that is currently applied by the bootloader and which configures the IMP
port for 2Gb/sec, the default is 1Gb/sec. This is needed given the
number of ports and applications we expect to run so bring back that
setting.
Fixes: 01b0ac07589e ("net: dsa: bcm_sf2: Add support for optional reset controller line")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Thu, 16 Jan 2020 18:43:27 +0000 (20:43 +0200)]
net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
The sja1105_parse_ports_node function was tested only on device trees
where all ports were enabled. Fix this check so that the driver
continues to probe only with the ports where status is not "disabled",
as expected.
Fixes: 8aa9ebccae87 ("net: dsa: Introduce driver for NXP SJA1105 5-port L2 switch")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Grzeschik [Thu, 16 Jan 2020 13:16:31 +0000 (14:16 +0100)]
net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
According to the Datasheet this bit should be 0 (Normal operation) in
default. With the FORCE_LINK_GOOD bit set, it is not possible to get a
link. This patch sets FORCE_LINK_GOOD to the default value after
resetting the phy.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>