Steve Wise [Thu, 1 Mar 2018 21:57:51 +0000 (13:57 -0800)]
RDMA/nldev: provide detailed CQ information
Implement the RDMA nldev netlink interface for dumping detailed
CQ information.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:44 +0000 (13:57 -0800)]
RDMA/nldev: provide detailed CM_ID information
Implement RDMA nldev netlink interface to get detailed CM_ID information.
Because cm_id's are attached to rdma devices in various work queue
contexts, the pid and task information at restrack_add() time is sometimes
not useful. For example, an nvme/f host connection cm_id ends up being
bound to a device in a work queue context and the resulting pid at attach
time no longer exists after connection setup. So instead we mark all
cm_id's created via the rdma_ucm as "user", and all others as "kernel".
This required tweaking the restrack code a little. It also required
wrapping some rdma_cm functions to allow passing the module name string.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:36 +0000 (13:57 -0800)]
RDMA/CM: move rdma_id_private to cma_priv.h
Move struct rdma_id_private to a new header cma_priv.h so the resource
tracking services in core/nldev.c can read useful information about cm_ids.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:29 +0000 (13:57 -0800)]
RDMA/nldev: common resource dumpit function
Create a common dumpit function that can be used by all common resource
types. This reduces code replication and simplifies the code as we add
more resource types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:22 +0000 (13:57 -0800)]
RDMA/restrack: clean up res_to_dev()
Simplify res_to_dev() to make it easier to read/maintain.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Yishai Hadas [Mon, 26 Feb 2018 13:02:21 +0000 (15:02 +0200)]
IB/mlx4: Move mlx4_uverbs_ex_query_device_resp to include/uapi/
This struct is involved in the user API for mlx4 and should not be hidden
inside a driver header file.
Fixes: 09d208b258a2 ("IB/mlx4: Add report for RSS capabilities by vendor channel")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Doug Ledford [Wed, 7 Mar 2018 20:40:29 +0000 (15:40 -0500)]
Merge tag 'mlx5-updates-2018-02-28-1' of git://git./linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-1 (IPSec-1)
This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.
We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.
To achieve this, the commands must receive all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.
Some fixes, like "net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.
Regards,
Aviad and Matan
Signed-off-by: Doug Ledford <dledford@redhat.com>
Zhu Yanjun [Wed, 7 Mar 2018 05:47:57 +0000 (00:47 -0500)]
IB/rxe: change the function rxe_init_device_param type
The function rxe_init_device_param always returns 0, so its return
type is changed to void.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Zhu Yanjun [Tue, 27 Feb 2018 11:04:33 +0000 (06:04 -0500)]
IB/rxe: remove unnecessary rxe in rxe_send
The variable rxe is not used in the function rxe_send,
so it should be removed.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Zhu Yanjun [Tue, 27 Feb 2018 11:04:32 +0000 (06:04 -0500)]
IB/rxe: remove unnecessary skb_clone
In the send_atomic_ack function, it is not necessary to call
skb_clone. To gain better performance (higher throughput and
lower latency), this skb_clone is removed.
The following tests are made.
  server                       client
---------                    ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
---------                    ---------
On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512
The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.
This test ran for several hours. No memory leak was reported and the
whole system kept working normally.
On the same network, the following tests were run.
Server: ibv_rc_pingpong -d rxe0 -g 1
Client: ibv_rc_pingpong -d rxe0 -g 1 1.1.1.1
The test results on the server (10 tests were made):
Before:
Throughput is 137.07 Mbit/sec
Latency is 517.76 usec/iter
After:
Throughput is 148.85 Mbit/sec
Latency is 476.64 usec/iter
The throughput is enhanced and the latency is reduced.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Bart Van Assche [Fri, 2 Mar 2018 21:14:15 +0000 (13:14 -0800)]
IB/srpt: Add RDMA/CM support
Add a parameter for configuring the port on which the ib_srpt driver
listens for incoming RDMA/CM connections, namely
/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port. The default
value for this parameter is 0 which means "do not listen for incoming
RDMA/CM connections". Add RDMA/CM support to all code that handles
connection state changes. Modify srpt_init_nodeacl() such that ACLs can
be configured for IPv4 and IPv6 addresses.
Note: incoming connection requests are only accepted for ports that
have been enabled. See also the "if (!sport->enabled)" code in the
connection request handler. See also the following configfs attribute:
/sys/kernel/config/target/srpt/$port/$port/enable.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Aviad Yehezkel [Sun, 18 Feb 2018 13:00:54 +0000 (15:00 +0200)]
net/mlx5: Flow steering cmd interface should get the fte when deleting
Previously, deleting a flow steering entry only got the index.
Since the FPGA implementation of FTE's deletion might need to dig
inside the FTE itself, we would like to get the FTE's context.
Change the interface to pass the FTE context.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Boris Pismenny [Sun, 20 Aug 2017 12:13:08 +0000 (15:13 +0300)]
{net,IB}/mlx5: Add flow steering helpers
Add helper functions that check if a protocol is
part of a flow steering match criteria.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Thu, 9 Nov 2017 12:12:15 +0000 (12:12 +0000)]
net/mlx5: Embed mlx5_flow_act into fs_fte
fte objects contain the match value and action. Currently, extending
the actions requires adding them to both the API and fs_fte.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Sun, 18 Feb 2018 11:17:17 +0000 (13:17 +0200)]
net/mlx5: Add empty egress namespace to flow steering core
Currently, we don't support egress flow steering namespace in mlx5
flow steering core implementation. However, when we want to encrypt
a packet, we model it as a flow steering rule in the egress path.
To overcome this, we add an empty egress namespace to flow steering.
This namespace is initialized only when ipsec support exists.
In the future, this will grow into a full-blown flow steering
implementation, resembling the ingress path.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Matan Barak [Sun, 20 Aug 2017 12:46:51 +0000 (15:46 +0300)]
net/mlx5: Add shim layer between fs and cmd
The shim layer allows each namespace to define possibly different
functionality for add/delete/update commands. The shim layer
introduced here will be used to support flow steering with the FPGA.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
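To illustrate the shim layer described in the commit above, here is a minimal userspace sketch, assuming a per-namespace table of function pointers. All names (fs_cmds, fs_namespace, ns_add_fte, the std_ and fpga_ functions) are hypothetical stand-ins, not the mlx5 driver's actual symbols. The FPGA-backed ops do their own work first and then call the standard ops, as the merge message for this series describes.

```c
struct fte { int index; };  /* hypothetical stand-in for a flow table entry */

/* Per-namespace command table: the "shim" between fs core and the command layer. */
struct fs_cmds {
	int (*create_fte)(struct fte *fte);
	int (*delete_fte)(struct fte *fte);
};

/* Standard steering ops (stubs for illustration). */
static int std_create_fte(struct fte *fte) { (void)fte; return 0; }
static int std_delete_fte(struct fte *fte) { (void)fte; return 0; }

/* FPGA-backed ops: count the FPGA step, then fall through to the standard op. */
static int fpga_calls;
static int fpga_create_fte(struct fte *fte) { fpga_calls++; return std_create_fte(fte); }
static int fpga_delete_fte(struct fte *fte) { fpga_calls++; return std_delete_fte(fte); }

static const struct fs_cmds std_cmds  = { std_create_fte, std_delete_fte };
static const struct fs_cmds fpga_cmds = { fpga_create_fte, fpga_delete_fte };

struct fs_namespace { const struct fs_cmds *cmds; };

/* The core never calls a command directly; it goes through the namespace's table. */
static int ns_add_fte(struct fs_namespace *ns, struct fte *fte)
{
	return ns->cmds->create_fte(fte);
}
```

A namespace that points at fpga_cmds gets the FPGA step without the caller of ns_add_fte() changing.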
Matan Barak [Wed, 16 Aug 2017 06:43:48 +0000 (09:43 +0300)]
{net,IB}/mlx5: Add has_tag to mlx5_flow_act
The has_tag member will indicate whether a tag action was specified
in flow specification.
A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed to be a valid flow tag
that is currently used by the mlx5 RDMA driver, whereas in HW flow_tag = 0
means that the user doesn't care about flow_tag. HW always provides
a flow_tag = 0 if all flow tags requested on a specific flow are 0.
So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
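The has_tag distinction in the commit above can be sketched as follows. This is a simplified illustration, assuming a trimmed-down struct and a hypothetical helper flow_tags_conflict(); it is not the driver's real conflict check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for mlx5_flow_act. */
struct flow_act {
	bool has_tag;       /* true only if the user specified a tag action */
	uint32_t flow_tag;  /* meaningful only when has_tag is true */
};

/* Two actions on one flow conflict only if both explicitly request different tags. */
static bool flow_tags_conflict(const struct flow_act *a, const struct flow_act *b)
{
	return a->has_tag && b->has_tag && a->flow_tag != b->flow_tag;
}
```

Without has_tag, a user who explicitly asks for tag 0 and a user who does not care both appear as flow_tag = 0, so a conflict with a non-zero tag could not be reported reliably.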
Boris Pismenny [Wed, 16 Aug 2017 06:33:30 +0000 (09:33 +0300)]
IB/mlx5: Pass mlx5_flow_act struct instead of multiple arguments
Group and pass all function arguments of parse_flow_attr call in one
common struct mlx5_flow_act.
This patch passes all the action arguments of parse_flow_attr in one common
struct mlx5_flow_act. It allows us to scale the number of actions without adding
new arguments to the function.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 19 Nov 2017 15:51:13 +0000 (15:51 +0000)]
net/mlx5: FPGA and IPSec initialization to be before flow steering
Some flow steering namespace initialization (i.e. the egress namespace)
might depend on FPGA capabilities. Change the initialization order
so that the FPGA is initialized before flow steering.
Flow steering fs cmds initialization might depend on
IPSec capabilities. Change the initialization order so
that IPSec is initialized before flow steering as well.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Mon, 29 Jan 2018 11:09:12 +0000 (13:09 +0200)]
net/mlx5e: Removed not need synchronize_rcu
This is already done by the xfrm layer between the state_dev_del
callback and the state_dev_free callback.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Sun, 28 Jan 2018 15:25:35 +0000 (17:25 +0200)]
net/mlx5e: Fixed sleeping inside atomic context
We can't allocate with GFP_KERNEL inside a spinlock.
ida_simple doesn't actually require the spinlock, so remove it.
Fixes: 547eede070eb ("net/mlx5e: IPSec, Innova IPSec offload infrastructure")
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Sun, 11 Feb 2018 15:12:44 +0000 (17:12 +0200)]
net/mlx5e: Wait for FPGA command responses with a timeout
Generally, FPGA IPSec commands must always complete.
We want to wait up to one minute for them to complete gracefully,
even when the process is being killed.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Thu, 22 Feb 2018 15:40:52 +0000 (17:40 +0200)]
net/mlx5: Fixed compilation issue when CONFIG_MLX5_ACCEL is disabled
IPSec init and cleanup functions also depend on linux/mlx5/driver.h.
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aviad Yehezkel [Wed, 31 Jan 2018 13:07:33 +0000 (15:07 +0200)]
IB/mlx5: Removed not used parameters
Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Gustavo A. R. Silva [Mon, 5 Mar 2018 23:36:47 +0000 (17:36 -0600)]
RDMA/bnxt_re/qplib_sp: Use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Sergey Gorenko [Mon, 5 Mar 2018 18:15:56 +0000 (20:15 +0200)]
IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
If an HCA supports the SG_GAPS_REG feature then fewer memory regions
are required per command. This patch reduces the number of memory
regions that are allocated per SRP session.
Signed-off-by: Sergey Gorenko <sergeygo@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Mon, 5 Mar 2018 17:58:33 +0000 (09:58 -0800)]
IB/hfi1: Add a missing rcu_read_unlock()
This patch fixes the following sparse warning:
drivers/infiniband/hw/hfi1/driver.c:251:13: warning: context imbalance in 'rcv_hdrerr' - different lock contexts for basic block
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Arushi [Sat, 3 Mar 2018 16:24:57 +0000 (21:54 +0530)]
infiniband: hw: Drop unnecessary continue
Remove continue statements at the bottom of loops.
Issue found using drop_continue.cocci Coccinelle script.
Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Shiraz Saleem [Fri, 2 Mar 2018 21:17:14 +0000 (15:17 -0600)]
i40iw: Implement get_vector_affinity API
Storage ULPs (like NVMEoF) benefit from exposing affinity mapping
per completion vector to find the optimal multi-queue affinity
assignments. The ULPs call the verbs API ib_get_vector_affinity
introduced in commit c66cd353bbe ("RDMA/core: expose affinity mappings
per completion vector") to get the underlying device's affinity mappings.
Add support in the driver to expose the affinity masks per MSI-X
completion vector.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Shiraz Saleem [Fri, 2 Mar 2018 21:17:13 +0000 (15:17 -0600)]
i40iw: Improve CM node lookup time on connection setup
Currently all CM nodes involved in a connection are
maintained in a connected_node list per dev. During
connection setup, we need to search this every time
we receive a packet on the iWARP LAN Queue (ILQ) and
this can be pretty inefficient for a large number of
connections.
Fix this by organizing the CM nodes in two lists: an
accelerated list and a non-accelerated list. The search
on ILQ receive would be limited to only non-accelerated
nodes. When a node moves to RTS, it is added to the
accelerated list.
Benchmarking ucmatose with 16k connections shows a 20%
improvement in test completion time.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
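The two-list scheme from the commit above can be sketched in plain C. This is an illustration only, assuming singly linked lists and a port number as the lookup key. The names (cm_core, cm_node, cm_find_pending, cm_mark_accelerated) are hypothetical and the real driver uses the kernel's list primitives and locking.

```c
#include <stddef.h>

struct cm_node {
	int port;                 /* hypothetical lookup key */
	struct cm_node *next;
};

struct cm_core {
	struct cm_node *non_accelerated;  /* connections still being set up */
	struct cm_node *accelerated;      /* connections that reached RTS */
};

/* New connections start on the non-accelerated list. */
static void cm_add(struct cm_core *core, struct cm_node *node)
{
	node->next = core->non_accelerated;
	core->non_accelerated = node;
}

/* Receive-path lookup during setup: scans only the non-accelerated list. */
static struct cm_node *cm_find_pending(struct cm_core *core, int port)
{
	struct cm_node *n;

	for (n = core->non_accelerated; n; n = n->next)
		if (n->port == port)
			return n;
	return NULL;
}

/* On reaching RTS, unlink from the pending list and push onto the accelerated one. */
static void cm_mark_accelerated(struct cm_core *core, struct cm_node *node)
{
	struct cm_node **pp;

	for (pp = &core->non_accelerated; *pp; pp = &(*pp)->next) {
		if (*pp == node) {
			*pp = node->next;
			node->next = core->accelerated;
			core->accelerated = node;
			return;
		}
	}
}
```

The lookup cost then depends on the number of connections still in setup, not on the total number of established connections.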
Mustafa Ismail [Fri, 2 Mar 2018 21:17:12 +0000 (15:17 -0600)]
i40iw: Refactor handling of txpend list
Currently the TX pending lists for IEQ and ILQ are
handled separately. The handling of both can be
consolidated in i40iw_poll_completion.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Thu, 1 Mar 2018 22:00:30 +0000 (14:00 -0800)]
IB/srpt: Fix an out-of-bounds stack access in srpt_zerolength_write()
Avoid triggering an out-of-bounds stack access by changing the type
of 'wr' from ib_send_wr into ib_rdma_wr.
This patch fixes the following KASAN bug report:
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x7a9/0x9a0 [rdma_rxe]
Read of size 8 at addr ffff880068197a48 by task kworker/2:1/44
Workqueue: ib_cm cm_work_handler [ib_cm]
Call Trace:
dump_stack+0x8e/0xcd
print_address_description+0x6f/0x280
kasan_report+0x25a/0x380
__asan_load8+0x54/0x90
rxe_post_send+0x7a9/0x9a0 [rdma_rxe]
srpt_zerolength_write+0xf0/0x180 [ib_srpt]
srpt_cm_rtu_recv+0x68/0x110 [ib_srpt]
srpt_rdma_cm_handler+0xbb/0x15b [ib_srpt]
cma_ib_handler+0x1aa/0x4a0 [rdma_cm]
cm_process_work+0x30/0x100 [ib_cm]
cm_work_handler+0xa86/0x351b [ib_cm]
process_one_work+0x475/0x9f0
worker_thread+0x69/0x690
kthread+0x1ad/0x1d0
ret_from_fork+0x3a/0x50
Fixes: aaf45bd83eba ("IB/srpt: Detect session shutdown reliably")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
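The pattern behind the fix above can be sketched in userspace C. The structs below are simplified, hypothetical stand-ins for ib_send_wr and ib_rdma_wr, and provider_remote_addr() is an assumed placeholder for a provider's post-send path that reads the RDMA fields following the base request. Declaring only the base struct on the stack would make that read go past the object, so the caller declares the containing struct and passes its embedded base.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for ib_send_wr / ib_rdma_wr. */
struct send_wr {
	int opcode;
};

struct rdma_wr {
	struct send_wr wr;               /* embedded base request */
	unsigned long long remote_addr;
	unsigned int rkey;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* A provider handling an RDMA opcode reads the fields that follow the base. */
static unsigned long long provider_remote_addr(struct send_wr *wr)
{
	struct rdma_wr *rwr = container_of(wr, struct rdma_wr, wr);

	return rwr->remote_addr;
}

/* Caller side: allocate the full rdma_wr on the stack and pass the embedded base. */
static unsigned long long post_zero_length_write(void)
{
	struct rdma_wr wr = { .wr = { .opcode = 1 }, .remote_addr = 0, .rkey = 0 };

	return provider_remote_addr(&wr.wr);
}
```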
Bart Van Assche [Thu, 1 Mar 2018 22:00:29 +0000 (14:00 -0800)]
RDMA/rxe: Fix an out-of-bounds read
This patch fixes the following KASAN report, triggered when the SRP
initiator calls srp_post_send():
==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x5c4/0x980 [rdma_rxe]
Read of size 8 at addr ffff880066606e30 by task 02-mq/1074
CPU: 2 PID: 1074 Comm: 02-mq Not tainted 4.16.0-rc3-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc7
print_address_description+0x65/0x270
kasan_report+0x231/0x350
rxe_post_send+0x5c4/0x980 [rdma_rxe]
srp_post_send.isra.16+0x149/0x190 [ib_srp]
srp_queuecommand+0x94d/0x1670 [ib_srp]
scsi_dispatch_cmd+0x1c2/0x550 [scsi_mod]
scsi_queue_rq+0x843/0xa70 [scsi_mod]
blk_mq_dispatch_rq_list+0x143/0xac0
blk_mq_do_dispatch_ctx+0x1c5/0x260
blk_mq_sched_dispatch_requests+0x2bf/0x2f0
__blk_mq_run_hw_queue+0xdb/0x160
__blk_mq_delay_run_hw_queue+0xba/0x100
blk_mq_run_hw_queue+0xf2/0x190
blk_mq_sched_insert_request+0x163/0x2f0
blk_execute_rq+0xb0/0x130
scsi_execute+0x14e/0x260 [scsi_mod]
scsi_probe_and_add_lun+0x366/0x13d0 [scsi_mod]
__scsi_scan_target+0x18a/0x810 [scsi_mod]
scsi_scan_target+0x11e/0x130 [scsi_mod]
srp_create_target+0x1522/0x19e0 [ib_srp]
kernfs_fop_write+0x180/0x210
__vfs_write+0xb1/0x2e0
vfs_write+0xf6/0x250
SyS_write+0x99/0x110
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The buggy address belongs to the page:
page:ffffea0001998180 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x4000000000000000()
raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 ffff880066606d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1
 ffff880066606d80: f1 00 f2 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2 f2
>ffff880066606e00: f2 00 00 00 00 00 f2 f2 f2 f3 f3 f3 f3 00 00 00
                                     ^
 ffff880066606e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880066606f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Thu, 1 Mar 2018 22:00:28 +0000 (14:00 -0800)]
RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access
This patch fixes the following KASAN complaint:
==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x77d/0x9b0 [rdma_rxe]
Read of size 8 at addr ffff880061aef860 by task 01/1080
CPU: 2 PID: 1080 Comm: 01 Not tainted 4.16.0-rc3-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x85/0xc7
print_address_description+0x65/0x270
kasan_report+0x231/0x350
rxe_post_send+0x77d/0x9b0 [rdma_rxe]
__ib_drain_sq+0x1ad/0x250 [ib_core]
ib_drain_qp+0x9/0x30 [ib_core]
srp_destroy_qp+0x51/0x70 [ib_srp]
srp_free_ch_ib+0xfc/0x380 [ib_srp]
srp_create_target+0x1071/0x19e0 [ib_srp]
kernfs_fop_write+0x180/0x210
__vfs_write+0xb1/0x2e0
vfs_write+0xf6/0x250
SyS_write+0x99/0x110
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The buggy address belongs to the page:
page:ffffea000186bbc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x4000000000000000()
raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: 0000000000000000 ffffea000186bbe0 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
 ffff880061aef700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880061aef780: 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00
>ffff880061aef800: f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 f2 f2 f2 f2
                                                       ^
 ffff880061aef880: f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 f2 f2
 ffff880061aef900: f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Fixes: 765d67748bcf ("IB: new common API for draining queues")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Colin Ian King [Thu, 1 Mar 2018 16:23:54 +0000 (16:23 +0000)]
infiniband: remove redundant assignment to pointer 'rdi'
The pointer rdi is being initialized with a value that is never read
and re-assigned immediately after, hence the initialization is redundant
and can be removed.
Cleans up clang warning:
drivers/infiniband/sw/rdmavt/vt.c:94:23: warning: Value stored to 'rdi'
during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hernán Gonzalez [Tue, 27 Feb 2018 22:07:58 +0000 (19:07 -0300)]
IB/rxe: Remove unused variable (char *rxe_qp_state_name[])
Note: This is compile only tested as I have no access to the hw. This
variable was not used anywhere in the code. Removing it saves 24 bytes.
add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-24 (-24)
Function                                     old     new   delta
rxe_qp_state_name                             24       -     -24
Total: Before=3348732, After=3348708, chg -0.00%
Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hernán Gonzalez [Tue, 27 Feb 2018 22:05:43 +0000 (19:05 -0300)]
IB/qib: Move char *qib_sdma_state_names[] and constify while there.
Note: This is compile only tested as I have no access to the hw.
This variable was not used in qib_sdma.c but in qib_iba7322.c. Declaring it
there, as static, saves 56 bytes.
add/remove: 0/2 grow/shrink: 0/0 up/down: 0/-144 (-144)
Function                                     old     new   delta
qib_sdma_state_names                          56       -     -56
qib_sdma_event_names                          88       -     -88
Total: Before=2874565, After=2874421, chg -0.01%
Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hernán Gonzalez [Tue, 27 Feb 2018 22:05:42 +0000 (19:05 -0300)]
IB/qib: Remove unused variable (char *qib_sdma_event_names[])
Note: This is compile only tested as I have no access to the hw.
This variable was not used anywhere in the code. Removing it saves 88
bytes.
add/remove: 0/1 grow/shrink: 0/0 up/down: 0/-88 (-88)
Function                                     old     new   delta
qib_sdma_event_names                          88       -     -88
Total: Before=2874565, After=2874477, chg -0.00%
Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:26 +0000 (14:09 -0800)]
IB/srp: Use %pIS instead of inet_ntop()
Except for a minor log message change, this patch does not change
any functionality. For the introduction of %pIS, see also commit
1067964305df ("lib: vsprintf: add IPv4/v6 generic %p[Ii]S[pfs]
format specifier").
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:25 +0000 (14:09 -0800)]
Revert "IB/srp: Avoid that a cable pull can trigger a kernel crash"
The caller of srp_ib_lookup_path() is responsible for holding a reference
on the SCSI host. That means that commit 8a0d18c62121 was not necessary.
Hence revert it.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:24 +0000 (14:09 -0800)]
IB/srp: Fix srp_abort()
Before commit e494f6a72839 ("[SCSI] improved eh timeout handler") it
did not really matter whether or not abort handlers like srp_abort()
called .scsi_done() when returning a value other than SUCCESS. Since
that commit, however, this matters. Hence only call .scsi_done() when
returning SUCCESS.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
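The rule stated in the commit above (complete the command only when the abort handler returns SUCCESS) can be sketched as follows. Everything here is a hypothetical stand-in: abort_result replaces the SCSI SUCCESS/FAILED codes, cmd_done() replaces the .scsi_done() callback, and abort_ok is an assumed flag for whether the abort request itself worked.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the SCSI SUCCESS / FAILED return codes. */
enum abort_result { ABORT_SUCCESS, ABORT_FAILED };

struct cmd {
	bool done_called;   /* records whether the completion callback ran */
};

/* Stand-in for the .scsi_done() completion callback. */
static void cmd_done(struct cmd *c)
{
	c->done_called = true;
}

/* Complete the command only on success; otherwise return without completing it. */
static enum abort_result abort_cmd(struct cmd *c, bool abort_ok)
{
	if (!abort_ok)
		return ABORT_FAILED;
	cmd_done(c);
	return ABORT_SUCCESS;
}
```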
Arnd Bergmann [Tue, 20 Feb 2018 20:56:27 +0000 (21:56 +0100)]
infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
On 32-bit targets, we otherwise get a warning about an impossible constant
integer expression:
In file included from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/ib_verbs.c:39:
drivers/infiniband/hw/bnxt_re/ib_verbs.c: In function 'bnxt_re_query_device':
include/linux/bitops.h:7:24: error: left shift count >= width of type [-Werror=shift-count-overflow]
#define BIT(nr) (1UL << (nr))
^~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:61:34: note: in expansion of macro 'BIT'
#define BNXT_RE_MAX_MR_SIZE_HIGH BIT(39)
^~~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:62:30: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE_HIGH'
#define BNXT_RE_MAX_MR_SIZE BNXT_RE_MAX_MR_SIZE_HIGH
^~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/ib_verbs.c:149:25: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE'
ib_attr->max_mr_size = BNXT_RE_MAX_MR_SIZE;
^~~~~~~~~~~~~~~~~~~
Fixes: 872f3578241d ("RDMA/bnxt_re: Add support for MRs with Huge pages")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
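The warning above comes from shifting a 32-bit unsigned long by 39 bits. A minimal userspace sketch of the fix, assuming a local re-definition of BIT_ULL() in place of the kernel macro, and hypothetical names MAX_MR_SIZE_HIGH and max_mr_size():

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's BIT_ULL(): the constant is unsigned
 * long long, so the shift is well-defined even where long is 32 bits. */
#define BIT_ULL(nr) (1ULL << (nr))

/* Hypothetical name mirroring BNXT_RE_MAX_MR_SIZE_HIGH from the warning above. */
#define MAX_MR_SIZE_HIGH BIT_ULL(39)

/* Returns the 64-bit limit regardless of the width of unsigned long. */
static uint64_t max_mr_size(void)
{
	return MAX_MR_SIZE_HIGH;
}
```

With BIT(39), defined as 1UL << 39, the same expression has a shift count larger than the type width on 32-bit targets, which is what the compiler rejects.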
Arnd Bergmann [Tue, 20 Feb 2018 20:56:26 +0000 (21:56 +0100)]
infiniband: qplib_fp: fix pointer cast
Building for a 32-bit target results in a couple of warnings from casting
between a 32-bit pointer and a 64-bit integer:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_service_nq':
drivers/infiniband/hw/bnxt_re/qplib_fp.c:333:23: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
bnxt_qplib_arm_srq((struct bnxt_qplib_srq *)q_handle,
^
drivers/infiniband/hw/bnxt_re/qplib_fp.c:336:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
(struct bnxt_qplib_srq *)q_handle,
^
In file included from include/linux/byteorder/little_endian.h:5,
from arch/arm/include/uapi/asm/byteorder.h:22,
from include/asm-generic/bitops/le.h:6,
from arch/arm/include/asm/bitops.h:342,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/qplib_fp.c:39:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_create_srq':
include/uapi/linux/byteorder/little_endian.h:31:43: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
#define __cpu_to_le64(x) ((__force __le64)(__u64)(x))
^
include/linux/byteorder/generic.h:86:21: note: in expansion of macro '__cpu_to_le64'
#define cpu_to_le64 __cpu_to_le64
^~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/qplib_fp.c:569:19: note: in expansion of macro 'cpu_to_le64'
req.srq_handle = cpu_to_le64(srq);
Using a uintptr_t as an intermediate works on all architectures.
Fixes: 37cb11acf1f7 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
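The uintptr_t intermediate mentioned above can be sketched as follows. This is a userspace illustration with hypothetical names (struct srq, srq_to_handle, handle_to_srq); the byte-order conversion that the driver applies with cpu_to_le64() is left out.

```c
#include <stdint.h>

struct srq { int id; };   /* hypothetical stand-in for struct bnxt_qplib_srq */

/* Pointer to 64-bit handle: go through uintptr_t, which has pointer width,
 * then widen to 64 bits. No size-mismatch cast on 32-bit targets. */
static uint64_t srq_to_handle(struct srq *srq)
{
	return (uint64_t)(uintptr_t)srq;
}

/* 64-bit handle back to pointer: narrow to uintptr_t first, then cast. */
static struct srq *handle_to_srq(uint64_t handle)
{
	return (struct srq *)(uintptr_t)handle;
}
```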
Markus Elfring [Sat, 27 Jan 2018 20:48:01 +0000 (21:48 +0100)]
RDMA/iwpm: Delete an error message for a failed memory allocation in iwpm_create_nlmsg()
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Markus Elfring [Sat, 27 Jan 2018 19:06:59 +0000 (20:06 +0100)]
IB/usnic: Delete an error message for a failed memory allocation in usnic_transport_init()
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Sun, 25 Feb 2018 11:39:52 +0000 (13:39 +0200)]
RDMA/mlx5: Refactor QP type check to be as early as possible
Perform QP type check in one place and fail as early as possible.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 28 Feb 2018 06:29:56 +0000 (08:29 +0200)]
mailmap: Map Leon Romanovsky's emails
Update .mailmap file to point to my primary open-source
related e-mail address.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:39 +0000 (12:18 +0200)]
IB/uverbs: Tidy uverbs_uobject_add
Maintaining the uobjects list is mandatory, so hoist it into the common
rdma_alloc_commit_uobject() function and inline it as there is now
only one caller.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Doug Ledford [Wed, 28 Feb 2018 18:37:39 +0000 (13:37 -0500)]
Merge tag 'mlx5-updates-2018-02-23' of git://git./linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-23 (IB representors)
From: Mark Bloch <markb@mellanox.com>
=========
Add IB representor when in switchdev mode
The following series adds support for an IB (RAW Ethernet only) device
representor which is created when the user switches to switchdev mode.
Today when switching to switchdev mode the only representors which are
created are net devices. Each netdev is a representor of a virtual
function and any data sent via the representor is received on the virtual
function, and any data sent via the virtual function is received by the
representor.
For the mlx5 driver the main use of this functionality is to be able to
use Open vSwitch on the hypervisor in order to manage/control traffic
from/to the virtual functions. Open vSwitch can also work with DPDK
devices and not just net devices, so this series exposes an IB device that
the Mellanox PMD driver uses and that Open vSwitch DPDK can then use.
An IB device representor exposes only RAW Ethernet QP capabilities and
the ability to create flow rules to direct traffic to its RX queues. The
state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
corresponding net device representor. No other RDMA/RoCE functionality is
currently supported and no GID table is exposed.
=========
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mark Bloch [Mon, 15 Jan 2018 13:11:37 +0000 (13:11 +0000)]
IB/mlx5: Disable self loopback check when in switchdev mode
When in switchdev mode, there is no need to do self loopback checks
because we can't receive those packets: we insert steering rules into the
eswitch that make sure packets can't be looped back.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:24:13 +0000 (11:24 +0000)]
net/mlx5: E-Switch, Reload IB interface when switching devlink modes
Up until this point it wasn't possible to activate IB representors
when switching to switchdev mode, remove this limitation.
We trigger reload of the PF IB interface in order to make sure that
already allocated resources are invalid and new resources will be opened
correctly with all the limitations of switchdev mode applied (only raw
packet capabilities, without RoCE). We also move the remove/add to a
place where the E-Switch mode is set/unset to better control when to
trigger this action; this allows the IB side to start in the correct
mode.
For better code reuse, create a function which reloads an interface and
export it.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:16:30 +0000 (11:16 +0000)]
IB/mlx5: Add proper representors support
This commit adds full support for IB representors:
1) Representor profiles. We add two new profiles:
nic_rep_profile - This profile will be used to create an IB device that
represents the PF/UPLINK.
rep_profile - This profile will be used to create an IB device that
represents VFs. Each VF will be its own representor.
2) Proper load/unload callbacks. These are called by the E-Switch when
moving to/from switchdev mode.
3) Different flow DB handling for when we are in switchdev mode.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 29 Jan 2018 10:40:37 +0000 (10:40 +0000)]
IB/mlx5: E-Switch, Add rule to forward traffic to vport
In order to forward traffic from representor's SQ to the right virtual
function, every time an SQ is created also add the corresponding flow rule
to the FDB.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 22 Jan 2018 15:29:44 +0000 (15:29 +0000)]
IB/mlx5: Don't expose MR cache in switchdev mode
When enabling many VFs and switching to switchdev mode, the total amount
of mkeys we try to allocate when loading representors is very large and
may cause timeouts on allocations. The same issue was observed on VFs
and we employ the same fix that was done for them. We avoid allocating
the full MR cache on load but still allow it to be manipulated once the
IB device is loaded.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 6 Nov 2017 12:22:13 +0000 (12:22 +0000)]
IB/mlx5: When in switchdev mode, expose only raw packet capabilities
Currently in switchdev mode we allow only raw packet QPs.
Expose the right capabilities and set the gid table length to 0. Also
make sure we don't try to enable RoCE: split the function that enables
RoCE so representors can enable only the notifier needed for
net device events.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 15:02:36 +0000 (15:02 +0000)]
IB/mlx5: Listen to netdev register/unregister events in switchdev mode
Currently we listen to netdev register/unregister events based on the PCI
device. When in switchdev mode the PF and representors share the same PCI
device, so in order to pair the ib device and netdev in switchdev mode,
compare the netdev that triggered the event to that of the representor.
Expose a function that lets you retrieve the netdev associated with
a given representor.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:44:29 +0000 (14:44 +0000)]
IB/mlx5: Add match on vport when in switchdev mode
When we point to a representor, it means we are in switchdev mode.
The flow db is shared between PF and virtual function representors
so each rule created needs to have a match on its specific source port.
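As a rough illustration of why a shared flow DB needs a per-rule source port match, here is a minimal user-space sketch. The struct layout and the names flow_rule, flow_db, flow_db_add and flow_db_match are hypothetical and are not the mlx5 data structures; the ethertype field is only a stand-in for whatever a real rule matches on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified rule: real mlx5 rules carry far more state. */
struct flow_rule {
	uint16_t src_vport;	/* source port the rule is scoped to */
	uint16_t ethertype;	/* the only packet field matched in this sketch */
};

/* Hypothetical flow DB shared by the PF and every representor. */
struct flow_db {
	struct flow_rule rules[16];
	size_t count;
};

/* Add a rule scoped to one source vport; returns false when the DB is full. */
static bool flow_db_add(struct flow_db *db, uint16_t src_vport, uint16_t ethertype)
{
	assert(db != NULL);
	if (db->count >= sizeof(db->rules) / sizeof(db->rules[0]))
		return false;
	db->rules[db->count].src_vport = src_vport;
	db->rules[db->count].ethertype = ethertype;
	db->count++;
	return true;
}

/* A packet hits a rule only if the ethertype and the source vport both match. */
static bool flow_db_match(const struct flow_db *db, uint16_t src_vport, uint16_t ethertype)
{
	assert(db != NULL);
	for (size_t i = 0; i < db->count; i++)
		if (db->rules[i].src_vport == src_vport &&
		    db->rules[i].ethertype == ethertype)
			return true;
	return false;
}
```

Without the src_vport comparison, a rule installed by one representor would also capture traffic that belongs to every other user of the shared DB.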
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:42:35 +0000 (14:42 +0000)]
IB/mlx5: Allocate flow DB only on PF IB device
The flow DB is a resource shared between the PF and representors, so we
need to allocate it only when creating the PF IB device.
Once we add IB representors, they will use the flow db which was
created by the PF.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:34:48 +0000 (14:34 +0000)]
IB/mlx5: Add basic register/unregister representors code
Create the basic infrastructure of registering and unregistering
IB representors. The load/unload callbacks are left empty and
proper implementation will be introduced in following patches.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:13:46 +0000 (14:13 +0000)]
net/mlx5: E-Switch, Add definition of IB representor
Create a new representor type, REP_IB, which will be initialized by an IB
device that is used as a logical representor of an eswitch vport (VF or
uplink), just like we have a net device today in switchdev mode.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 30 Jan 2018 10:46:34 +0000 (10:46 +0000)]
net/mlx5: E-Switch, Optimize HW steering tables in switchdev mode
Under switchdev mode we insert an eswitch miss rule causing any
unmatched traffic to be sent towards the PF vport. This miss rule can
be optimized if we break it in two: one rule for multicast traffic and
the other for unicast.
Breaking the miss rule into two (unicast and multicast) allows the firmware
to program the hardware in a more efficient way.
Using ConnectX-5 Ex with IXIA and testpmd (which uses IB representors):
IXIA -> NIC -> PF -> IB representor -> NIC -> VF:
- Without this optimization: 9.2 MPPS.
- With this optimization: 18 MPPS.
VF -> NIC -> IB representor-> PF -> NIC -> IXIA:
- Without this optimization: 17 MPPS.
- With this optimization: 23.4 MPPS.
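The unicast/multicast split can be pictured with the following sketch. It relies only on the standard Ethernet convention that a destination MAC is a group (multicast or broadcast) address when the least significant bit of its first octet is set. How the driver and firmware actually encode the two miss rules is not shown in this log; miss_rule and select_miss_rule are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the two miss rules described above. */
enum miss_rule { MISS_UNICAST, MISS_MULTICAST };

/* Ethernet group bit: least significant bit of the first destination octet. */
static bool mac_is_multicast(const uint8_t dmac[6])
{
	assert(dmac != NULL);
	return (dmac[0] & 0x01) != 0;
}

/* Pick which of the two miss rules an otherwise unmatched packet would hit. */
static enum miss_rule select_miss_rule(const uint8_t dmac[6])
{
	return mac_is_multicast(dmac) ? MISS_MULTICAST : MISS_UNICAST;
}
```

Broadcast (ff:ff:ff:ff:ff:ff) also has the group bit set, so in this sketch it falls under the multicast rule.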
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:19:12 +0000 (11:19 +0000)]
net/mlx5: E-Switch, Increase number of FTEs in FDB in switchdev mode
The max FTE number should be the max number of SQs that can be opened.
Ethernet representors open one SQ each. Once we add IB representors this
will increase (depends on the user). For now let's start with 31
per IB representor and, if needed, increase it in the future.
This increase only affects the number of FTEs in the slow path FDB;
offloaded rules (done via TC on the fast path portion of the FDB)
aren't affected.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:04:14 +0000 (14:04 +0000)]
net/mlx5: E-Switch, Move representors definition to a global scope
In preparation for IB representors, move representors structs to a global
scope, also expose functions needed for registration, unregistration,
eswitch mode and creating a flow rule to direct traffic from SQs to the
right VF.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Wed, 27 Sep 2017 07:24:18 +0000 (07:24 +0000)]
net/mlx5: E-Switch, Add callback to get representor device
Add a callback interface to get a protocol device (per representor type).
The Ethernet representors will expose their netdev via this interface.
This functionality can be later used by IB representor in order to find the
corresponding net device representor.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:44 +0000 (18:12 +0200)]
RDMA/verbs: Return proper error code for not supported system call
The proper return error is -EOPNOTSUPP and not -ENOSYS, so update
all places in verbs.c to match these semantics.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:43 +0000 (18:12 +0200)]
RDMA/uverbs: Reduce number of command header flags checks
Simplify the code by directly checking the availability of the extended
command flag instead of doing multiple shift operations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:42 +0000 (18:12 +0200)]
RDMA/uverbs: Replace user's types with kernel's types
Kernel-internal variable declarations don't need to be
declared with user types. This patch converts the occurrences
that appear in ib_uverbs_write().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:41 +0000 (18:12 +0200)]
RDMA/uverbs: Refactor the header validation logic
Move all header validation logic to be performed before SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:40 +0000 (18:12 +0200)]
RDMA/uverbs: Copy ex_hdr outside of SRCU read lock
The SRCU read lock protects the IB device pointer
and doesn't need to be called before copying user
provided header.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:39 +0000 (18:12 +0200)]
RDMA/uverbs: Move ucontext check before SRCU read lock
There is no need to take the SRCU lock before checking
file->ucontext, so move the check before it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:38 +0000 (18:12 +0200)]
RDMA/uverbs: Properly check command supported mask
The check based on index is not sufficient because
IB_USER_VERBS_EX_CMD_CREATE_CQ = IB_USER_VERBS_CMD_CREATE_CQ
and IB_USER_VERBS_CMD_CREATE_CQ <= IB_USER_VERBS_CMD_OPEN_QP,
so if we execute IB_USER_VERBS_EX_CMD_CREATE_CQ this code checks
ib_dev->uverbs_cmd_mask not ib_dev->uverbs_ex_cmd_mask.
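A minimal sketch of the idea, assuming the fix amounts to choosing the mask from the "extended" flag instead of from the command index. The struct dev_masks and the function command_supported are hypothetical stand-ins; only the two mask names come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-device masks named in the commit. */
struct dev_masks {
	uint64_t uverbs_cmd_mask;	/* regular commands */
	uint64_t uverbs_ex_cmd_mask;	/* extended commands */
};

/*
 * Select the mask from the "extended" flag, not from the command index:
 * a regular and an extended command may share the same index.
 */
static bool command_supported(const struct dev_masks *dev,
			      unsigned int command, bool extended)
{
	uint64_t mask;

	assert(dev != NULL);
	if (command >= 64)
		return false;	/* index does not fit in a 64-bit mask */
	mask = extended ? dev->uverbs_ex_cmd_mask : dev->uverbs_cmd_mask;
	return (mask & (UINT64_C(1) << command)) != 0;
}
```

With a device that sets a bit only in uverbs_cmd_mask, the same index is reported as supported for the regular command and unsupported for the extended one, which is the distinction an index-only check loses.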
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:37 +0000 (18:12 +0200)]
RDMA/uverbs: Refactor command header processing
Move all command header processing into separate function
and perform those checks before acquiring SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:36 +0000 (18:12 +0200)]
RDMA/uverbs: Unify return values of not supported command
The non-existing command is supposed to return -EOPNOTSUPP, but the
current code returns different errors for different flows for the
same failure. This patch unifies those flows.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:35 +0000 (18:12 +0200)]
RDMA/uverbs: Return not supported error code for unsupported commands
A command that doesn't exist is not supported,
so update the code to return -EOPNOTSUPP in that case.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:34 +0000 (18:12 +0200)]
RDMA/uverbs: Fail as early as possible if not enough header data was provided
Fail as early as possible if not enough header data
was provided.
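The "fail early" pattern can be sketched as below: the length check comes first, before any lock or lookup would be needed. The header layout and the name parse_cmd_hdr are hypothetical, and -EINVAL is used here only as a plausible error code for the sketch, not as a statement of what the patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command header; the real uverbs layout is not reproduced here. */
struct cmd_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

/* Returns 0 and fills *hdr, or -EINVAL when the buffer cannot hold a header. */
static int parse_cmd_hdr(const void *buf, size_t count, struct cmd_hdr *hdr)
{
	assert(buf != NULL && hdr != NULL);
	/* Reject short input first, before doing any other work. */
	if (count < sizeof(*hdr))
		return -EINVAL;
	memcpy(hdr, buf, sizeof(*hdr));
	return 0;
}
```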
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:33 +0000 (18:12 +0200)]
RDMA/uverbs: Refactor flags checks and update return value
Since commit f21519b23c1b ("IB/core: extended command: an
improved infrastructure for uverbs commands"), uverbs
supports extra flags as an input to the command interface.
In practice, however, only one flag is available and used,
so it is better to refactor the code so that the resolution and
report to the users are done as early as possible.
As part of this change, we changed the return value of the failure case
from ENOSYS to EINVAL to be consistent with the rest of the flags checks.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:32 +0000 (18:12 +0200)]
RDMA/uverbs: Update sizeof users
Update sizeof() users to be consistent with coding style.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:31 +0000 (18:12 +0200)]
RDMA/uverbs: Convert command mask validity check function to be bool
The function validate_command_mask() returns only two results: success
or failure, so convert it to return bool instead of 0 and -1.
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Fri, 23 Feb 2018 03:27:20 +0000 (22:27 -0500)]
Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
There is a 14-patch series waiting to come into for-next that has a
dependency on code submitted into this kernel's for-rc series. So, merge
the for-rc branch into the current for-next in order to make the patch
series apply cleanly.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Fri, 23 Feb 2018 01:52:28 +0000 (20:52 -0500)]
Merge tag 'mlx5-updates-2018-02-21' of git://git./linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-21
This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.
By Saeed,
The first six patches of the series address a performance issue
and should provide a performance boost for multi-core, interrupt-heavy
workloads. The issue is fixed in the first patch; all other patches
refactor the code in light of this fix.
The problem being fixed is a shared spinlock, accessed across all HCA
IRQs, which protects the CQ database. To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.
By Yonatan,
Fragmented completion queue (CQ) for RDMA:
a core driver implementation that creates fragmented CQ buffers rather than
one large contiguous memory buffer. The implementation scheme already
exists and is used by the netdev CQs; the patch shares that code with the
rdma CQ creation flow and makes use of the new API in the mlx5_ib driver.
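To show what "fragmented rather than contiguous" means for entry lookup, here is a user-space sketch. It is not the mlx5 API: frag_buf and its helpers are hypothetical, and the sketch assumes fixed-size fragments whose size is a multiple of the entry size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fragmented buffer: several fixed-size chunks, not one allocation. */
struct frag_buf {
	void **frags;		/* one pointer per fragment */
	size_t nfrags;
	size_t frag_size;	/* bytes per fragment */
	size_t entry_size;	/* bytes per entry; must divide frag_size */
};

/* Allocate enough fragments for nentries entries; returns 0 or -1. */
static int frag_buf_init(struct frag_buf *b, size_t nentries,
			 size_t entry_size, size_t frag_size)
{
	size_t per_frag, i;

	assert(b != NULL);
	if (nentries == 0 || entry_size == 0 || frag_size % entry_size != 0)
		return -1;
	per_frag = frag_size / entry_size;
	b->nfrags = (nentries + per_frag - 1) / per_frag;	/* round up */
	b->frag_size = frag_size;
	b->entry_size = entry_size;
	b->frags = calloc(b->nfrags, sizeof(*b->frags));
	if (b->frags == NULL)
		return -1;
	for (i = 0; i < b->nfrags; i++) {
		b->frags[i] = calloc(1, frag_size);
		if (b->frags[i] == NULL) {
			while (i > 0)
				free(b->frags[--i]);
			free(b->frags);
			return -1;
		}
	}
	return 0;
}

/* Entry lookup: pick the fragment first, then the offset inside it. */
static void *frag_buf_entry(const struct frag_buf *b, size_t index)
{
	size_t per_frag = b->frag_size / b->entry_size;

	assert(index / per_frag < b->nfrags);
	return (char *)b->frags[index / per_frag] +
	       (index % per_frag) * b->entry_size;
}

/* Release every fragment and the fragment pointer array. */
static void frag_buf_free(struct frag_buf *b)
{
	for (size_t i = 0; i < b->nfrags; i++)
		free(b->frags[i]);
	free(b->frags);
}
```

The trade-off is one extra division per lookup in exchange for never needing a single large contiguous allocation.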
Thanks,
Saeed.
Merged into rdma-next tree as well as net-next tree to prevent conflicts
in future patches between the two trees.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 08:25:01 +0000 (10:25 +0200)]
RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type
An attempt to modify an XRC_TGT QP from user space (an ibv_xsrq_pingpong
invocation) triggers the following kernel panic. It is caused by the
fact that such QPs are missing uobject initialization.
[ 17.408845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 17.412645] IP: rdma_lookup_put_uobject+0x9/0x50
[ 17.416567] PGD 0 P4D 0
[ 17.419262] Oops: 0000 [#1] SMP PTI
[ 17.422915] CPU: 0 PID: 455 Comm: ibv_xsrq_pingpo Not tainted 4.16.0-rc1+ #86
[ 17.424765] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 17.427399] RIP: 0010:rdma_lookup_put_uobject+0x9/0x50
[ 17.428445] RSP: 0018:ffffb8c7401e7c90 EFLAGS: 00010246
[ 17.429543] RAX: 0000000000000000 RBX: ffffb8c7401e7cf8 RCX: 0000000000000000
[ 17.432426] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[ 17.437448] RBP: 0000000000000000 R08: 00000000000218f0 R09: ffffffff8ebc4cac
[ 17.440223] R10: fffff6038052cd80 R11: ffff967694b36400 R12: ffff96769391f800
[ 17.442184] R13: ffffb8c7401e7cd8 R14: 0000000000000000 R15: ffff967699f60000
[ 17.443971] FS: 00007fc29207d700(0000) GS:ffff96769fc00000(0000) knlGS:0000000000000000
[ 17.446623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.448059] CR2: 0000000000000048 CR3: 000000001397a000 CR4: 00000000000006b0
[ 17.449677] Call Trace:
[ 17.450247] modify_qp.isra.20+0x219/0x2f0
[ 17.451151] ib_uverbs_modify_qp+0x90/0xe0
[ 17.452126] ib_uverbs_write+0x1d2/0x3c0
[ 17.453897] ? __handle_mm_fault+0x93c/0xe40
[ 17.454938] __vfs_write+0x36/0x180
[ 17.455875] vfs_write+0xad/0x1e0
[ 17.456766] SyS_write+0x52/0xc0
[ 17.457632] do_syscall_64+0x75/0x180
[ 17.458631] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 17.460004] RIP: 0033:0x7fc29198f5a0
[ 17.460982] RSP: 002b:00007ffccc71f018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 17.463043] RAX: ffffffffffffffda RBX: 0000000000000078 RCX: 00007fc29198f5a0
[ 17.464581] RDX: 0000000000000078 RSI: 00007ffccc71f050 RDI: 0000000000000003
[ 17.466148] RBP: 0000000000000000 R08: 0000000000000078 R09: 00007ffccc71f050
[ 17.467750] R10: 000055b6cf87c248 R11: 0000000000000246 R12: 00007ffccc71f300
[ 17.469541] R13: 000055b6cf8733a0 R14: 0000000000000000 R15: 0000000000000000
[ 17.471151] Code: 00 00 0f 1f 44 00 00 48 8b 47 48 48 8b 00 48 8b 40 10 e9 0b 8b 68 00 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 89 f5 <48> 8b 47 48 48 89 fb 40 0f b6 f6 48 8b 00 48 8b 40 20 e8 e0 8a
[ 17.475185] RIP: rdma_lookup_put_uobject+0x9/0x50 RSP: ffffb8c7401e7c90
[ 17.476841] CR2: 0000000000000048
[ 17.477764] ---[ end trace 1dbcc5354071a712 ]---
[ 17.478880] Kernel panic - not syncing: Fatal exception
[ 17.480277] Kernel Offset: 0xd000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Fixes: 2f08ee363fe0 ("RDMA/restrack: don't use uaccess_kernel()")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Selvin Xavier [Fri, 16 Feb 2018 05:20:13 +0000 (21:20 -0800)]
RDMA/bnxt_re: Avoid system hang during device un-reg
BNXT_RE_FLAG_TASK_IN_PROG doesn't handle multiple work
requests posted together. Track schedule of multiple
workqueue items by maintaining a per device counter
and proceed with IB dereg only if this counter is zero.
flush_workqueue is no longer required from
NETDEV_UNREGISTER path.
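A single "in progress" flag cannot represent two outstanding work items, which is why a counter is used. The sketch below models that with C11 atomics; dev_state, sched_count and the helper names are hypothetical and are not the bnxt_re code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; sched_count stands in for the per-device counter. */
struct dev_state {
	atomic_int sched_count;	/* work items scheduled but not yet finished */
};

/* Called when a work item is queued. */
static void work_scheduled(struct dev_state *d)
{
	assert(d != NULL);
	atomic_fetch_add(&d->sched_count, 1);
}

/* Called at the end of the work handler. */
static void work_finished(struct dev_state *d)
{
	int prev;

	assert(d != NULL);
	prev = atomic_fetch_sub(&d->sched_count, 1);
	assert(prev > 0);	/* a finish without a matching schedule is a bug */
	(void)prev;
}

/* Deregistration may proceed only when no work item is outstanding. */
static bool can_unregister(struct dev_state *d)
{
	assert(d != NULL);
	return atomic_load(&d->sched_count) == 0;
}
```

With two items queued back to back, the first completion leaves the counter at one, so deregistration still waits; a boolean flag would already have been cleared at that point.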
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Selvin Xavier [Fri, 16 Feb 2018 05:20:12 +0000 (21:20 -0800)]
RDMA/bnxt_re: Fix system crash during load/unload
During driver unload, the driver proceeds with cleanup
without waiting for the scheduled events. So the device
pointers get freed up and driver crashes when the events
are scheduled later.
Flush the bnxt_re_task work queue before starting
device removal.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Selvin Xavier [Fri, 16 Feb 2018 05:20:11 +0000 (21:20 -0800)]
RDMA/bnxt_re: Synchronize destroy_qp with poll_cq
Avoid system crash when destroy_qp is invoked while
the driver is processing the poll_cq. Synchronize these
functions using the cq_lock.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Devesh Sharma [Fri, 16 Feb 2018 05:20:10 +0000 (21:20 -0800)]
RDMA/bnxt_re: Unpin SQ and RQ memory if QP create fails
The driver leaves the QP memory pinned if the QP create command
fails in the FW. Avoid this scenario by adding a proper
exit path if the FW command fails.
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Devesh Sharma [Fri, 16 Feb 2018 05:20:08 +0000 (21:20 -0800)]
RDMA/bnxt_re: Disable atomic capability on bnxt_re adapters
More testing needs to be done before enabling this feature.
Disable the feature temporarily.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 15 Feb 2018 02:43:36 +0000 (18:43 -0800)]
RDMA/restrack: don't use uaccess_kernel()
uaccess_kernel() isn't sufficient to determine if an rdma resource is
user-mode or not. For example, resources allocated in the add_one()
function of an ib_client get falsely labeled as user mode, when they
are kernel mode allocations, e.g. mad qps.
The result is that these qps are skipped over during a nldev query
because of an erroneous namespace mismatch.
So now we determine if the resource is user-mode by looking at the object
struct's uobject or similar pointer to know if it was allocated for user
mode applications.
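In sketch form, the classification described above becomes a property of the object itself rather than of the calling context. The types tracked_qp and user_object and the helper qp_is_user are hypothetical simplifications, not the restrack structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space handle attached to user-mode allocations. */
struct user_object {
	int id;
};

/* Hypothetical tracked resource; only the field relevant here is shown. */
struct tracked_qp {
	struct user_object *uobject;	/* non-NULL only for user-mode allocations */
};

/* Classify by the object's own pointer, not by who happens to be calling. */
static bool qp_is_user(const struct tracked_qp *qp)
{
	assert(qp != NULL);
	return qp->uobject != NULL;
}
```

A QP created by a kernel client during add_one() has no uobject, so it is classified as kernel mode no matter which task context the allocation ran in.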
Fixes: 02d8883f520e ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 12:38:43 +0000 (14:38 +0200)]
RDMA/verbs: Check existence of function prior to accessing it
Update all the flows to ensure that function pointer exists prior
to accessing it.
This is much safer than checking the uverbs_ex_mask variable, especially
since we know that test isn't working properly and will be removed
in -next.
This prevents a user-triggerable oops.
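The check described above follows a common pattern for optional driver callbacks, sketched here in user space. device_ops, modify_thing and call_modify are hypothetical names; the -EOPNOTSUPP return for a missing callback is an assumption of this sketch, chosen to match the error code used elsewhere in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table: a driver may leave optional callbacks NULL. */
struct device_ops {
	int (*modify_thing)(int value);
};

/* Verify the callback exists before calling through the pointer. */
static int call_modify(const struct device_ops *ops, int value)
{
	assert(ops != NULL);
	if (ops->modify_thing == NULL)
		return -EOPNOTSUPP;
	return ops->modify_thing(value);
}

/* Example callback used only to exercise the sketch. */
static int double_value(int value)
{
	return 2 * value;
}
```

Testing the pointer directly cannot drift out of sync with the ops table, which is the weakness of checking a separately maintained capability mask.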
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Mon, 12 Feb 2018 17:50:25 +0000 (09:50 -0800)]
IB/srp: Fix completion vector assignment algorithm
Ensure that cv_end is equal to ibdev->num_comp_vectors for the
NUMA node with the highest index. This patch improves spreading
of RDMA channels over completion vectors and thereby improves
performance, especially on systems with only a single NUMA node.
This patch drops support for the comp_vector login parameter by
ignoring the value of that parameter since I have not found a
good way to combine support for that parameter and automatic
spreading of RDMA channels over completion vectors.
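One way to obtain the property stated above (cv_end equal to the number of completion vectors for the last node) is a proportional split per NUMA node. The sketch below is an assumption about the shape of such a split, not a copy of the ib_srp code; cv_range and cv_range_for_node are hypothetical names.

```c
#include <assert.h>

/* Half-open range [start, end) of completion vectors assigned to one node. */
struct cv_range {
	int start;
	int end;
};

/*
 * Proportional split: node i gets [i * n / nodes, (i + 1) * n / nodes).
 * For the last node, end == n, so every vector is covered.
 */
static struct cv_range cv_range_for_node(int node_idx, int nr_nodes,
					 int num_comp_vectors)
{
	struct cv_range r;

	assert(nr_nodes > 0 && node_idx >= 0 && node_idx < nr_nodes);
	r.start = node_idx * num_comp_vectors / nr_nodes;
	r.end = (node_idx + 1) * num_comp_vectors / nr_nodes;
	return r;
}
```

On a single-node system the range is the full [0, num_comp_vectors), which matches the message's remark that single-node systems benefit the most.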
Fixes: d92c0da71a35 ("IB/srp: Add multichannel support")
Reported-by: Alexander Schmid <alex@modula-shop-systems.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Alexander Schmid <alex@modula-shop-systems.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adit Ranadive [Thu, 15 Feb 2018 20:36:46 +0000 (12:36 -0800)]
RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file
This ensures that we return the right structures back to userspace.
Otherwise, it looks like the reserved fields in the response structures
in userspace might have uninitialized data in them.
Fixes: 8b10ba783c9d ("RDMA/vmw_pvrdma: Add shared receive queue support")
Fixes: 29c8d9eba550 ("IB: Add vmw_pvrdma driver")
Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:40 +0000 (12:35 +0200)]
RDMA/uverbs: Sanitize user entered port numbers prior to accessing them
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs+0x6f2/0x8c0
Read of size 4 at addr ffff88006476a198 by task syzkaller697701/265
CPU: 0 PID: 265 Comm: syzkaller697701 Not tainted 4.15.0+ #90
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
? show_regs_print_info+0x17/0x17
? lock_contended+0x11a0/0x11a0
print_address_description+0x83/0x3e0
kasan_report+0x18c/0x4b0
? copy_ah_attr_from_uverbs+0x6f2/0x8c0
? copy_ah_attr_from_uverbs+0x6f2/0x8c0
? lookup_get_idr_uobject+0x120/0x200
? copy_ah_attr_from_uverbs+0x6f2/0x8c0
copy_ah_attr_from_uverbs+0x6f2/0x8c0
? modify_qp+0xd0e/0x1350
modify_qp+0xd0e/0x1350
ib_uverbs_modify_qp+0xf9/0x170
? ib_uverbs_query_qp+0xa70/0xa70
ib_uverbs_write+0x7f9/0xef0
? attach_entity_load_avg+0x8b0/0x8b0
? ib_uverbs_query_qp+0xa70/0xa70
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? print_irqtrace_events+0x280/0x280
? sched_clock_cpu+0x18/0x200
? _raw_spin_unlock_irq+0x29/0x40
? _raw_spin_unlock_irq+0x29/0x40
? _raw_spin_unlock_irq+0x29/0x40
? time_hardirqs_on+0x27/0x670
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? _raw_spin_unlock_irq+0x29/0x40
? finish_task_switch+0x1bd/0x7a0
? finish_task_switch+0x194/0x7a0
? prandom_u32_state+0xe/0x180
? rcu_read_unlock+0x80/0x80
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x433c29
RSP: 002b:00007ffcf2be82a8 EFLAGS: 00000217
Allocated by task 62:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc+0x141/0x480
dup_fd+0x101/0xcc0
copy_process.part.62+0x166f/0x4390
_do_fork+0x1cb/0xe90
kernel_thread+0x34/0x40
call_usermodehelper_exec_work+0x112/0x260
process_one_work+0x929/0x1aa0
worker_thread+0x5c6/0x12a0
kthread+0x346/0x510
ret_from_fork+0x3a/0x50
Freed by task 259:
kasan_slab_free+0x71/0xc0
kmem_cache_free+0xf3/0x4c0
put_files_struct+0x225/0x2c0
exit_files+0x88/0xc0
do_exit+0x67c/0x1520
do_group_exit+0xe8/0x380
SyS_exit_group+0x1e/0x20
entry_SYSCALL_64_fastpath+0x1e/0x8b
The buggy address belongs to the object at ffff88006476a000
which belongs to the cache files_cache of size 832
The buggy address is located 408 bytes inside of
832-byte region [ffff88006476a000, ffff88006476a340)
The buggy address belongs to the page:
page:ffffea000191da80 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x4000000000008100(slab|head)
raw: 4000000000008100 0000000000000000 0000000000000000 0000000100080008
raw: 0000000000000000 0000000100000001 ffff88006bcf7a80 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88006476a080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88006476a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88006476a180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88006476a200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88006476a280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
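The report above comes from using a user-supplied port number without a range check. A minimal sketch of the sanitizing step named in the title follows; port_table, MAX_PORTS and read_port_attr are hypothetical, and the first/last port bounds stand in for whatever range the device really exposes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PORTS 4	/* hypothetical table size */

/* Hypothetical per-device table indexed by (port - first_port). */
struct port_table {
	unsigned int first_port;
	unsigned int last_port;
	int attrs[MAX_PORTS];
};

/* Range-check the user-supplied port number before using it as an index. */
static int read_port_attr(const struct port_table *t, unsigned int port, int *out)
{
	assert(t != NULL && out != NULL);
	assert(t->first_port <= t->last_port &&
	       t->last_port - t->first_port < MAX_PORTS);
	if (port < t->first_port || port > t->last_port)
		return -EINVAL;
	*out = t->attrs[port - t->first_port];
	return 0;
}
```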
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 44c58487d51a ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:39 +0000 (12:35 +0200)]
RDMA/uverbs: Fix circular locking dependency
Avoid a circular locking dependency by calling
uobj_alloc_commit() outside of the xrcd_tree_mutex lock.
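The ordering change can be modelled with a toy, single-threaded sketch. The "locks" below only record whether they are held; real code uses mutexes. tree_mutex and uobjects_lock stand in for the two locks named in the report, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded "lock": it only records whether it is held. */
struct toy_lock {
	bool held;
};

static struct toy_lock tree_mutex;	/* stands in for xrcd_tree_mutex */
static struct toy_lock uobjects_lock;	/* stands in for uobjects_lock */
static bool nested;	/* set when uobjects_lock is taken under tree_mutex */

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Hypothetical commit step: it needs uobjects_lock internally. */
static void commit_object(void)
{
	toy_lock_acquire(&uobjects_lock);
	if (tree_mutex.held)
		nested = true;	/* this nesting is one half of the reported cycle */
	toy_lock_release(&uobjects_lock);
}

/* Ordering before the fix: commit while tree_mutex is still held. */
static void open_object_nested(void)
{
	toy_lock_acquire(&tree_mutex);
	commit_object();
	toy_lock_release(&tree_mutex);
}

/* Ordering after the fix: drop tree_mutex first, then commit. */
static void open_object_fixed(void)
{
	toy_lock_acquire(&tree_mutex);
	/* the tree update would happen here */
	toy_lock_release(&tree_mutex);
	commit_object();
}
```

The cleanup path takes the two locks in the opposite order (uobjects_lock, then xrcd_tree_mutex), so removing the nesting on the open path breaks the cycle.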
======================================================
WARNING: possible circular locking dependency detected
4.15.0+ #87 Not tainted
------------------------------------------------------
syzkaller401056/269 is trying to acquire lock:
(&uverbs_dev->xrcd_tree_mutex){+.+.}, at: [<000000006c12d2cd>] uverbs_free_xrcd+0xd2/0x360
but task is already holding lock:
(&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&ucontext->uobjects_lock){+.+.}:
__mutex_lock+0x111/0x1720
rdma_alloc_commit_uobject+0x22c/0x600
ib_uverbs_open_xrcd+0x61a/0xdd0
ib_uverbs_write+0x7f9/0xef0
__vfs_write+0x10d/0x700
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
entry_SYSCALL_64_fastpath+0x1e/0x8b
-> #0 (&uverbs_dev->xrcd_tree_mutex){+.+.}:
lock_acquire+0x19d/0x440
__mutex_lock+0x111/0x1720
uverbs_free_xrcd+0xd2/0x360
remove_commit_idr_uobject+0x6d/0x110
uverbs_cleanup_ucontext+0x2f0/0x730
ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
ib_uverbs_close+0xf2/0x570
__fput+0x2cd/0x8d0
task_work_run+0xec/0x1d0
do_exit+0x6a1/0x1520
do_group_exit+0xe8/0x380
SyS_exit_group+0x1e/0x20
entry_SYSCALL_64_fastpath+0x1e/0x8b
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&ucontext->uobjects_lock);
                               lock(&uverbs_dev->xrcd_tree_mutex);
                               lock(&ucontext->uobjects_lock);
  lock(&uverbs_dev->xrcd_tree_mutex);
*** DEADLOCK ***
3 locks held by syzkaller401056/269:
#0: (&file->cleanup_mutex){+.+.}, at: [<00000000c9f0c252>] ib_uverbs_close+0xac/0x570
#1: (&ucontext->cleanup_rwsem){++++}, at: [<00000000b6994d49>] uverbs_cleanup_ucontext+0xf6/0x730
#2: (&ucontext->uobjects_lock){+.+.}, at: [<00000000da010f09>] uverbs_cleanup_ucontext+0x168/0x730
stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller401056 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
? uverbs_cleanup_ucontext+0x168/0x730
? console_unlock+0x502/0xbd0
print_circular_bug.isra.24+0x35e/0x396
? print_circular_bug_header+0x12e/0x12e
? find_usage_backwards+0x30/0x30
? entry_SYSCALL_64_fastpath+0x1e/0x8b
validate_chain.isra.28+0x25d1/0x40c0
? check_usage+0xb70/0xb70
? graph_lock+0x160/0x160
? find_usage_backwards+0x30/0x30
? cyc2ns_read_end+0x10/0x10
? print_irqtrace_events+0x280/0x280
? __lock_acquire+0x93d/0x1630
__lock_acquire+0x93d/0x1630
lock_acquire+0x19d/0x440
? uverbs_free_xrcd+0xd2/0x360
__mutex_lock+0x111/0x1720
? uverbs_free_xrcd+0xd2/0x360
? uverbs_free_xrcd+0xd2/0x360
? __mutex_lock+0x828/0x1720
? mutex_lock_io_nested+0x1550/0x1550
? uverbs_cleanup_ucontext+0x168/0x730
? __lock_acquire+0x9a9/0x1630
? mutex_lock_io_nested+0x1550/0x1550
? uverbs_cleanup_ucontext+0xf6/0x730
? lock_contended+0x11a0/0x11a0
? uverbs_free_xrcd+0xd2/0x360
uverbs_free_xrcd+0xd2/0x360
remove_commit_idr_uobject+0x6d/0x110
uverbs_cleanup_ucontext+0x2f0/0x730
? sched_clock_cpu+0x18/0x200
? uverbs_close_fd+0x1c0/0x1c0
ib_uverbs_cleanup_ucontext.constprop.3+0x52/0x120
ib_uverbs_close+0xf2/0x570
? ib_uverbs_remove_one+0xb50/0xb50
? ib_uverbs_remove_one+0xb50/0xb50
__fput+0x2cd/0x8d0
task_work_run+0xec/0x1d0
do_exit+0x6a1/0x1520
? fsnotify_first_mark+0x220/0x220
? exit_notify+0x9f0/0x9f0
? entry_SYSCALL_64_fastpath+0x5/0x8b
? entry_SYSCALL_64_fastpath+0x5/0x8b
? trace_hardirqs_on_thunk+0x1a/0x1c
? time_hardirqs_on+0x27/0x670
? time_hardirqs_off+0x27/0x490
? syscall_return_slowpath+0x6c/0x460
? entry_SYSCALL_64_fastpath+0x5/0x8b
do_group_exit+0xe8/0x380
SyS_exit_group+0x1e/0x20
entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x431ce9
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db6e ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:38 +0000 (12:35 +0200)]
RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd
There is no matching lock for this mutex. Git history suggests this is
just a missed remnant from an earlier version of the function before
this locking was moved into uverbs_free_xrcd.
Originally this lock protected the xrcd_table_delete() call.
=====================================
WARNING: bad unlock balance detected!
4.15.0+ #87 Not tainted
-------------------------------------
syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
[<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller223405/269:
#0: (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0
stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller223405 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
? ib_uverbs_write+0x265/0xef0
? console_unlock+0x502/0xbd0
? ib_uverbs_close_xrcd+0x195/0x1f0
print_unlock_imbalance_bug+0x131/0x160
lock_release+0x59d/0x1100
? ib_uverbs_close_xrcd+0x195/0x1f0
? lock_acquire+0x440/0x440
? lock_acquire+0x440/0x440
__mutex_unlock_slowpath+0x88/0x670
? wait_for_completion+0x4c0/0x4c0
? rdma_lookup_get_uobject+0x145/0x2f0
ib_uverbs_close_xrcd+0x195/0x1f0
? ib_uverbs_open_xrcd+0xdd0/0xdd0
ib_uverbs_write+0x7f9/0xef0
? cyc2ns_read_end+0x10/0x10
? ib_uverbs_open_xrcd+0xdd0/0xdd0
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x358/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x4335c9
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db6e ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:37 +0000 (12:35 +0200)]
RDMA/restrack: Increment CQ restrack object before committing
Once the uobj is committed it is immediately possible another thread
could destroy it, which, in the worst case, can result in a use-after-free
of the restrack objects.
Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 08f294a1524b ("RDMA/core: Add resource tracking for create and destroy CQs")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Tue, 13 Feb 2018 10:18:41 +0000 (12:18 +0200)]
RDMA/uverbs: Protect from command mask overflow
The command number is not bounds checked against the command mask before it
is shifted, resulting in a UBSAN hit. This does not cause a malfunction since
the command number is eventually bounds checked, but we can make this ubsan
clean by moving the bounds check to before the mask check.
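The ordering described above can be sketched in plain C as follows. This is
only an illustration under stated assumptions, not the uverbs code: the
helper name command_allowed() and the constant CMD_MASK_BITS are hypothetical.
The point is that the range check runs before the shift, so the shift
exponent can never reach 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: number of bits available in a 64-bit command mask. */
#define CMD_MASK_BITS 64u

/*
 * Return true only if `command` is in range and its bit is set in `mask`.
 * The bounds check comes first, so the shift below always has an exponent
 * smaller than 64 and is well defined.
 */
static bool command_allowed(uint64_t mask, uint32_t command)
{
	if (command >= CMD_MASK_BITS)
		return false;	/* reject before shifting */

	return (mask & (UINT64_C(1) << command)) != 0;
}
```

With this ordering, a command number such as the 207 reported below is
rejected without ever being used as a shift count.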
================================================================================
UBSAN: Undefined behaviour in drivers/infiniband/core/uverbs_main.c:647:21
shift exponent 207 is too large for 64-bit type 'long long unsigned int'
CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
ubsan_epilogue+0xe/0x81
__ubsan_handle_shift_out_of_bounds+0x293/0x2f7
? debug_check_no_locks_freed+0x340/0x340
? __ubsan_handle_load_invalid_value+0x19b/0x19b
? lock_acquire+0x440/0x440
? lock_acquire+0x19d/0x440
? __might_fault+0xf4/0x240
? ib_uverbs_write+0x68d/0xe20
ib_uverbs_write+0x68d/0xe20
? __lock_acquire+0xcf7/0x3940
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x35b/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29
RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000
================================================================================
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.5
Fixes: 2dbd5186a39c ("IB/core: IB/core: Allow legacy verbs through extended interfaces")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:40 +0000 (12:18 +0200)]
IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy
If remove_commit fails then the lock is left locked while the uobj still
exists. Eventually the kernel will deadlock.
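A minimal sketch of the pattern, assuming toy stand-ins rather than the real
kernel types: toy_rwsem, toy_uobj, toy_remove_commit() and
toy_explicit_destroy() are all hypothetical names. It shows the error path
jumping to a single unlock site instead of returning with the lock held.

```c
#include <errno.h>

/* Toy stand-ins, not the kernel types: a counting "rwsem" and an object. */
struct toy_rwsem { int readers; };
struct toy_uobj {
	struct toy_rwsem *cleanup_rwsem;
	int fail_remove;	/* force the remove step to fail */
	int destroyed;
};

static void toy_down_read(struct toy_rwsem *sem) { sem->readers++; }
static void toy_up_read(struct toy_rwsem *sem)   { sem->readers--; }

/* Hypothetical remove step that may fail and leave the object alive. */
static int toy_remove_commit(struct toy_uobj *uobj)
{
	if (uobj->fail_remove)
		return -EBUSY;
	uobj->destroyed = 1;
	return 0;
}

static int toy_explicit_destroy(struct toy_uobj *uobj)
{
	int ret;

	toy_down_read(uobj->cleanup_rwsem);
	ret = toy_remove_commit(uobj);
	if (ret)
		goto out;	/* the error path must still reach the unlock */

	/* success-only work would go here */
out:
	toy_up_read(uobj->cleanup_rwsem);
	return ret;
}
```

Whether the remove step succeeds or fails, the reader count returns to zero
before the function returns.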
lockdep detects this and says:
test/4221 is leaving the kernel with locks still held!
1 lock held by test/4221:
#0: (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]
Fixes: 4da70da23e9b ("IB/core: Explicitly destroy an object while keeping uobject")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:38 +0000 (12:18 +0200)]
IB/uverbs: Improve lockdep_check
This is really being used as an assert that the expected usecnt
is being held and implicitly that the usecnt is valid. Rename it to
assert_uverbs_usecnt and tighten the checks to only accept valid
values of usecnt (e.g. 0 and < -1 are invalid).
The tighter checks make the assertion cover more cases and make it more
likely to find bugs via syzkaller/etc.
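As a rough illustration of the tightened check, here is a user-space
predicate. It assumes a convention that the text above only implies: -1
means the object is held exclusively and a positive value counts shared
holders. The function name usecnt_is_valid_hold() is hypothetical and this
is not the kernel's assert_uverbs_usecnt.

```c
#include <stdbool.h>

/*
 * Assumed convention (an assumption of this sketch):
 *   -1  -> held exclusively
 *   > 0 -> held by that many shared users
 *   0   -> not held, so invalid where a hold is expected
 *   < -1 -> never valid
 */
static bool usecnt_is_valid_hold(int usecnt, bool exclusive)
{
	if (exclusive)
		return usecnt == -1;

	return usecnt > 0;
}
```

A looser check that only rejects one value would accept 0 or -2; the
predicate above rejects both for either kind of hold.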
Fixes: 3832125624b7 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Tue, 13 Feb 2018 10:18:37 +0000 (12:18 +0200)]
RDMA/uverbs: Protect from races between lookup and destroy of uobjects
The race is between lookup_get_idr_uobject and
uverbs_idr_remove_uobj -> uverbs_uobject_put.
We deliberately do not call synchronize_rcu after the idr_remove in
uverbs_idr_remove_uobj for performance reasons; instead we call
kfree_rcu() during uverbs_uobject_put.
However, this means we can obtain pointers to uobj's that have
already been released, and must use kref_get_unless_zero to avoid
taking a reference on them.
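The "get unless zero" idea can be sketched with C11 atomics. This is a
user-space illustration only; toy_ref and toy_ref_get_unless_zero() are
hypothetical names and this is not the kernel's kref implementation. It also
assumes the memory behind the counter stays valid while it is inspected,
which in the kernel is what the RCU read side and kfree_rcu() provide.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reference counter; a count of 0 means the object is being released. */
struct toy_ref {
	atomic_int count;
};

/*
 * Take a reference only if the count is still non-zero. Returns false when
 * the object has already dropped to zero, so the caller must treat the
 * lookup as a miss instead of using the pointer.
 */
static bool toy_ref_get_unless_zero(struct toy_ref *ref)
{
	int old = atomic_load(&ref->count);

	while (old != 0) {
		/* On failure, `old` is updated to the current value and we retry. */
		if (atomic_compare_exchange_weak(&ref->count, &old, old + 1))
			return true;
	}
	return false;
}
```

A plain unconditional increment would turn a count of 0 back into 1 and
hand out a pointer to an object that is already on its way to being freed.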
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441
CPU: 1 PID: 441 Comm: syz-executor2 Not tainted 4.15.0-rc2+ #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xd4
print_address_description+0x73/0x290
kasan_report+0x25c/0x370
? copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
? uverbs_try_lock_object+0x68/0xc0
? modify_qp.isra.7+0xdc4/0x10e0
modify_qp.isra.7+0xdc4/0x10e0
ib_uverbs_modify_qp+0xfe/0x170
? ib_uverbs_query_qp+0x970/0x970
? __lock_acquire+0xa11/0x1da0
ib_uverbs_write+0x55a/0xad0
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_open+0x760/0x760
? futex_wake+0x147/0x410
? sched_clock_cpu+0x18/0x180
? check_prev_add+0x1680/0x1680
? do_futex+0x3b6/0xa30
? sched_clock_cpu+0x18/0x180
__vfs_write+0xf7/0x5c0
? ib_uverbs_open+0x760/0x760
? kernel_read+0x110/0x110
? lock_acquire+0x370/0x370
? __fget+0x264/0x3b0
vfs_write+0x18a/0x460
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f443fee0c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f443fee16bc RCX: 0000000000448e29
RDX: 0000000000000078 RSI: 00000000209f8000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000008e98 R14: 00000000006ebf38 R15: 0000000000000000
Allocated by task 1:
kmem_cache_alloc_trace+0x16c/0x2f0
mlx5_alloc_cmd_msg+0x12e/0x670
cmd_exec+0x419/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
Freed by task 1:
kfree+0xeb/0x2f0
mlx5_free_cmd_msg+0xcd/0x140
cmd_exec+0xeba/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
The buggy address belongs to the object at ffff88005fda1ab0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 24 bytes inside of
32-byte region [ffff88005fda1ab0, ffff88005fda1ad0)
The buggy address belongs to the page:
page:00000000d5655c19 count:1 mapcount:0 mapping: (null) index:0xffff88005fda1fc0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 0000000000000000 ffff88005fda1fc0 0000000180550008
raw: ffffea00017f6780 0000000400000004 ffff88006c803980 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc
ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb
ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==================================================================
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 3832125624b7 ("IB/core: Add support for idr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:36 +0000 (12:18 +0200)]
IB/uverbs: Hold the uobj write lock after allocate
This clarifies the design intention that time between allocate and
commit has the uobj exclusive to the caller. We already guarantee
this by delaying publishing the uobj pointer via idr_insert,
fd_install, list_add, etc.
Additionally holding the usecnt lock during this period provides
extra clarity and more protection against future mistakes.
Fixes: 3832125624b7 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Tue, 13 Feb 2018 10:18:35 +0000 (12:18 +0200)]
IB/uverbs: Fix possible oops with duplicate ioctl attributes
If the same attribute is listed twice by the user in the ioctl attribute
list then error unwind can cause the kernel to deref garbage.
This happens when an object with WRITE access is sent twice. The second
parse properly fails but corrupts the state required for the error unwind
it triggers.
Fix this by making duplicates in the attribute list invalid. This is
not something we need to support.
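One way to reject duplicates up front is a "seen" bitmap, sketched below.
This is an illustration under assumptions, not the kernel's ioctl parser:
toy_check_attr_ids(), TOY_MAX_ATTR_ID and the flat array of ids are all
hypothetical. The check runs before any attribute is acted on, so there is
no partially built state to unwind when a duplicate is found.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on attribute ids, chosen to fit one 64-bit word. */
#define TOY_MAX_ATTR_ID 64u

/*
 * Return 0 if every id is in range and appears at most once,
 * or -1 on the first out-of-range or repeated id.
 */
static int toy_check_attr_ids(const uint16_t *ids, size_t n)
{
	uint64_t seen = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t bit;

		if (ids[i] >= TOY_MAX_ATTR_ID)
			return -1;	/* also keeps the shift below in range */

		bit = UINT64_C(1) << ids[i];
		if (seen & bit)
			return -1;	/* same attribute listed twice */
		seen |= bit;
	}
	return 0;
}
```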
The ioctl interface is currently recommended to be disabled in Kconfig.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>