Geetha sowjanya [Tue, 16 Oct 2018 11:27:20 +0000 (16:57 +0530)]
octeontx2-af: Support for disabling NIX RQ/SQ/CQ contexts
This patch adds support for a RVU PF/VF to disable all RQ/SQ/CQ
contexts of a NIX LF via mbox. This will be used by PF/VF drivers
upon teardown or while freeing up HW resources.
A HW context which is not INIT'ed cannot be modified and a
RVU PF/VF driver may or may not INIT all the RQ/SQ/CQ contexts.
So a bitmap is introduced to keep track of enabled NIX RQ/SQ/CQ
contexts, so that only enabled hw contexts are disabled upon LF
teardown.
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:19 +0000 (16:57 +0530)]
octeontx2-af: NIX AQ instruction enqueue support
Add support for a RVU PF/VF to submit instructions to NIX AQ
via mbox. Instructions can be to init/write/read RQ/SQ/CQ/RSS
contexts. In case of read, context will be returned as part of
response to the mbox msg received.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:18 +0000 (16:57 +0530)]
octeontx2-af: Alloc bitmaps for NIX Tx scheduler queues
Allocate bitmaps and memory for PFVF mapping info for
maintaining NIX transmit scheduler queues maintenance.
PF/VF drivers will request for alloc, free e.t.c of
Tx schedulers via mailbox.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:17 +0000 (16:57 +0530)]
octeontx2-af: NIX LSO config for TSOv4/v6 offload
Config LSO formats for TSOv4 and TSOv6 offloads.
These formats tell HW which fields in the TCP packet's
headers have to be updated while performing segmentation
offload.
Also report PF/VF drivers the LSO format indices as part
of response to NIX_LF_ALLOC mbox msg. These indices are
used in SQE extension headers while framing SQE for pkt
transmission with TSO offload.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:16 +0000 (16:57 +0530)]
octeontx2-af: NIX block LF initialization
Upon receiving NIX_LF_ALLOC mbox message allocate memory for
NIXLF's CQ, SQ, RQ, CINT, QINT and RSS HW contexts and configure
respective base iova HW. Enable caching of contexts into NIX NDC.
Return SQ buffer (SQB) size, this PF/VF MAC address etc info
e.t.c to the mbox msg sender.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:15 +0000 (16:57 +0530)]
octeontx2-af: NIX block admin queue init
Initialize NIX admin queue (AQ) i.e alloc memory for
AQ instructions and for the results. All NIX LFs will submit
instructions to AQ to init/write/read RQ/SQ/CQ/RSS contexts
and in case of read, get context from result memory.
Also before configuring/using NIX block calibrate X2P bus
and check if NIX interfaces like CGX and LBK are in active
and working state.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geetha sowjanya [Tue, 16 Oct 2018 11:27:14 +0000 (16:57 +0530)]
octeontx2-af: Support for disabling NPA Aura/Pool contexts
This patch adds support for a RVU PF/VF to disable all Aura/Pool
contexts of a NPA LF via mbox. This will be used by PF/VF drivers
upon teardown or while freeing up HW resources.
A HW context which is not INIT'ed cannot be modified and a
RVU PF/VF driver may or may not INIT all the Aura/Pool contexts.
So a bitmap is introduced to keep track of enabled NPA Aura/Pool
contexts, so that only enabled hw contexts are disabled upon LF
teardown.
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:13 +0000 (16:57 +0530)]
octeontx2-af: NPA AQ instruction enqueue support
Add support for a RVU PF/VF to submit instructions to NPA AQ
via mbox. Instructions can be to init/write/read Aura/Pool/Qint
contexts. In case of read, context will be returned as part of
response to the mbox msg received.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:12 +0000 (16:57 +0530)]
octeontx2-af: NPA block LF initialization
Upon receiving NPA_LF_ALLOC mbox message allocate memory for
NPALF's aura, pool and qint contexts and configure the same
to HW. Enable caching of contexts into NPA NDC.
Return pool related info like stack size, num pointers per
stack page e.t.c to the mbox msg sender.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:11 +0000 (16:57 +0530)]
octeontx2-af: NPA block admin queue init
Initialize NPA admin queue (AQ) i.e alloc memory for
AQ instructions and for the results. All NPA LFs will submit
instructions to AQ to init/write/read Aura/Pool contexts
and in case of read, get context from result memory.
Added some common APIs for allocating memory for a queue
and get IOVA in return, these APIs will be used by
NIX AQ and for other purposes.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geetha sowjanya [Tue, 16 Oct 2018 11:27:10 +0000 (16:57 +0530)]
octeontx2-af: Enable or disable CGX internal loopback
Add support to enable or disable internal loopback mode in CGX.
New mbox IDs CGX_INTLBK_ENABLE/DISABLE added for this.
Signed-off-by: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linu Cherian [Tue, 16 Oct 2018 11:27:09 +0000 (16:57 +0530)]
octeontx2-af: Forward CGX link notifications to PFs
Upon receiving notification from firmware the CGX event handler
in the AF driver gets the current link info such as status, speed,
duplex etc from CGX driver and sends it across to PFs who have
registered to receive such notifications.
To support above
- Mbox messaging support for sending msgs from AF to PF has been added.
- Added mbox msgs so that PFs can register/unregister for link events.
- Link notifications are sent to PF under two scenarioss.
1. When a asynchronous link change notification is received from
firmware with notification flag turned on for that PF.
2. Upon notification turn on request, the current link status is
send to the PF.
Also added a new mailbox msg using which RVU PF/VF can retrieve
their mapped CGX LMAC's current link info. Link info includes
status, speed, duplex and lmac type.
Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vidhya Raman [Tue, 16 Oct 2018 11:27:08 +0000 (16:57 +0530)]
octeontx2-af: Support for MAC address filters in CGX
This patch adds support for setting MAC address filters in CGX
for PF interfaces. Also PF interfaces can be put in promiscuous
mode. Dataplane PFs access this functionality using mailbox
messages to the AF driver.
Signed-off-by: Vidhya Raman <vraman@marvell.com>
Signed-off-by: Stanislaw Kardach <skardach@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christina Jacob [Tue, 16 Oct 2018 11:27:07 +0000 (16:57 +0530)]
octeontx2-af: Support to retrieve CGX LMAC stats
This patch adds support for a RVU PF/VF driver to retrieve
it's mapped CGX LMAC Rx and Tx stats from AF via mbox.
New mailbox msg is added is added.
Signed-off-by: Christina Jacob <cjacob@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:06 +0000 (16:57 +0530)]
octeontx2-af: CGX Rx/Tx enable/disable mbox handlers
Added new mailbox msgs for RVU PF/VFs to request AF
to enable/disable their mapped CGX::LMAC Rx & Tx.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Linu Cherian <lcherian@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Tue, 16 Oct 2018 11:27:05 +0000 (16:57 +0530)]
octeontx2-af: Improve register polling loop
Instead of looping on a integer timeout, use time_before(jiffies),
so that maximum poll time is capped.
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 18 Oct 2018 00:45:08 +0000 (17:45 -0700)]
Merge branch 'mlxsw-Add-VxLAN-support'
Ido Schimmel says:
====================
mlxsw: Add VxLAN support
This patchset adds support for VxLAN offload in the mlxsw driver.
With regards to the forwarding plane, VxLAN support is composed from two
main parts: Encapsulation and decapsulation.
In the device, NVE encapsulation (and VxLAN in particular) takes place
in the bridge. A packet can be encapsulated using VxLAN either because
it hit an FDB entry that forwards it to the router with the IP of the
remote VTEP or because it was flooded, in which case it is sent to a
list of remote VTEPs (in addition to local ports). In either case, the
VNI is derived from the filtering identifier (FID) the packet was
classified to at ingress and the underlay source IP is taken from a
device global configuration.
VxLAN decapsulation takes place in the underlay router, where packets
that hit a local route that corresponds to the source IP of the local
VTEP are decapsulated and injected to the bridge. The packets are
classified to a FID based on the VNI they came with.
The first six patches export the required APIs in the VxLAN and mlxsw
drivers in order to allow for the introduction of the NVE core in the
next two patches. The NVE core is designed to support a variety of NVE
encapsulations (e.g., VxLAN, NVGRE) and different ASICs, but currently
only VxLAN and Spectrum are supported. Spectrum-2 support will be added
in the future.
The last 10 patches add support for VxLAN decapsulation and
encapsulation and include the addition of the required switchdev APIs in
the VxLAN driver. These APIs allow capable drivers to get a notification
about the addition / deletion of FDB entries to / from the VxLAN's FDB.
Subsequent patchset will add selftests (generic and mlxsw-specific),
data plane learning, FDB extack and vetoing and support for VLAN-aware
bridges (one VNI per VxLAN device model).
v2:
* Implement netif_is_vxlan() using rtnl_link_ops->kind (Jakub & Stephen)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:32 +0000 (08:53 +0000)]
mlxsw: spectrum_switchdev: Add support for VxLAN encapsulation
In the device, VxLAN encapsulation takes place in the FDB table where
certain {MAC, FID} entries are programmed with an underlay unicast IP.
MAC addresses that are not programmed in the FDB are flooded to the
relevant local ports and also to a list of underlay unicast IPs that are
programmed using the all zeros MAC address in the VxLAN driver.
One difference between the hardware and software data paths is the fact
that in the software data path there are two FDB lookups prior to the
encapsulation of the packet. First in the bridge's FDB table using {MAC,
VID} and another in the VxLAN's FDB table using {MAC, VNI}.
Therefore, when a new VxLAN FDB entry is notified, it is only programmed
to the device if there is a corresponding entry in the bridge's FDB
table. Similarly, when a new bridge FDB entry pointing to the VxLAN
device is notified, it is only programmed to the device if there is a
corresponding entry in the VxLAN's FDB table.
Note that the above scheme will result in a discrepancy between both
data paths if only one FDB table is populated in the software data path.
For example, if only the bridge's FDB is populated with an entry
pointing to a VxLAN device, then a packet hitting the entry will only be
flooded by the kernel to remote VTEPs whereas the device will also flood
the packets to other local ports member in the VLAN.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:31 +0000 (08:53 +0000)]
mlxsw: spectrum: Enable VxLAN enslavement to bridges
Enslavement of VxLAN devices to offloaded bridges was never forbidden by
mlxsw, but this patch makes sure the required configuration is performed
in order to allow VxLAN encapsulation and decapsulation to take place in
the device.
The patch handles both the case where a VxLAN device is enslaved to an
already offloaded bridge and the case where the first mlxsw port is
enslaved to a bridge that already has VxLAN device configured.
Invalid configurations are sanitized and an error string is returned via
extack.
Since encapsulation and decapsulation do not occur when the VxLAN device
is down, the driver makes sure to enable / disable these functionalities
based on NETDEV_PRE_UP and NETDEV_DOWN events.
Note that NETDEV_PRE_UP is used in favor of NETDEV_UP, as the former
allows to veto the operation, if necessary.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:29 +0000 (08:53 +0000)]
bridge: switchdev: Allow clearing FDB entry offload indication
Currently, an FDB entry only ceases being offloaded when it is deleted.
This changes with VxLAN encapsulation.
Devices capable of performing VxLAN encapsulation usually have only one
FDB table, unlike the software data path which has two - one in the
bridge driver and another in the VxLAN driver.
Therefore, bridge FDB entries pointing to a VxLAN device are only
offloaded if there is a corresponding entry in the VxLAN FDB.
Allow clearing the offload indication in case the corresponding entry
was deleted from the VxLAN FDB.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 17 Oct 2018 08:53:27 +0000 (08:53 +0000)]
vxlan: Notify for each remote of a removed FDB entry
When notifications are sent about FDB activity, and an FDB entry with
several remotes is removed, the notification is sent only for the first
destination. That makes it impossible to distinguish between the case
where only this first remote is removed, and the one where the FDB entry
is removed as a whole.
Therefore send one notification for each remote of a removed FDB entry.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 17 Oct 2018 08:53:26 +0000 (08:53 +0000)]
vxlan: Support marking RDSTs as offloaded
Offloaded bridge FDB entries are marked with NTF_OFFLOADED. Implement a
similar mechanism for VXLAN, where a given remote destination can be
marked as offloaded.
To that end, introduce a new event, SWITCHDEV_VXLAN_FDB_OFFLOADED,
through which the marking is communicated to the vxlan driver. To
identify which RDST should be marked as offloaded, an
switchdev_notifier_vxlan_fdb_info is passed to the listeners. The
"offloaded" flag in that object determines whether the offloaded mark
should be set or cleared.
When sending offloaded FDB entries over netlink, mark them with
NTF_OFFLOADED.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 17 Oct 2018 08:53:24 +0000 (08:53 +0000)]
vxlan: Add vxlan_fdb_find_uc() for FDB querying
A switchdev-capable driver that is aware of VXLAN may need to query
VXLAN FDB. In the particular case of mlxsw, this functionality is
limited to querying UC FDBs. Those being easier to deal with than the
general case of RDST chain traversal, introduce an interface to query
specifically UC FDBs: vxlan_fdb_find_uc().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 17 Oct 2018 08:53:22 +0000 (08:53 +0000)]
vxlan: Add switchdev notifications
When offloading VXLAN devices, drivers need to know about events in
VXLAN FDB database. Since VXLAN models a bridge, it is natural to
distribute the VXLAN FDB notifications using the pre-existing switchdev
notification mechanism.
To that end, introduce two new notification types:
SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE and SWITCHDEV_VXLAN_FDB_DEL_TO_DEVICE.
Introduce a new function, vxlan_fdb_switchdev_call_notifiers() to send
the new notifier types, and a struct switchdev_notifier_vxlan_fdb_info
to communicate the details of the FDB entry under consideration.
Invoke the new function from vxlan_fdb_notify().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:20 +0000 (08:53 +0000)]
net: Add netif_is_vxlan()
Add the ability to determine whether a netdev is a VxLAN netdev by
calling the above mentioned function that checks the netdev's
rtnl_link_ops.
This will allow modules to identify netdev events involving a VxLAN
netdev and act accordingly. For example, drivers capable of VxLAN
offload will need to configure the underlying device when a VxLAN netdev
is being enslaved to an offloaded bridge.
Convert nfp to use the newly introduced helper.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:19 +0000 (08:53 +0000)]
mlxsw: spectrum_router: Configure matching local routes for NVE decap
When a local route that matches the source IP of an offloaded NVE tunnel
is notified, the driver needs to program it to perform NVE decapsulation
instead of merely trapping packets to the CPU.
This patch complements "mlxsw: spectrum_router: Enable local routes
promotion to perform NVE decap" where existing local routes were
promoted to perform NVE decapsulation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:17 +0000 (08:53 +0000)]
mlxsw: spectrum_fid: Clear NVE configuration when destroying 802.1D FIDs
802.1D FIDs are used to represent VLAN-unaware bridges and currently
this is the only type of FID that supports NVE configuration.
Since the NVE tunnel device does not take a reference on the FID, it is
possible for the FID to be destroyed when it still has NVE
configuration.
Therefore, when destroying the FID make sure to disable its NVE
configuration.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:16 +0000 (08:53 +0000)]
mlxsw: spectrum_nve: Implement VxLAN operations
The common NVE core expects each encapsulation type to implement a
certain set of operations that are specific to this type and the
currently used ASIC. These operations include things such as the ability
to determine whether a certain NVE configuration can be offloaded and
ASIC-specific initialization for this type.
Implement these operations for VxLAN on the Spectrum ASIC. Spectrum-2
support will be added by a future patchset.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:14 +0000 (08:53 +0000)]
mlxsw: spectrum_nve: Implement common NVE core
The Spectrum ASIC supports different types of NVE encapsulations (e.g.,
VxLAN, NVGRE) with more types to be supported by future ASICs.
Despite being different, all these encapsulations share some common
functionality such as the enablement of NVE encapsulation on a given
filtering identifier (FID) and the addition of remote VTEPs to the
linked-list of VTEPs that traffic should be flooded to.
Implement this common core and allow different ASICs to register
different operations for different encapsulation types.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:12 +0000 (08:53 +0000)]
inet: Refactor INET_ECN_decapsulate()
Drivers that support tunnel decapsulation (IPinIP or NVE) need to
configure the underlying device to conform to the behavior outlined in
RFC 6040 with respect to the ECN bits.
This behavior is implemented by INET_ECN_decapsulate() which requires an
skb to be passed where the ECN CE bit can be potentially set. Since
these drivers do not need to mark an skb, but only configure the device
to do so, factor out the business logic to __INET_ECN_decapsulate() and
potentially perform the marking in INET_ECN_decapsulate().
This allows drivers to invoke __INET_ECN_decapsulate() and configure the
device.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:10 +0000 (08:53 +0000)]
vxlan: Export address checking functions
Drivers that support VxLAN offload need to be able to sanitize the
configuration of the VxLAN device and accept / reject its offload.
For example, mlxsw requires that the local IP of the VxLAN device be set
and that packets be flooded to unicast IP(s) and not to a multicast
group.
Expose the functions that perform such checks.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:08 +0000 (08:53 +0000)]
mlxsw: spectrum_router: Allow querying VR ID based on table ID
In the device, different VRFs (routing tables) are represented using
different virtual routers (VRs) and thus the kernel's table IDs are
mapped to VR IDs.
Allow internal users of the IP router to query the VR ID based on a
kernel table ID.
This is needed - for example - when configuring the underlay VR where
VxLAN encapsulated packets will undergo an L3 lookup. In this case, the
kernel's table ID is derived from the VxLAN device's configuration.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:07 +0000 (08:53 +0000)]
mlxsw: spectrum_router: Enable local routes promotion to perform NVE decap
When an NVE tunnel with an IP underlay (e.g., VxLAN) is configured the
local route to the tunnel's source IP needs to be promoted to perform
NVE decapsulation.
Expose an API in the unicast IP router to promote / demote local routes.
The case where a local route is configured after the creation of the NVE
tunnel will be handled in a subsequent patch in the set.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:05 +0000 (08:53 +0000)]
mlxsw: spectrum_fid: Add APIs to lookup FID without creating it
Current APIs only allow looking for a FID and creating it in case it
does not exist.
With VxLAN, in case the bridge to which the VxLAN device was enslaved
does not already have a corresponding FID, then it means that something
went wrong that we need to be aware of.
Add an API to look up a FID, but without creating it in order to catch
above-mentioned situation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 17 Oct 2018 08:53:03 +0000 (08:53 +0000)]
mlxsw: spectrum_fid: Allow setting and clearing NVE properties on FID
In the device, the VNI and the list of remote VTEPs a packet should be
flooded to is a property of the filtering identifier (FID).
During encapsulation, the VNI is taken from the FID the packet was
classified to. During decapsulation, the overlay packet is injected into
a bridge and classified to a FID based on the VNI it came with.
Allow NVE configuration for a FID. Currently, this is only supported
with 802.1D FIDs which are used for VLAN-unaware bridges. However, NVE
configuration is going to be supported with 802.1Q FIDs which is why the
related fields are placed in the common FID struct.
Since the device requires a 1:1 mapping between FID and VNI, the driver
maintains a hashtable keyed by VNI and checks if the VNI is already
associated with an existing FID.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 16 Oct 2018 19:31:35 +0000 (21:31 +0200)]
tcp, ulp: remove socket lock assertion on ULP cleanup
Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp()
where assertion sock_owned_by_me() failed. This happened through
inet_csk_prepare_forced_close() first releasing the socket lock,
then calling into tcp_done(newsk) which is called after the
inet_csk_prepare_forced_close() and therefore without the socket
lock held. The sock_owned_by_me() assertion can generally be
removed as the only place where tcp_cleanup_ulp() is called from
now is out of inet_csk_destroy_sock() -> sk->sk_prot->destroy()
where socket is in dead state and unreachable. Therefore, add a
comment why the check is not needed instead.
Fixes: 8b9088f806e1 ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 17:09:59 +0000 (10:09 -0700)]
Merge branch 'hns3-Some-cleanup-and-bugfix-for-desc-filling'
Yunsheng Lin says:
====================
Some cleanup and bugfix for desc filling
When retransmiting packets, skb_cow_head which is called in
hns3_set_tso may clone a new header. And driver will clear the
checksum of the header after doing DMA map, so HW will read the
old header whose L3 checksum is not cleared and calculate a
wrong L3 checksum.
Also When sending a big fragment using multiple buffer descriptor,
hns3 does one maping, but do multiple unmapping when tx is done,
which may cause unmapping problem.
This patchset does some cleanup before fixing the above problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:52 +0000 (19:58 +0800)]
net: hns3: fix for multiple unmapping DMA problem
When sending a big fragment using multiple buffer descriptor,
hns3 does one maping, but do multiple unmapping when tx is done,
which may cause unmapping problem.
To fix it, this patch makes sure the value of desc_cb.length of
the non-first bd is zero. If desc_cb.length is zero, we do not
unmap the buffer.
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:51 +0000 (19:58 +0800)]
net: hns3: rename hns_nic_dma_unmap
To keep symmetrical, this patch renames hns_nic_dma_unmap to
hns3_clear_desc.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:50 +0000 (19:58 +0800)]
net: hns3: add handling for big TX fragment
This patch unifies big tx fragment handling for tso and non-tso
case.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Tue, 16 Oct 2018 11:58:49 +0000 (19:58 +0800)]
net: hns3: move DMA map into hns3_fill_desc
To solve the L3 checksum error problem which happens when driver
does not clear L3 checksum, DMA map should be done after calling
skb_cow_head.
This patch moves DMA map into hns3_fill_desc to ensure that DMA
map is done after calling skb_cow_head.
Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Tue, 16 Oct 2018 11:58:48 +0000 (19:58 +0800)]
net: hns3: remove hns3_fill_desc_tso
This patch removes hns3_fill_desc_tso in preparation for
fixing some desc filling bug, because for tso or non-tso
case, we will use the unified hns3_fill_desc.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 17:04:29 +0000 (10:04 -0700)]
Merge branch 'qed-Align-PTT-and-add-various-link-modes'
Rahul Verma says:
====================
Align PTT and add various link modes.
This series aligns the ptt propagation as local ptt or global ptt.
Adds new transceiver modes, speed capabilities and board config,
which is utilized to display the enhanced link modes, media types
and speed. Enhances the link with detailed information.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:22 +0000 (03:59 -0700)]
qed: Prevent link getting down in case of autoneg-off.
Newly added link modes are required to be added
during setting link modes. If the new link mode
is not available during qed_set_link, it may cause
link getting down due to empty supported capability,
being passed to MFW, after setting autoneg off/on
with current/supported speed.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:21 +0000 (03:59 -0700)]
qede: Check available link modes before link set from ethtool.
Set link mode after checking available "supported" link caps
of the port.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:20 +0000 (03:59 -0700)]
qed: Add supported link and advertise link to display in ethtool.
Added transceiver type, speed capability and board types
in HSI, are utilizing to display the accurate link
information in ethtool.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:19 +0000 (03:59 -0700)]
qed: Added supported transceiver modes, speed capability and board config to HSI.
Added transceiver modes with different speed and media type,
speed capability and supported board types in HSI, which
will be utilizing to display correct specification of link
modes and speed type.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:18 +0000 (03:59 -0700)]
qed: Align local and global PTT to propagate through the APIs.
Align the use of local PTT to propagate through the qed_mcp* API's.
Global ptt should not be used.
Register access should be done through layers. Register address is
mapped into a PTT, PF translation table. Several interface functions
require a PTT to direct read/write into register. There is a pool of
PTT maintained, and several PTT are used simultaneously to access
device registers in different flows. Same PTT should not be used in
flows that can run concurrently.
To avoid running out of PTT resources, too many PTT should not be
acquired without releasing them. Every PF has a global PTT, which is
used throughout the life of PF, in most important flows for register
access. Generic functions acquire the PTT locally and release after
the use. This patch aligns the use of Global PTT and Local PTT
accordingly.
Signed-off-by: Rahul Verma <rahul.verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 16 Oct 2018 10:50:56 +0000 (18:50 +0800)]
net: aquantia: make function aq_fw2x_update_stats static
Fixes the following sparse warning:
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c:282:5: warning:
symbol 'aq_fw2x_update_stats' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 07:14:18 +0000 (00:14 -0700)]
Merge branch 'net-Kernel-side-filtering-for-route-dumps'
David Ahern says:
====================
net: Kernel side filtering for route dumps
Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.
iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.
v2
- better handling of requests for a specific table. Rather than walking
the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
it is returned even when the dump returns nothing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:51 +0000 (18:56 -0700)]
net/ipv4: Bail early if user only wants prefix entries
Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
flag is set in the dump request, just return.
In the process of this change, move the CLONE check to use the new
filter flags.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:50 +0000 (18:56 -0700)]
net/ipv6: Bail early if user only wants cloned entries
Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
requests a route dump for only cloned entries, no sense walking the FIB
and returning everything.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:49 +0000 (18:56 -0700)]
net/mpls: Handle kernel side filtering of route dumps
Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:48 +0000 (18:56 -0700)]
net: Enable kernel side filtering of route dumps
Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.
ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.
Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:47 +0000 (18:56 -0700)]
net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
Implement kernel side filtering of routes by egress device index and
table id. If the table id is given in the filter, lookup table and
call mr_table_dump directly for it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:46 +0000 (18:56 -0700)]
ipmr: Refactor mr_rtm_dumproute
Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
mr_table_dump for dumps by specific table id.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:45 +0000 (18:56 -0700)]
net/mpls: Plumb support for filtering route dumps
Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:44 +0000 (18:56 -0700)]
net/ipv6: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device
index, protocol, and route type. If the table id is given in the filter,
lookup the table and call fib6_dump_table directly for it.
Move the existing route flags check for prefix only routes to the new
filter.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:43 +0000 (18:56 -0700)]
net/ipv4: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:42 +0000 (18:56 -0700)]
net: Add struct for fib dump filter
Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.
Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:41 +0000 (18:56 -0700)]
netlink: Add answer_flags to netlink_callback
With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.
This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.
The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 06:21:07 +0000 (23:21 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
sk_msg BPF integration for the latter, from Daniel and John.
2) Enable BPF syscall side to indicate for maps that they do not support
a map lookup operation as opposed to just missing key, from Prashant.
3) Add bpftool map create command which after map creation pins the
map into bpf fs for further processing, from Jakub.
4) Add bpftool support for attaching programs to maps allowing sock_map
and sock_hash to be used from bpftool, from John.
5) Improve syscall BPF map update/delete path for map-in-map types to
wait a RCU grace period for pending references to complete, from Daniel.
6) Couple of follow-up fixes for the BPF socket lookup to get it
enabled also when IPv6 is compiled as a module, from Joe.
7) Fix a generic-XDP bug to handle the case when the Ethernet header
was mangled and thus update skb's protocol and data, from Jesper.
8) Add a missing BTF header length check between header copies from
user space, from Wenwen.
9) Minor fixups in libbpf to use __u32 instead u32 types and include
proper perf_event.h uapi header instead of perf internal one, from Yonghong.
10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
to bpftool's build, from Jiri.
11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 15 Oct 2018 19:25:13 +0000 (21:25 +0200)]
net: phy: merge phy_start_aneg and phy_start_aneg_priv
After commit
9f2959b6b52d ("net: phy: improve handling delayed work")
the sync parameter isn't needed any longer in phy_start_aneg_priv().
This allows to merge phy_start_aneg() and phy_start_aneg_priv().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Mon, 15 Oct 2018 19:06:15 +0000 (19:06 +0000)]
hv_netvsc: fix vf serial matching with pci slot info
The VF device's serial number is saved as a string in PCI slot's
kobj name, not the slot->number. This patch corrects the netvsc
driver, so the VF device can be successfully paired with synthetic
NIC.
Fixes: 00d7ddba1143 ("hv_netvsc: pair VF based on serial number")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:56:42 +0000 (22:56 -0700)]
Merge branch 'tcp-second-round-for-EDT-conversion'
Eric Dumazet says:
====================
tcp: second round for EDT conversion
First round of EDT patches left TCP stack in a non optimal state.
- High speed flows suffered from loss of performance, addressed
by the first patch of this series.
- Second patch brings pacing to the current state of networking,
since we now reach ~100 Gbit on a single TCP flow.
- Third patch implements a mitigation for scheduling delays,
like the one we did in sch_fq in the past.
- Fourth patch removes one special case in sch_fq for ACK packets.
- Fifth patch removes a serious perfomance cost for TCP internal
pacing. We should setup the high resolution timer only if
really needed.
- Sixth patch fixes a typo in BBR.
- Last patch is one minor change in cdg congestion control.
Neal Cardwell also has a patch series fixing BBR after
EDT adoption.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:58 +0000 (09:37 -0700)]
tcp: cdg: use tcp high resolution clock cache
We store in tcp socket a cache of most recent high resolution
clock, there is no need to call local_clock() again, since
this cache is good enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Mon, 15 Oct 2018 16:37:57 +0000 (09:37 -0700)]
tcp_bbr: fix typo in bbr_pacing_margin_percent
There was a typo in this parameter name.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:56 +0000 (09:37 -0700)]
tcp: optimize tcp internal pacing
When TCP implements its own pacing (when no fq packet scheduler is used),
it is arming high resolution timer after a packet is sent.
But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.
This leads to extra overhead (high number of IRQ)
Now tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can setup the timer only when a packet is about to be sent,
and if tcp_wstamp_ns is in the future.
This leads to a ~10% performance increase in TCP_RR workloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:55 +0000 (09:37 -0700)]
net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:54 +0000 (09:37 -0700)]
tcp: mitigate scheduling jitter in EDT pacing model
In commit
fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.
This patch does the same in TCP stack, now it is using EDT model.
Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.
This uses the same strategy than the above commit, allowing
a time credit of half the packet currently sent.
Consider following case :
An skb is sent, after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250usec
This is like having a token bucket with a depth of half
an skb.
Tested:
tc qdisc replace dev eth0 root pfifo_fast
Before
netperf -P0 -H remote -- -q
1000000000 # 8000Mbit
540000 262144 262144 10.00 7710.43
After :
netperf -P0 -H remote -- -q
1000000000 # 8000 Mbit
540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:53 +0000 (09:37 -0700)]
net: extend sk_pacing_rate to unsigned long
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0
1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,
f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:
256699822585 segs_out:
177279177
segs_in:
3916318 data_segs_out:
177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:
2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:52 +0000 (09:37 -0700)]
tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.
This causes major regressions at high speed flows.
Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have slow high-resolution
kernel time service.
tcp_wstamp_ns is only updated when a packet is sent.
Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.
Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li RongQing [Mon, 15 Oct 2018 11:00:31 +0000 (19:00 +0800)]
net: bridge: fix a possible memory leak in __vlan_add
After per-port vlan stats, vlan stats should be released
when fail to add vlan
Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats")
CC: bridge@lists.linux-foundation.org
cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Mon, 15 Oct 2018 10:31:03 +0000 (11:31 +0100)]
rxrpc: Add /proc/net/rxrpc/peers to display peer list
Add /proc/net/rxrpc/peers to display the list of peers currently active.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Mon, 15 Oct 2018 03:07:16 +0000 (03:07 +0000)]
fore200e: fix missing unlock on error in bsq_audit()
Add the missing unlock before return from function bsq_audit()
in the error handling case.
Fixes: 1d9d8be91788 ("fore200e: check for dma mapping failures")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:44:33 +0000 (22:44 -0700)]
Merge branch 'bnxt_en-Add-support-for-new-57500-chips'
Michael Chan says:
====================
bnxt_en: Add support for new 57500 chips.
This patch-set is larger than normal because I wanted a complete series
to add basic support for the new 57500 chips. The new chips have the
following main differences compared to legacy chips:
1. Requires the PF driver to allocate DMA context memory as a backing
store.
2. New NQ (notification queue) for interrupt events.
3. One or more CP rings can be associated with an NQ.
4. 64-bit doorbells.
Most other structures and firmware APIs are compatible with legacy
devices with some exceptions. For example, ring groups are no longer
used and RSS table format has changed.
The patch-set includes the usual firmware spec. update, some refactoring
and restructuring, and adding the new code to add basic support for the
new class of devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:59 +0000 (07:02 -0400)]
bnxt_en: Add PCI ID for BCM57508 device.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:58 +0000 (07:02 -0400)]
bnxt_en: Add new NAPI poll function for 57500 chips.
Add a new poll function that polls for NQ events. If the NQ event is
a CQ notification, we locate the CP ring from the cq_handle and call
__bnxt_poll_work() to handle RX/TX events on the CP ring.
Add a new has_more_work field in struct bnxt_cp_ring_info to indicate
budget has been reached. __bnxt_poll_cqs_done() is called to update or
ARM the CP rings if budget has not been reached or not. If budget
has been reached, the next bnxt_poll_p5() call will continue to poll
from the CQ rings directly. Otherwise, the NQ will be ARMed for the
next IRQ.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:57 +0000 (07:02 -0400)]
bnxt_en: Refactor bnxt_poll_work().
Separate the CP ring polling logic in bnxt_poll_work() into 2 separate
functions __bnxt_poll_work() and __bnxt_poll_work_done(). Since the logic
is separated, we need to add tx_pkts and events fields to struct bnxt_napi
to keep track of the events to handle between the 2 functions. We also
add had_work_done field to struct bnxt_cp_ring_info to indicate whether
some work was performed on the CP ring.
This is needed to better support the 57500 chips. We need to poll up to
2 separate CP rings before we update or ARM the CP rings on the 57500 chips.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:56 +0000 (07:02 -0400)]
bnxt_en: Add coalescing setup for 57500 chips.
On legacy chips, the CP ring may be shared between RX and TX and so only
setup the RX coalescing parameters in such a case. On 57500 chips, we
always have a dedicated CP ring for TX so we can always set up the
TX coalescing parameters in bnxt_hwrm_set_coal().
Also, the min_timer coalescing parameter applies to the NQ on the new
chips and a separate firmware call needs to be made to set it up.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:55 +0000 (07:02 -0400)]
bnxt_en: Use bnxt_cp_ring_info struct pointer as parameter for RX path.
In the RX code path, we current use the bnxt_napi struct pointer to
identify the associated RX/CP rings. Change it to use the struct
bnxt_cp_ring_info pointer instead since there are now up to 2
CP rings per MSIX.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:54 +0000 (07:02 -0400)]
bnxt_en: Add RSS support for 57500 chips.
RSS context allocation and RSS indirection table setup are very different
on the new chip. Refactor bnxt_setup_vnic() to call 2 different functions
to set up RSS for the vnic based on chip type. On the new chip, the
number of RSS contexts and the indirection table size depends on the
number of RX rings. Each indirection table entry is also different
on the new chip since ring groups are no longer used.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:53 +0000 (07:02 -0400)]
bnxt_en: Increase RSS context array count and skip ring groups on 57500 chips.
On the new 57500 chips, we need to allocate one RSS context for every
64 RX rings. In previous chips, only one RSS context per vnic is
required regardless of the number of RX rings. So increase the max
RSS context array count to 8.
Hardware ring groups are not used on the new chips. Note that the
software ring group structure is still maintained in the driver to
keep track of the rings associated with the vnic.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:52 +0000 (07:02 -0400)]
bnxt_en: Allocate/Free CP rings for 57500 series chips.
On the new 57500 chips, we allocate/free one CP ring for each RX ring or
TX ring separately. Using separate CP rings for RX/TX is an improvement
as TX events will no longer be stuck behind RX events.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:51 +0000 (07:02 -0400)]
bnxt_en: Modify bnxt_ring_alloc_send_msg() to support 57500 chips.
Firmware ring allocation semantics are slightly different for most
ring types on 57500 chips. Allocation/deallocation for NQ rings are
also added for the new chips.
A CP ring handle is also added so that from the NQ interrupt event,
we can locate the CP ring.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:50 +0000 (07:02 -0400)]
bnxt_en: Add helper functions to get firmware CP ring ID.
On the new 57500 chips, getting the associated CP ring ID associated with
an RX ring or TX ring is different than before. On the legacy chips,
we find the associated ring group and look up the CP ring ID. On the
57500 chips, each RX ring and TX ring has a dedicated CP ring even if
they share the MSIX. Use these helper functions at appropriate places
to get the CP ring ID.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:49 +0000 (07:02 -0400)]
bnxt_en: Allocate completion ring structures for 57500 series chips.
On 57500 chips, the original bnxt_cp_ring_info struct now refers to the
NQ. bp->cp_nr_rings refer to the number of NQs on 57500 chips. There
are now 2 pointers for the CP rings associated with RX and TX rings.
Modify bnxt_alloc_cp_rings() and bnxt_free_cp_rings() accordingly.
With multiple CP rings per NAPI, we need to add a pointer in
bnxt_cp_ring_info struct to point back to the bnxt_napi struct.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:48 +0000 (07:02 -0400)]
bnxt_en: Modify the ring reservation functions for 57500 series chips.
The ring reservation functions have to be modified for P5 chips in the
following ways:
- bnxt_cp_ring_info structs map to internal NQs as well as CP rings.
- Ring groups are not used.
- 1 CP ring must be available for each RX or TX ring.
- number of RSS contexts to reserve is multiples of 64 RX rings.
- RFS currently not supported.
Also, RX AGG rings are only used for jumbo frames, so we need to
unconditionally call bnxt_reserve_rings() in __bnxt_open_nic()
to see if we need to reserve AGG rings in case MTU has changed.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:47 +0000 (07:02 -0400)]
bnxt_en: Adjust MSIX and ring groups for 57500 series chips.
Store the maximum MSIX capability in PCIe config. space earlier. When
we call firmware to query capability, we need to compare the PCIe
MSIX max count with the firmware count and use the smaller one as
the MSIX count for 57500 (P5) chips.
The new chips don't use ring groups. But previous chips do and
the existing logic limits the available rings based on resource
calculations including ring groups. Setting the max ring groups to
the max rx rings will work on the new chips without changing the
existing logic.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:46 +0000 (07:02 -0400)]
bnxt_en: Re-structure doorbells.
The 57500 series chips have a new 64-bit doorbell format. Use a new
bnxt_db_info structure to unify the new and the old 32-bit doorbells.
Add a new bnxt_set_db() function to set up the doorbell addreses and
doorbell keys ahead of time. Modify and introduce new doorbell
helpers to help abstract and unify the old and new doorbells.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:45 +0000 (07:02 -0400)]
bnxt_en: Add 57500 new chip ID and basic structures.
57500 series is a new chip class (P5) that requires some driver changes
in the next several patches. This adds basic chip ID, doorbells, and
the notification queue (NQ) structures. Each MSIX is associated with an
NQ instead of a CP ring in legacy chips. Each NQ has up to 2 associated
CP rings for RX and TX. The same bnxt_cp_ring_info struct will be used
for the NQ.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:44 +0000 (07:02 -0400)]
bnxt_en: Configure context memory on new devices.
Call firmware to configure the DMA addresses of all context memory
pages on new devices requiring context memory.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:43 +0000 (07:02 -0400)]
bnxt_en: Check context memory requirements from firmware.
New device requires host context memory as a backing store. Call
firmware to check for context memory requirements and store the
parameters. Allocate host pages accordingly.
We also need to move the call bnxt_hwrm_queue_qportcfg() earlier
so that all the supported hardware queues and the IDs are known
before checking and allocating context memory.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:42 +0000 (07:02 -0400)]
bnxt_en: Add new flags to setup new page table PTE bits on newer devices.
Newer chips require the PTU_PTE_VALID bit to be set for every page
table entry for context memory and rings. Additional bits are also
required for page table entries for all rings. Add a flags field to
bnxt_ring_mem_info struct to specify these additional bits to be used
when setting up the pages tables as needed.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:41 +0000 (07:02 -0400)]
bnxt_en: Refactor bnxt_ring_struct.
Move the DMA page table and vmem fields in bnxt_ring_struct to a new
bnxt_ring_mem_info struct. This will allow context memory management
for a new device to re-use some of the existing infrastructure.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:40 +0000 (07:02 -0400)]
bnxt_en: Update interrupt coalescing logic.
New firmware spec. allows interrupt coalescing parameters, such as
maximums, timer units, supported features to be queried. Update
the driver to make use of the new call to query these parameters
and provide the legacy defaults if the call is not available.
Replace the hard-coded values with these parameters.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:39 +0000 (07:02 -0400)]
bnxt_en: Add maximum extended request length fw message support.
Support the max_ext_req_len field from the HWRM_VER_GET_RESPONSE.
If this field is valid and greater than the mailbox size, use the
short command format to send firmware messages greater than the
mailbox size. Newer devices use this method to send larger messages
to the firmware.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:38 +0000 (07:02 -0400)]
bnxt_en: Add additional extended port statistics.
Latest firmware spec. has some additional rx extended port stats and new
tx extended port stats added. We now need to check the size of the
returned rx and tx extended stats and determine how many counters are
valid. New counters added include CoS byte and packet counts for rx
and tx.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:37 +0000 (07:02 -0400)]
bnxt_en: Update firmware interface spec. to 1.10.0.3.
Among the new changes are trusted VF support, 200Gbps support, and new
API to dump ring information on the new chips.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:37:28 +0000 (22:37 -0700)]
Merge branch 'selftests-pmtu-Add-test-choice-and-captures'
Stefano Brivio says:
====================
selftests: pmtu: Add test choice and captures
This series adds a couple of features useful for debugging: 1/2
allows selecting single tests and 2/2 adds optional traffic
captures.
Semantics for current invocation of test script are preserved.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>