net: dsa: Make deferred_xmit private to sja1105

There are 3 things that are wrong with the DSA deferred xmit mechanism:

1. Its introduction has made the DSA hotpath ever so slightly more
   inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
   to be initialized to false for every transmitted frame, in order to
   figure out whether the driver requested deferral or not (a very rare
   occasion, rare even for the only driver that does use this mechanism:
   sja1105). That was necessary to avoid kfree_skb from freeing the skb.

2. Because L2 PTP is a link-local protocol like STP, it requires
   management routes and deferred xmit with this switch. But as opposed
   to STP, the deferred work mechanism needs to schedule the packet
   rather quickly for the TX timestamp to be collected in time and sent
   to user space. But there is no provision for controlling the
   scheduling priority of this deferred xmit workqueue. Too bad this is
   a rather specific requirement for a feature that nobody else uses
   (more below).

3. Perhaps most importantly, it makes the DSA core adhere a bit too
   much to the NXP company-wide policy "Innovate Where It Doesn't
   Matter". The sja1105 is probably the only DSA switch that requires
   some frames sent from the CPU to be routed to the slave port via an
   out-of-band configuration (register write) rather than in-band (DSA
   tag). And there are indeed very good reasons to not want to do that:
   if that out-of-band register is at the other end of a slow bus such
   as SPI, then you limit that Ethernet flow's throughput to effectively
   the throughput of the SPI bus. So hardware vendors should definitely
   not be encouraged to design this way. We do _not_ want more
   widespread use of this mechanism.

Luckily we have a solution for each of the 3 issues:

For 1, we can just remove that variable in the skb->cb and counteract
the effect of kfree_skb with skb_get, much to the same effect. The
advantage, of course, being that anybody who doesn't use deferred xmit
doesn't need to do any extra operation in the hotpath.
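
The reference-count idea behind fix 1 can be modeled in userspace. This is
a minimal sketch, not the kernel code: struct fake_skb, fake_skb_get(),
fake_kfree_skb() and defer_xmit() are hypothetical stand-ins for
struct sk_buff, skb_get(), kfree_skb() and the driver's deferral path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct sk_buff: only the refcount matters. */
struct fake_skb {
	int users;   /* number of outstanding references */
	bool freed;  /* set once the last reference is dropped */
};

/* Models skb_get(): take one extra reference and return the buffer. */
static struct fake_skb *fake_skb_get(struct fake_skb *skb)
{
	skb->users++;
	return skb;
}

/* Models kfree_skb(): drop one reference, release only when none remain. */
static void fake_kfree_skb(struct fake_skb *skb)
{
	if (--skb->users == 0)
		skb->freed = true;
}

/* Driver side: hold the frame for the deferred worker (assumed flow). */
static struct fake_skb *defer_xmit(struct fake_skb *skb)
{
	return fake_skb_get(skb);
}
```

With the extra reference held, the core's unconditional free only drops
the count, and the frame goes away when the worker drops the last one. No
per-frame flag has to be initialized by drivers that never defer.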

For 2, we can create a kernel thread for each port's deferred xmit work.
If the user switch ports are named swp0, swp1, swp2, the kernel threads
will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
character length limit on kernel thread names). With this, the user can
change the scheduling priority with chrt $(pidof swp2_xmit).
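
The naming scheme and its length limit can be sketched as follows. This
is an illustration only: COMM_LEN and xmit_thread_name() are hypothetical
names, and the limit is assumed to be 15 visible characters plus the
terminating NUL.

```c
#include <stdbool.h>
#include <stdio.h>

#define COMM_LEN 16  /* assumed: 15 characters plus the terminating NUL */

/* Builds "<port>_xmit" into out; returns true if the full name fit. */
static bool xmit_thread_name(const char *port, char out[COMM_LEN])
{
	int n = snprintf(out, COMM_LEN, "%s_xmit", port);

	return n >= 0 && n < COMM_LEN;
}
```

Under that assumption, any port name longer than 10 characters yields a
truncated thread name, because "_xmit" takes 5 of the 15 characters.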

For 3, we can actually move the entire implementation to the sja1105
driver.

So this patch deletes the generic implementation from the DSA core and
adds a new one, more adequate to the requirements of PTP TX
timestamping, in sja1105_main.c.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: sja1105: Always send through management routes in slot 0

I finally found out how the 4 management route slots are supposed to
be used, but... it's not worth it.

The description from the comment I've just deleted in this commit is
still true: when more than 1 management slot is active at the same time,
the switch will match frames incoming [from the CPU port] on the lowest
numbered management slot that matches the frame's DMAC.

My issue was that one was not supposed to statically assign each port a
slot. Yes, there are 4 slots and also 4 non-CPU ports, but that is a
mere coincidence.

Instead, the switch can be used like this: every management frame gets a
slot at the right of the most recently assigned slot:

Send mgmt frame 1 through S0:    S0 x  x  x
Send mgmt frame 2 through S1:    S0 S1 x  x
Send mgmt frame 3 through S2:    S0 S1 S2 x
Send mgmt frame 4 through S3:    S0 S1 S2 S3

The difference compared to the old usage is that the transmission of
frames 1-4 doesn't need to wait until the completion of the management
route. It is safe to use a slot to the right of the most recently used
one, because by protocol nobody will program a slot to your left and
"steal" your route towards the correct egress port.

So there is a potential throughput benefit here.

But mgmt frame 5 has no more free slot to use, so it has to wait until
_all_ of S0, S1, S2, S3 are full, in order to use S0 again.

And that's actually exactly the problem: I was looking for something
that would bring more predictable transmission latency, but this is
exactly the opposite: 3 out of 4 frames would be transmitted quicker,
but the 4th would draw the short straw and have a worse worst-case
latency than before.

Useless.
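
The slot discipline described above can be modeled like this. It is a
sketch of the scheme as explained in this message, not driver code:
struct mgmt_slots, mgmt_slot_alloc() and mgmt_slot_complete() are
hypothetical names.

```c
#define NUM_SLOTS 4

struct mgmt_slots {
	int next;     /* slot to the right of the most recently assigned one */
	int pending;  /* routes programmed but not yet completed */
};

/*
 * Returns the slot to use for the next management frame, or -1 when all
 * slots were handed out and at least one route is still pending, i.e.
 * the frame has to wait before S0 can be reused.
 */
static int mgmt_slot_alloc(struct mgmt_slots *s)
{
	if (s->next == NUM_SLOTS) {
		if (s->pending > 0)
			return -1;
		s->next = 0;
	}
	s->pending++;
	return s->next++;
}

/* Marks one previously programmed management route as completed. */
static void mgmt_slot_complete(struct mgmt_slots *s)
{
	if (s->pending > 0)
		s->pending--;
}
```

Frames 1-4 get a slot immediately, while the fifth keeps getting -1 until
every earlier route has completed, which is the worst-case latency
problem described above.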

Things are made even worse by PTP TX timestamping, which is something I
won't go deeply into here. Suffice to say that the fact there is a
driver-level lock on the SPI bus offsets any potential throughput gains
that parallelism might bring.

So there's no going back to the multi-slot scheme, remove the
"mgmt_slot" variable from sja1105_port and the dummy static assignment
made at probe time.

While passing by, also remove the assignment to casc_port altogether.
Don't pretend that we support cascaded setups.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Fix-10G-PHY-interface-types'

Russell King says:

====================
Fix 10G PHY interface types

Recent discussion has revealed that our current usage of the 10GKR
phy_interface_t is not correct. This is based on a misunderstanding
caused in part by the various specifications being difficult to
obtain. Now that a better understanding has been reached, we ought
to correct this.

This series introduces PHY_INTERFACE_MODE_10GBASER to replace the
existing usage of 10GKR mode, and documents their differences in the
phylib documentation. Then switch PHY, SFP/phylink, the Marvell
PP2 network driver, and its associated comphy driver over to use
the correct interface mode. None of the existing platform usage
was actually using 10GBASE-KR.

In order to maintain compatibility with existing DT files, arrange
for the Marvell PP2 driver to rewrite the phy interface mode; this
allows other drivers to adopt correct behaviour w.r.t whether the
10G connection conforms to the backplane 10GBASE-KR protocol vs
normal 10GBASE-R protocol.

After applying these locally to net-next I've validated that the
only places which mention the old PHY_INTERFACE_MODE_10GKR
definition are:

Documentation/networking/phy.rst:``PHY_INTERFACE_MODE_10GKR``
drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c:        if (phy_mode == PHY_INTERFACE_MODE_10GKR)
drivers/net/phy/aquantia_main.c:                phydev->interface = PHY_INTERFACE_MODE_10GKR;
drivers/net/phy/aquantia_main.c:            phydev->interface != PHY_INTERFACE_MODE_10GKR &&
include/linux/phy.h:    PHY_INTERFACE_MODE_10GKR,
include/linux/phy.h:    case PHY_INTERFACE_MODE_10GKR:

which is as expected.  The only users of "10gbase-kr" in DT are:

arch/arm64/boot/dts/marvell/armada-7040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-clearfog-gt-8k.dts:     phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-db.dts: phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin-singleshot.dts:   phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin-singleshot.dts:   phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/cn9130-db.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/cn9131-db.dts:      phy-mode = "10gbase-kr";
arch/arm64/boot/dts/marvell/cn9132-db.dts:      phy-mode = "10gbase-kr";

which all use the mvpp2 driver, and these will be updated in a
separate patch to be submitted in the following kernel cycle.

v2: add comment to mvpp2 driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: switch to using PHY_INTERFACE_MODE_10GBASER rather than 10GKR

Switch network drivers, phy drivers, and SFP/phylink over to use the
more correct 10GBASE-R, rather than 10GBASE-KR. 10GBASE-KR is backplane
ethernet, which is 10GBASE-R with autonegotiation on top, which our
current usage on the affected platforms does not have.

The only remaining user of PHY_INTERFACE_MODE_10GKR is the Aquantia
PHY, which has a separate mode for 10GBASE-KR.

For Marvell mvpp2, we detect 10GBASE-KR, and rewrite it to 10GBASE-R
for compatibility with existing DT - this is the only network driver
at present that makes use of PHY_INTERFACE_MODE_10GKR.
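
The compatibility rewrite for mvpp2 amounts to something like the sketch
below. It is illustrative only: enum phy_mode and fixup_phy_mode() are
hypothetical and stand in for phy_interface_t and the check in
mvpp2_main.c.

```c
/* Hypothetical subset of phy_interface_t, for illustration only. */
enum phy_mode {
	MODE_SGMII,
	MODE_10GKR,
	MODE_10GBASER,
};

/*
 * Existing device trees say "10gbase-kr" although the link is plain
 * 10GBASE-R, so the old value is rewritten before it is used further.
 */
static enum phy_mode fixup_phy_mode(enum phy_mode mode)
{
	if (mode == MODE_10GKR)
		return MODE_10GBASER;
	return mode;
}
```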

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add PHY_INTERFACE_MODE_10GBASER

Recent discussion has revealed that the use of PHY_INTERFACE_MODE_10GKR
is incorrect. Add a 10GBASE-R definition, document both the -R and -KR
versions, and the fact that 10GKR was used incorrectly.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ionic-add-sriov-support'

Shannon Nelson says:

====================
ionic: add sriov support

Set up the basic support for enabling SR-IOV devices in the
ionic driver.  Since most of the management work happens in
the NIC firmware, the driver becomes mostly a pass-through
for the network stack commands that want to control and
configure the VFs.

v4: changed "vf too big" checks to use pci_num_vf()
changed from vf[] array of pointers of individually allocated
  vf structs to single allocated vfs[] array of vf structs
added clean up of vfs[] on probe fail
added setup for vf stats dma

v3: added check in probe for pre-existing VFs
split out the alloc and dealloc of vf structs to better deal
  with pre-existing VFs (left enabled on remove)
restored the checks for vf too big because of a potential
  case where VFs are already enabled but driver failed to
  alloc the vf structs

v2: use pci_num_vf() and kcalloc()
remove checks for vf too big
add locking for the VF operations
disable VFs in ionic_remove() if they are still running
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: support sr-iov operations

Add the netdev ops for managing VFs. Since most of the
management work happens in the NIC firmware, the driver becomes
mostly a pass-through for the network stack commands that want
to control and configure the VFs.

We also tweak ionic_station_set() a little to allow for
the VFs that start off with a zero'd mac address.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: ionic_if bits for sr-iov support

Adds new AdminQ calls and their related structs for
supporting PF controls on VFs:
CMD_OPCODE_VF_GETATTR
CMD_OPCODE_VF_SETATTR

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: sxgbe: Rename Samsung to lowercase

Fix up inconsistent usage of upper and lowercase letters in "Samsung"
name.

"SAMSUNG" is not an abbreviation but a regular trademarked name.
Therefore it should be written with lowercase letters starting with
a capital letter.

Although advertisement materials usually use uppercase "SAMSUNG", the
lowercase version is used in all legal aspects (e.g. on Wikipedia and in
privacy/legal statements on
https://www.samsung.com/semiconductor/privacy-global/).

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-phy-switch-to-using-fwnode_gpiod_get_index'

Dmitry Torokhov says:

====================
net: phy: switch to using fwnode_gpiod_get_index

This series switches phy drivers from using fwnode_get_named_gpiod() and
gpiod_get_from_of_node(), which are scheduled to be removed, to
fwnode_gpiod_get_index(), which behaves more like the standard
gpiod_get_index() and will potentially handle secondary software
nodes in cases where we need to augment platform firmware.

Now that the dependencies have been merged into networking tree the
patches can be applied there as well.

v3:
        - rebased on top of net-next

v2:
        - rebased on top of Linus' W devel branch
        - added David's ACKs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: fixed_phy: switch to using fwnode_gpiod_get_index

gpiod_get_from_of_node() is being retired in favor of
[devm_]fwnode_gpiod_get_index(), which behaves similarly to
[devm_]gpiod_get_index(), but can work with an arbitrary firmware node. It
will also be able to support secondary software nodes.

Let's switch this driver over.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: fixed_phy: fix use-after-free when checking link GPIO

If we fail to locate GPIO for any reason other than deferral or
not-found-GPIO, we try to print device tree node info, however it might
be freed already as we called of_node_put() on it.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylink: switch to using fwnode_gpiod_get_index()

Instead of fwnode_get_named_gpiod() that I plan to hide away, let's use
the new fwnode_gpiod_get_index() that mimics gpiod_get_index(), but
works with an arbitrary firmware node.

Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: vsc73xx: Remove dependency on CONFIG_OF

There is no build time dependency on CONFIG_OF, but we do need to make
sure we gate the initialization of the gpio_chip::of_node member with a
proper check on CONFIG_OF_GPIO. This enables the driver to build on
platforms that do not have CONFIG_OF enabled.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'WireGuard-bug-fixes-and-cleanups'

Jason A. Donenfeld says:

====================
WireGuard bug fixes and cleanups

I've been working through some personal notes and also the whole git
repo history of the out-of-tree module, looking for places where
tradeoffs were made (and subsequently forgotten about) for old kernels.
The first two patches in this series clean up those. The first one does
so in the self-tests and self-test harness, where we're now able to
expand test coverage by a bit, and we're now cooking away tests on every
commit to both the wireguard-linux repo and to net-next. The second one
removes a workaround for a skbuff.h bug that was fixed long ago.
Finally, the last patch in the series fixes a bug unearthed by newer
Qualcomm chipsets running the rmnet_perf driver, which does UDP GRO.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

wireguard: socket: mark skbs as not on list when receiving via gro

Certain drivers will pass gro skbs to udp, at which point the udp driver
simply iterates through them and passes them off to encap_rcv, which is
where we pick up. At the moment, we're not attempting to coalesce these
into bundles, but we also don't want to wind up having cascaded lists of
skbs treated separately. The right behavior here, then, is to just mark
each incoming one as not on a list. This can be seen in practice, for
example, with Qualcomm's rmnet_perf driver.
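
A userspace model of the fix, under stated assumptions: struct rx_buf,
mark_not_on_list(), chain_length() and receive_one() are hypothetical
names, and "not on a list" is assumed to mean a cleared next pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb that may arrive linked to others. */
struct rx_buf {
	struct rx_buf *next;
};

/* Detach the buffer from whatever list it arrived on. */
static void mark_not_on_list(struct rx_buf *buf)
{
	buf->next = NULL;
}

/* Number of buffers reachable from buf by following next pointers. */
static int chain_length(const struct rx_buf *buf)
{
	int n = 0;

	for (; buf; buf = buf->next)
		n++;
	return n;
}

/* Receive path: each incoming buffer is handled strictly on its own. */
static int receive_one(struct rx_buf *buf)
{
	mark_not_on_list(buf);
	return chain_length(buf);
}
```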

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

wireguard: queueing: do not account for pfmemalloc when clearing skb header

Before 8b7008620b84 ("net: Don't copy pfmemalloc flag in
__copy_skb_header()"), the pfmemalloc flag used to be between headers_start and
headers_end, which is a region we clear when preparing the packet for
encryption/decryption. This is a parameter we certainly want to
preserve, which is why 8b7008620b84 moved it out of there. The code here
was written in a world before 8b7008620b84, though, where we had to
manually account for it. This commit brings things up to speed.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

wireguard: selftests: remove ancient kernel compatibility code

Quite a bit of the test suite was designed to work with ancient kernels.
Thankfully we no longer have to deal with this. This commit updates
things that we can finally update and removes things that we can finally
remove, to avoid the build-up of the last several years as a result of
having to support ancient kernels. We can finally rely on
suppress_prefixlength being available. On the build side of things, the no-PIE
hack is no longer required, and we can bump some of the tools, repair
our m68k and i686-kvm support, and get better coverage of the static
branches used in the crypto lib and in udp_tunnel.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2020-01-04

This series contains updates to the igc driver only.

Sasha does some housekeeping on the igc driver to remove forward
declarations that are not needed after re-arranging several functions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Remove no need declaration of the igc_sw_init

We want to avoid forward declarations of functions if possible.
Rearrange the igc_sw_init function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_write_itr

We want to avoid forward declarations of functions if possible.
Rearrange the igc_write_itr function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_assign_vector

We want to avoid forward declarations of functions if possible.
Rearrange the igc_assign_vector function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_free_q_vector

We want to avoid forward declarations of functions if possible.
Rearrange the igc_free_q_vector function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_free_q_vectors

We want to avoid forward declarations of functions if possible.
Rearrange the igc_free_q_vectors function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_irq_disable

We want to avoid forward declarations of functions if possible.
Rearrange the igc_irq_disable function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_irq_enable

We want to avoid forward declarations of functions if possible.
Rearrange the igc_irq_enable function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_configure_msix

We want to avoid forward declarations of functions if possible.
Rearrange the igc_configure_msix function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_set_rx_mode

We want to avoid forward declarations of functions if possible.
Rearrange the igc_set_rx_mode function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_set_interrupt_capability

We want to avoid forward declarations of functions if possible.
Rearrange the igc_set_interrupt_capability function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_alloc_mapped_page

We want to avoid forward declarations of functions if possible.
Rearrange the igc_alloc_mapped_page function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_configure

We want to avoid forward declarations of functions if possible.
Rearrange the igc_configure function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_set_default_mac_filter

We want to avoid forward declarations of functions if possible.
Rearrange the igc_set_default_mac_filter function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_power_down_link

We want to avoid forward declarations of functions if possible.
Rearrange the igc_power_down_link function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove no need declaration of the igc_clean_tx_ring

We want to avoid forward declarations of functions if possible.
Rearrange the igc_clean_tx_ring function implementation.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2020-01-03

This series contains updates to the ice driver only.

Brett adds support for UDP segmentation offload (USO) based on the work
Alex Duyck did for other Intel drivers. Refactored how the VF sets
spoof checking to resolve a couple of issues found in
ice_set_vf_spoofchk().  Adds the ability to keep track of the dflt_vsi
(default VSI), since we cannot have more than one default VSI.  Adds a
macro for the "for loop" used repeatedly in the code.  Cleaned
up and made the VF link flows all similar.  Refactor the flows of adding
and deleting MAC addresses in order to simplify the logic for error
conditions and setting/clearing the VF's default MAC address field.

Michal moves the setting of the default ITR value from ice_cfg_itr() to
the function where we allocate queue vectors.  Adds support for saving and
restoring the ITR value for each queue.  Adds a check for all invalid
or unused parameters to log the information and return an error.

Vignesh cleans up the driver where we were trying to write to read-only
registers for the receive flex descriptors.

Tony changes a netdev_info() to netdev_dbg() when the MTU value is
changed.

Bruce suppresses a coverity reported error that was not really an error
by adding a code comment.

Mitch adds a check for a NULL receive descriptor to resolve a coverity
reported issue.

Krzysztof prevents a potential general protection fault by adding a
boundary check to see if the queue id is greater than the size of a UMEM
array.  Adds additional code comments to assist coverity in its scans to
prevent false positives.

Jake adds support for E822 devices to the driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ice: Add device ids for E822 devices

Add support for E822 devices

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Suppress Coverity warnings for xdp_rxq_info_reg

Coverity reports some of the calls to xdp_rxq_info_reg() as potential
issues, because the driver does not check its return value. However,
those calls are wrapped with "if (!xdp_rxq_info_is_reg(&ring->xdp_rxq))"
and this check alone is enough to be sure that the function will never
fail.

All possible states of xdp_rxq_info are:
- NEW,
- REGISTERED,
- UNREGISTERED,
- UNUSED.

The driver won't mark a queue as UNUSED under any circumstance, so the
return value can be ignored safely.

Add comments for Coverity right above calls to xdp_rxq_info_reg() to
suppress the warnings.

Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add a boundary check in ice_xsk_umem()

In ice_xsk_umem(), variable qid which is later used as an array index,
is not validated for a possible boundary exceedance. Because of that,
a calling function might receive an invalid address, which causes
general protection fault when dereferenced.

To address this, add a boundary check to see if qid is greater than the
size of a UMEM array. Also, don't let user change vsi->num_xsk_umems
just by trying to setup a second UMEM if its value is already set up
(i.e. UMEM region has already been allocated for this VSI).

While at it, make sure that ring->zca.free pointer is always zeroed out
if there is no UMEM on a specified ring.
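
A minimal sketch of the kind of boundary check meant here, assuming a
per-VSI table indexed by queue id. struct fake_vsi and xsk_umem_lookup()
are hypothetical and not the driver's real types. The sketch rejects
qid >= size, since an index equal to the array size is out of range too.

```c
#include <stddef.h>

/* Hypothetical stand-in for the VSI's UMEM bookkeeping. */
struct fake_vsi {
	void **xsk_umems;            /* one entry per queue, may be NULL */
	unsigned int num_xsk_umems;  /* number of entries in xsk_umems */
};

/* Returns the UMEM for qid, or NULL if there is no table or qid is out of range. */
static void *xsk_umem_lookup(const struct fake_vsi *vsi, unsigned int qid)
{
	if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems)
		return NULL;
	return vsi->xsk_umems[qid];
}
```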

Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: add extra check for null Rx descriptor

In the case where the hardware gives us a null Rx descriptor, it is
theoretically possible that we could call one of our skb-construction
functions with no data pointer, which would cause a panic.

In real life, this will never happen - we only get null RX
descriptors as the final descriptor in a chain of otherwise-valid
descriptors. When this happens, the skb will be extant and we'll just
call ice_add_rx_frag(), which can deal with empty data buffers.

Unfortunately, Coverity does not have intimate knowledge of our
hardware, so we must add a check here.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: suppress checked_return error

Coverity reports an error that is not really an error; suppress it.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Demote MTU change print to debug

Following the changes of commit 12299132b3d3 ("net: ethernet: intel: Demote
MTU change prints to debug"), change the MTU change message to netdev_dbg().

Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Enable ip link show on the PF to display VF unicast MAC(s)

Currently when there are SR-IOV VF(s) and the user does "ip link show <pf
interface>" the VF unicast MAC addresses all show 00:00:00:00:00:00
if the unicast MAC was set via VIRTCHNL (i.e. not administratively set
by the host PF).

This is misleading to the host administrator. Fix this by setting the
VF's dflt_lan_addr.addr when the VF's unicast MAC address is
configured via VIRTCHNL. There are a couple cases where we don't allow
the dflt_lan_addr.addr field to be written. First, if the VF's
pf_set_mac field is true and the VF is not trusted, then we don't allow
the dflt_lan_addr.addr to be modified. Second, if the
dflt_lan_addr.addr has already been set (i.e. via VIRTCHNL).
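
The two exceptions can be written as a single predicate. This is a sketch
of the rules as stated in this message, not the driver code:
may_write_dflt_lan_addr() and mac_is_zero() are hypothetical helpers, and
"already set" is assumed to mean a non-zero address.

```c
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

/* True when no address has been recorded yet (all bytes zero). */
static bool mac_is_zero(const unsigned char mac[MAC_LEN])
{
	static const unsigned char zero[MAC_LEN];

	return memcmp(mac, zero, MAC_LEN) == 0;
}

/* Decides whether a MAC configured via VIRTCHNL may be recorded. */
static bool may_write_dflt_lan_addr(bool pf_set_mac, bool trusted,
				    const unsigned char cur[MAC_LEN])
{
	/* Case 1: the host PF set the MAC and the VF is not trusted. */
	if (pf_set_mac && !trusted)
		return false;
	/* Case 2: an address has already been recorded. */
	if (!mac_is_zero(cur))
		return false;
	return true;
}
```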

Also a small refactor was done to separate the flow for add and delete
MAC addresses in order to simplify the logic for error conditions
and set/clear the VF's dflt_lan_addr.addr field.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Fix VF link state when it's IFLA_VF_LINK_STATE_AUTO

Currently the flow for ice_set_vf_link_state() is not configuring link
the same as all other VF link configuration flows. Fix this by only
setting the necessary VF members in ice_set_vf_link_state() and then
call ice_vc_notify_link_state() to actually configure link for the
VF. This made ice_set_pfe_link_forced() unnecessary, so it was
deleted. Also, this commonizes the link flows for the VF to all call
ice_vc_notify_link_state().

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Remove Rx flex descriptor programming

Remove Rx flex descriptor metadata and flag programming; per specification
these registers cannot be written to as they are read only.

Signed-off-by: Vignesh Sridhar <vignesh.sridhar@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Return error on not supported ethtool -C parameters

Check for all unused parameters; if ethtool sent one of them,
print info about that and return an error.

Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Restore interrupt throttle settings after VSI rebuild

After each rebuild the driver deallocates q_vectors, so the interrupt
throttle rate (ITR) settings get lost.

Create a function to save and restore ITR for each queue. If a user
increases the number of queues, restore all the previous queue
settings for each existing queue, and the additional queues will
get the default setting.
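
The restore step can be sketched as below, assuming the saved settings
are kept in a plain array indexed by queue. restore_itr() and DEFAULT_ITR
are hypothetical; the value 50 is a placeholder and not the driver's
actual default.

```c
#include <stddef.h>

#define DEFAULT_ITR 50u  /* placeholder value, not taken from the driver */

/*
 * Fill itr[0..num_queues) after a rebuild: queues that existed before
 * get their saved setting back, additional queues get the default.
 */
static void restore_itr(unsigned int *itr, size_t num_queues,
			const unsigned int *saved, size_t num_saved)
{
	size_t i;

	for (i = 0; i < num_queues; i++)
		itr[i] = i < num_saved ? saved[i] : DEFAULT_ITR;
}
```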

Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Set default value for ITR in alloc function

When the user sets itr_setting to zero from ethtool -C, the driver changes
this value to default in ice_cfg_itr (for example after changing ring
param). Remove code that sets default value in ice_cfg_itr and move it to
place where the driver allocates q_vectors.

Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add ice_for_each_vf() macro

Currently we do "for (i = 0; i < pf->num_alloc_vfs; i++)" all over the
place. Many other places use macros to contain this repeated for loop,
so create the macro ice_for_each_vf(pf, i) that does the same thing.

There were a couple places we were using one loop variable and a VF
iterator, which were changed to using a local variable within the
ice_for_each_vf() macro.

Also in ice_alloc_vfs() we were setting pf->num_alloc_vfs after doing
"for (i = 0; i < num_alloc_vfs; i++)". Instead assign pf->num_alloc_vfs
right after allocating memory for the pf->vf array.
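
A minimal sketch of such an iterator macro, assuming pared-down stand-ins
for the driver structures (struct pf, struct vf and assign_vf_ids() here
are illustrative, not the real definitions):

```c
#include <stddef.h>

/* Pared-down stand-ins for the driver structures (assumed). */
struct vf {
	int vf_id;
};

struct pf {
	struct vf *vf;
	int num_alloc_vfs;
};

/* Iterate i over every allocated VF index of pf. */
#define ice_for_each_vf(pf, i) \
	for ((i) = 0; (i) < (pf)->num_alloc_vfs; (i)++)

/* Example user: number the VFs and return how many were visited. */
static int assign_vf_ids(struct pf *pf)
{
	int i, n = 0;

	ice_for_each_vf(pf, i) {
		pf->vf[i].vf_id = i;
		n++;
	}
	return n;
}
```

The loop bound is read from pf->num_alloc_vfs on every iteration, which is
why the count has to be assigned before the first loop over the array.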

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add code to keep track of current dflt_vsi

We can't have more than one default VSI so prevent another VSI from
overwriting the current dflt_vsi. This was achieved by adding the
following functions:

ice_is_dflt_vsi_in_use()
- Used to check if the default VSI is already being used.

ice_is_vsi_dflt_vsi()
- Used to check if VSI passed in is in fact the default VSI.

ice_set_dflt_vsi()
- Used to set the default VSI via a switch rule

ice_clear_dflt_vsi()
- Used to clear the default VSI via a switch rule.

Also, there was no need to introduce any locking because all mailbox
events and synchronization of switch filters for the PF happen in the
service task.
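
The bookkeeping behind these four helpers can be modelled roughly as
below. This is a sketch under assumptions: the struct names, the shortened
function names and the error codes are chosen for illustration, and the
switch-rule programming is reduced to a comment.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct vsi {
	int idx;
};

/* Hypothetical switch state: dflt_vsi is NULL when no default is set. */
struct sw {
	struct vsi *dflt_vsi;
};

static bool is_dflt_vsi_in_use(const struct sw *sw)
{
	return sw->dflt_vsi != NULL;
}

static bool is_vsi_dflt_vsi(const struct sw *sw, const struct vsi *vsi)
{
	return vsi != NULL && sw->dflt_vsi == vsi;
}

static int set_dflt_vsi(struct sw *sw, struct vsi *vsi)
{
	if (!vsi)
		return -EINVAL;
	if (is_vsi_dflt_vsi(sw, vsi))
		return 0;		/* already the default, nothing to do */
	if (is_dflt_vsi_in_use(sw))
		return -EEXIST;		/* never overwrite another VSI */
	/* a real driver would program the switch rule here */
	sw->dflt_vsi = vsi;
	return 0;
}

static int clear_dflt_vsi(struct sw *sw)
{
	if (!is_dflt_vsi_in_use(sw))
		return -ENODEV;
	/* a real driver would remove the switch rule here */
	sw->dflt_vsi = NULL;
	return 0;
}
```

No lock appears in the sketch because, as noted above, all callers are
assumed to run from a single context.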

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Fix VF spoofchk

There are many things wrong with the function
ice_set_vf_spoofchk().

1. The VSI being modified is the PF VSI, not the VF VSI.
2. We are enabling Rx VLAN pruning instead of Tx VLAN anti-spoof.
3. The spoofchk setting for each VF is not initialized correctly
or re-initialized correctly on reset.

To fix [1] we need to make sure we are modifying the VF VSI.
This is done by using the vf->lan_vsi_idx to index into the PF's
VSI array.

To fix [2] replace setting Rx VLAN pruning in ice_set_vf_spoofchk()
with setting Tx VLAN anti-spoof.

To fix [3] we need to make sure the initial VSI settings match what
is done in ice_set_vf_spoofchk() for spoofchk=on. Also make sure
this also works for VF reset. This was done by modifying ice_vsi_init()
to account for the current spoofchk state of the VF VSI.

Because of these changes, Tx VLAN anti-spoof needs to be removed
from ice_cfg_vlan_pruning(). This is okay for the VF because this
is now controlled from the admin enabling/disabling spoofchk. For the
PF, Tx VLAN anti-spoof should not be set. This change requires us to
call ice_set_vf_spoofchk() when configuring promiscuous mode for
the VF which requires ice_set_vf_spoofchk() to move in order to prevent
a forward declaration prototype.

Also, add VLAN 0 by default when allocating a VF since the PF is unaware
if the guest OS is running the 8021q module. Without this, MDD events will
trigger on untagged traffic because spoofcheck is enabled by default. Due
to this change, ignore add/delete messages for VLAN 0 from VIRTCHNL since
this is added/deleted during VF initialization/teardown respectively and
should not be modified.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Support UDP segmentation offload

Based on the work done by Alex Duyck on other Intel drivers, add code to
support UDP segmentation offload (USO) for the ice driver.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

bna: remove set but not used variable 'pgoff'

drivers/net/ethernet/brocade/bna/bfa_ioc.c: In function
‘bfa_ioc_fwver_clear’:
drivers/net/ethernet/brocade/bna/bfa_ioc.c:1127:13: warning: variable
‘pgoff’ set but not used [-Wunused-but-set-variable]

It is never used, and so can be removed.

Signed-off-by: yu kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netsec: Change page pool nid to NUMA_NO_NODE

The current driver only exists on a non-NUMA-aware machine.
With 44768decb7c0 ("page_pool: handle page recycle for NUMA_NO_NODE condition")
applied we can safely change that to NUMA_NO_NODE and accommodate future
NUMA-aware hardware using the netsec network interface.

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: Remove redundant BUG_ON() check in l2tp_pernet

Passing NULL to l2tp_pernet causes a crash via BUG_ON.
Dereferencing net in net_generic() also has the same effect.
This patch removes the redundant BUG_ON check on the same parameter.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove redundant BUG_ON() check in phonet_pernet

Passing NULL to phonet_pernet causes a crash via BUG_ON.
Dereferencing net in net_generic() also has the same effect.
This patch removes the redundant BUG_ON check on the same parameter.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove the check argument from __skb_gro_checksum_convert

The argument is always ignored, so remove it.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: remove set but not used variable 'lsettings'

Fixes gcc '-Wunused-but-set-variable' warning:

net/ethtool/linkmodes.c: In function 'ethnl_set_linkmodes':
net/ethtool/linkmodes.c:326:32: warning:
variable 'lsettings' set but not used [-Wunused-but-set-variable]
struct ethtool_link_settings *lsettings;
^
It is never used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use REXMIT_NEW instead of magic number

REXMIT_NEW is a macro for "FRTO-style
transmit of unsent/new packets", this patch
makes it more readable.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enetc: add support time specific departure base on the qos etf

ENETC implements a time specific departure capability, which enables
the user to specify when a frame can be transmitted. When this
capability is enabled, the device will delay the transmission of
the frame so that it can be transmitted at the precisely specified time.
The departure time can be up to 0.5 seconds in the future. If the
departure time in the transmit BD has not yet been reached, based
on the current time, the packet will not be transmitted.

This feature is driven by the ETF qdisc. The user can enable it with tc
commands. Here are the example commands:

tc qdisc add dev eth0 root handle 1: mqprio \
    num_tc 8 map 0 1 2 3 4 5 6 7 hw 1
tc qdisc replace dev eth0 parent 1:8 etf \
    clockid CLOCK_TAI delta 30000 offload

These examples set the queue mapping first and then configure queue 7
with a 30us ahead dequeue time.

To send test frames, the user should then enable the SO_TXTIME feature
on the socket.
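
As a rough illustration of the sending side: once SO_TXTIME has been
enabled on the socket with setsockopt(), each frame carries its departure
time in an SCM_TXTIME control message. The helper below only builds that
control message; attach_txtime() is a hypothetical name, and the fallback
value 61 for SO_TXTIME is taken from asm-generic/socket.h and may differ
on some architectures.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61	/* asm-generic value; fallback for old headers */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

/*
 * Attach an SCM_TXTIME control message carrying the departure time
 * (nanoseconds on the clock chosen via SO_TXTIME) to msg. buf must be
 * suitably aligned and hold CMSG_SPACE(sizeof(uint64_t)) bytes.
 * Returns 0 on success, -1 if buf is too small.
 */
static int attach_txtime(struct msghdr *msg, void *buf, size_t buflen,
			 uint64_t txtime_ns)
{
	struct cmsghdr *cm;

	if (buflen < CMSG_SPACE(sizeof(txtime_ns)))
		return -1;

	memset(buf, 0, CMSG_SPACE(sizeof(txtime_ns)));
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));
	return 0;
}
```

The resulting msghdr would then be passed to sendmsg(); with the etf
configuration above, the timestamp is expected to be on CLOCK_TAI.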

There are also some limitations for this feature in hardware:
- Transmit checksum offloads and time specific departure operation
are mutually exclusive.
- Time Aware Shaper feature (Qbv) offload and time specific departure
operation are mutually exclusive.

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fsl/fman: use resource_size

Use resource_size rather than a verbose computation on
the end and start fields.

The semantic patch that makes these changes is as follows:
(http://coccinelle.lip6.fr/)

<smpl>
@@ struct resource ptr; @@
- (ptr.end + 1 - ptr.start)
+ resource_size(&ptr)

@@ struct resource *ptr; @@
- (ptr->end + 1 - ptr->start)
+ resource_size(ptr)
</smpl>
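
For reference, the helper computes the inclusive length of the range. A
standalone sketch, with a minimal stand-in for the kernel's struct
resource (only the two fields used here, and a fixed-width type assumed in
place of resource_size_t):

```c
#include <stdint.h>

/* Minimal stand-in for struct resource; end is inclusive. */
struct resource {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes covered by the range [start, end]. */
static inline uint64_t resource_size(const struct resource *res)
{
	return res->end - res->start + 1;
}
```

A region spanning 0x1000 to 0x1fff therefore has size 0x1000, which is the
same value the open-coded (end + 1 - start) expression produced.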

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: ptp_clockmatrix: constify copied structure

The idtcm_caps structure is only copied into another structure,
so make it const.

The opportunity for this change was found using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: Remove unnecessary dependencies on I2C

Only the SFC4000 code, now moved to sfc-falcon, needed I2C.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-Add-support-for-L3-domains-to-MD5-auth'

David Ahern says:

====================
tcp: Add support for L3 domains to MD5 auth

With VRF, the scope of network addresses is limited to the L3 domain
the device is associated with. MD5 keys are based on addresses, so proper
VRF support requires an L3 domain to be considered for the lookups.

Leverage the new TCP_MD5SIG_EXT option to add support for a device index
to MD5 keys. The __tcpm_pad entry in tcp_md5sig is renamed to tcpm_ifindex
and a new flag, TCP_MD5SIG_FLAG_IFINDEX, in tcpm_flags determines if the
entry is examined. This follows what was done for MD5 and prefixes with
commits
8917a777be3b ("tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix")
6797318e623d ("tcp: md5: add an address prefix for key lookup")

Handling both a device AND L3 domain is much more complicated for the
response paths. This set focuses only on L3 support - requiring the
device index to be an l3mdev (ie, VRF). Support for slave devices can
be added later if desired, much like the progression of support for
sockets bound to a VRF and then bound to a device in a VRF. Kernel
code is set up to explicitly call out that the current lookup is for an L3
index, while the uapi just references a device index allowing its
meaning to include other devices in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

fcnal-test: Add TCP MD5 tests for VRF

Add tests for new TCP MD5 API for L3 domains (VRF).

A new namespace is added to create a duplicate configuration between
the VRF and default VRF to verify overlapping config is handled properly.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fcnal-test: Add TCP MD5 tests

Add tests for existing TCP MD5 APIs - both single address
config and the new extended API for prefixes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nettest: Add support for TCP_MD5 extensions

Update nettest to implement TCP_MD5SIG_EXT for a prefix and a device.

Add a new option, -m, to specify a prefix and length to use with MD5
auth. The device option comes from the existing -d option. If either
are set and MD5 auth is requested, TCP_MD5SIG_EXT is used instead of
TCP_MD5SIG.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nettest: Return 1 on MD5 failure for server mode

On failure to set MD5 password, do_server should return 1 so that the
program exits with 1 rather than 255. This is used for negative testing
when adding MD5 with device option.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add device index to tcp_md5sig

Add support for userspace to specify a device index to limit the scope
of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
is renamed to tcpm_ifindex and the new field is only checked if the new
TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
must point to an L3 master device (e.g., VRF). The API and error
handling are set up to allow the constraint to be relaxed in the future
to any device index.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Add l3index to tcp_md5sig_key and md5 functions

Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.

The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.
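
The ingress side of this can be modelled as below. It is a simplified
sketch, not the kernel's lookup: struct md5_key and both function names
are hypothetical, addresses are plain IPv4 integers, and it assumes that a
key with l3index 0 is unscoped and matches in any domain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified key entry. */
struct md5_key {
	uint32_t addr;		/* peer IPv4 address */
	int l3index;		/* 0 means not scoped to an L3 domain */
	const char *secret;
};

/*
 * When sdif is set, the packet came in through a device enslaved to an
 * L3 domain and dif is the VRF device; otherwise there is no domain.
 */
static int l3index_from_ingress(int dif, int sdif)
{
	return sdif ? dif : 0;
}

/* Return the first key matching addr within the given L3 domain. */
static const struct md5_key *md5_lookup(const struct md5_key *keys, size_t n,
					uint32_t addr, int l3index)
{
	for (size_t i = 0; i < n; i++) {
		if (keys[i].l3index && keys[i].l3index != l3index)
			continue;	/* scoped to a different domain */
		if (keys[i].addr == addr)
			return &keys[i];
	}
	return NULL;
}
```

With this shape, the same peer address can carry different secrets in two
VRFs, which is the overlapping-configuration case the series targets.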

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4/tcp: Pass dif and sdif to tcp_v4_inbound_md5_hash

The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v4_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hash

The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4/tcp: Use local variable for tcp_md5_addr

Extract the typecast to (union tcp_md5_addr *) to a local variable
rather than the current long, inline declaration with function calls.

No functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Fix alignment and code style of vxlan.c

Fixed Coding function and style issues

Signed-off-by: Niu Xilei <niu_xilei@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Allow-setting-default-port-priority'

Ido Schimmel says:

====================
mlxsw: Allow setting default port priority

Petr says:

When LLDP APP TLV selector 1 (EtherType) is used with PID of 0, the
corresponding entry specifies "default application priority [...] when
application priority is not otherwise specified."

mlxsw currently supports this type of APP entry, but uses it only as a
fallback for unspecified DSCP rules. However non-IP traffic is prioritized
according to port-default priority, not according to the DSCP-to-prio
tables, and thus it's currently not possible to prioritize such traffic
correctly.

This patchset extends the use of the abovementioned APP entry to also set
default port priority (in patches #1 and #2) and then (in patch #3) adds a
selftest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add a self-test for port-default priority

Send non-IP traffic to a port and observe that it gets prioritized
according to the lldptool app=$prio,1,0 rules.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_dcb: Allow setting default port priority

When APP TLV selector 1 (EtherType) is used with PID of 0, the
corresponding entry specifies "default application priority [...] when
application priority is not otherwise specified."

mlxsw currently supports this type of APP entry, but uses it only as a
fallback for unspecified DSCP rules. However non-IP traffic is prioritized
according to port-default priority, not according to the DSCP-to-prio
tables, and thus it's currently not possible to prioritize such traffic
correctly.

Extend the use of the abovementioned APP entry to also set default port
priority.
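
One way to picture the selection: scan the APP table for entries with
selector 1 and PID 0 and derive the port default from them. The sketch
below is illustrative only; struct app_entry and default_port_prio() are
hypothetical, and both the "highest priority wins" rule and the fallback
of 0 when no such entry exists are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define APP_SEL_ETHERTYPE 1	/* APP TLV selector 1 (EtherType) */

/* Hypothetical, simplified APP table entry. */
struct app_entry {
	uint8_t selector;
	uint16_t protocol;	/* PID; 0 denotes the default priority */
	uint8_t priority;
};

/* Default port priority: highest-priority EtherType entry with PID 0. */
static uint8_t default_port_prio(const struct app_entry *app, size_t n)
{
	uint8_t prio = 0;

	for (size_t i = 0; i < n; i++) {
		if (app[i].selector == APP_SEL_ETHERTYPE &&
		    app[i].protocol == 0 && app[i].priority > prio)
			prio = app[i].priority;
	}
	return prio;
}
```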

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: reg: Add QoS Port DSCP to Priority Mapping Register

Add QPDP. This register controls the port default Switch Priority and
Color. The default Switch Priority and Color are used for frames where the
trust state uses default values. Currently there are two cases where this
applies: a port is in trust-PCP state, but a packet arrives untagged; and a
port is in trust-DSCP state, but a non-IP packet arrives.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'page_pool-NUMA-node-handling-fixes'

Jesper Dangaard Brouer says:

====================
page_pool: NUMA node handling fixes

The recently added NUMA changes (merged for v5.5) to page_pool both
contain a bug in handling the NUMA_NO_NODE condition, and added code to
the fast-path.

This patchset fixes the bug and moves code out of fast-path. The first
patch contains a fix that should be considered for 5.5. The second
patch reduces code size and overhead in case CONFIG_NUMA is disabled.

Currently the NUMA_NO_NODE setting bug only affects driver 'ti_cpsw'
(drivers/net/ethernet/ti/), but after this patchset, we plan to move
other drivers (netsec and mvneta) to use NUMA_NO_NODE setting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

page_pool: help compiler remove code in case CONFIG_NUMA=n

When kernel is compiled without NUMA support, then page_pool NUMA
config setting (pool->p.nid) doesn't make any practical sense. The
compiler cannot see that it can remove the code paths.

This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA,
in allocation and numa check code, which helps compiler to see the
optimisation potential. It leaves update code intact to keep API the
same.

$ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
                           net/core/page_pool.o-numa-disabled
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
Function                                     old     new   delta
page_pool_create                             401     398      -3
__page_pool_alloc_pages_slow                 439     426     -13
page_pool_refill_alloc_cache                 425     328     -97
Total: Before=3611, After=3498, chg -3.13%

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

page_pool: handle page recycle for NUMA_NO_NODE condition

The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing RX-packet during softirq/NAPI, as
illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
as the preferred nid.  It is only in rare situations, where
e.g. NUMA zone runs dry, that a page doesn't get allocated from the
preferred nid.  The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NAPI check to when alloc cache is refilled, via
dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
pages from remote NUMA into the ptr_ring, as the dequeue/consume step
will check the NUMA node. All current drivers using page_pool will
alloc/refill RX-ring from same CPU running softirq/NAPI process.

Drivers that control the nid explicitly also use page_pool_update_nid
when changing nid at runtime.  To speed up transition to the new nid the
alloc cache is now flushed on nid changes.  This forces pages to come from
ptr_ring, which does the appropriate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid changes at runtime by consulting
numa_mem_id() based on the CPU processing RX-packets.

Notice, to avoid stressing the page buddy allocator and avoid doing too
much work under softirq with preempt disabled, the NUMA check at
ptr_ring dequeue will break the refill cycle, when detecting a NUMA
mismatch. This will cause a slower transition, but it's done on purpose.
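
The refill behaviour described above can be modelled in userspace roughly
as follows. This is a sketch, not the kernel code: the ptr_ring is reduced
to an array with a read position, "releasing" a page is just a counter,
and all names (struct pool, refill_alloc_cache(), CACHE_MAX) are
hypothetical. The current CPU's node is passed in as cpu_nid in place of
numa_mem_id().

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define CACHE_MAX 8	/* illustrative alloc cache size */

struct page {
	int nid;	/* NUMA node the page lives on */
};

struct pool {
	int nid;			/* configured nid, may be NUMA_NO_NODE */
	struct page **ring;		/* stand-in for the ptr_ring */
	size_t ring_len, ring_pos;
	struct page *cache[CACHE_MAX];	/* alloc cache */
	size_t cache_cnt;
	size_t released;		/* pages given back to the allocator */
};

/* Refill the alloc cache from the ring, checking the NUMA node. */
static size_t refill_alloc_cache(struct pool *pool, int cpu_nid)
{
	/* With NUMA_NO_NODE, prefer the node of the CPU doing the refill. */
	int pref_nid = pool->nid == NUMA_NO_NODE ? cpu_nid : pool->nid;

	while (pool->cache_cnt < CACHE_MAX && pool->ring_pos < pool->ring_len) {
		struct page *page = pool->ring[pool->ring_pos++];

		if (page->nid == pref_nid) {
			pool->cache[pool->cache_cnt++] = page;
		} else {
			/* Remote page: release it and stop this refill cycle. */
			pool->released++;
			break;
		}
	}
	return pool->cache_cnt;
}
```

Breaking out on the first mismatch means at most one remote page is
released per refill, which is the deliberate slow transition mentioned
above.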

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2019-12-31

This series contains updates to e1000e, igb and igc only.

Robert Beckett provides an igb change to help keep packets from being
dropped due to the receive descriptor ring being full when receive
flow control is enabled.  It creates a separate function to set up
SRRCTL for ease of reuse and ensures that the drop enable bit is set
only if receive flow control is not enabled.

Sasha adds scatter gather support in igc.  Improves the direct memory
address mapping flow by optimizing and simplifying it.  Updates igc to
use pci_release_mem_regions() instead of
pci_release_selected_regions().  Cleans up function header comments to
align with the actual code.  Adds support for 64 bit DMA access, to help
handle socket buffer fragments in high memory.  Adds legacy power
management support in igc by implementing suspend, resume,
runtime_suspend/resume, and runtime_idle callbacks.  Cleans up references
to the Serdes interface in igc since that interface is not supported for
i225 devices.

Alex replaces the pr_info calls with netdev_info in all cases related to
netdev link state, as suggested by Joe Perches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Remove serdes comments from a description of methods

The Serdes interface is not applicable for i225 devices.
Remove it from the comments and make the comments clearer.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

e1000e: Use netdev_info instead of pr_info for link messages

Replace the pr_info calls with netdev_info in all cases related to the
netdevice link state.

As a result of this patch the link messages will change as shown below.
Before:
e1000e: ens3 NIC Link is Down
e1000e: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

After:
e1000e 0000:00:03.0 ens3: NIC Link is Down
e1000e 0000:00:03.0 ens3: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Add legacy power management support

Add suspend, resume, runtime_suspend, runtime_resume and
runtime_idle callbacks implementation.

Reported-by: kbuild test robot <lpk@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge git://git./linux/kernel/git/netdev/net

Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.

Signed-off-by: David S. Miller <davem@davemloft.net>

igc: Add 64 bit DMA access support

On relevant platforms ndo_start_xmit can handle socket buffer
fragments in high memory.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Fix parameter descriptions for a several functions

The parameter descriptions of igc_watchdog, igc_set_interrupt_capability,
igc_init_interrupt_scheme, __igc_open and __igc_close did not reflect
what the functions do. Add meaningful descriptions.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Fix the parameter description for igc_alloc_rx_buffers

The function description for igc_alloc_rx_buffers did not reflect
what the function does. Add a meaningful description.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Remove excess parameter description from igc_is_non_eop

The function description for igc_is_non_eop includes an extra @skb
parameter description. This parameter doesn't exist on the function, so
remove it.

Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Prefer to use the pci_release_mem_regions method

Use the pci_release_mem_regions method instead of the
pci_release_selected_regions method.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Improve the DMA mapping flow

Improve the probe flow and set both the DMA mask and the coherent DMA
mask to the same value. This makes the flow simpler and clearer.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igc: Add scatter gather support

Scatter gather is used to do DMA data transfers of data that is written to
noncontiguous areas of memory.
This patch enables scatter gather support.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: dont drop packets if rx flow control is enabled

If Rx flow control has been enabled (via autoneg or forced), packets
should not be dropped due to Rx descriptor ring exhaustion. Instead
pause frames should be used to apply back pressure. This only applies
if VFs are not in use.

Move SRRCTL setup to its own function for easy reuse and only set drop
enable bit if Rx flow control is not enabled.

Since v1: always enable dropping of packets if VFs in use.
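
The decision reduces to a small predicate. The sketch below follows the
description above only; setup_srrctl() and the SRRCTL_DROP_EN bit position
are hypothetical stand-ins, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SRRCTL_DROP_EN (1u << 31)	/* illustrative bit position */

/*
 * Return the SRRCTL value to program: dropping is always enabled when
 * VFs are in use; otherwise it is enabled only when Rx flow control is
 * off, so that pause frames apply back pressure instead.
 */
static uint32_t setup_srrctl(uint32_t base, unsigned int vfs_in_use,
			     bool rx_flow_ctrl)
{
	uint32_t srrctl = base & ~SRRCTL_DROP_EN;

	if (vfs_in_use || !rx_flow_ctrl)
		srrctl |= SRRCTL_DROP_EN;
	return srrctl;
}
```

Keeping this in one helper lets the same rule be re-applied whenever the
flow control state changes after link negotiation.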

Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge git://git./linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

1) Fix big endian overflow in nf_flow_table, from Arnd Bergmann.

2) Fix port selection on big endian in nft_tproxy, from Phil Sutter.

3) Fix precision tracking for unbound scalars in bpf verifier, from
    Daniel Borkmann.

4) Fix integer overflow in socket rcvbuf check in UDP, from Antonio
    Messina.

5) Do not perform a neigh confirmation during a pmtu update over a
    tunnel, from Hangbin Liu.

6) Fix DMA mapping leak in dpaa_eth driver, from Madalin Bucur.

7) Various PTP fixes for sja1105 dsa driver, from Vladimir Oltean.

8) Add missing inline to dummy definition of of_mdiobus_child_is_phy(),
    from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
  net/sched: add delete_empty() to filters and use it in cls_flower
  tcp: Fix highest_sack and highest_sack_seq
  ptp: fix the race between the release of ptp_clock and cdev
  net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
  Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
  net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
  net: dsa: sja1105: Really make the PTP command read-write
  net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
  cxgb4/cxgb4vf: fix flow control display for auto negotiation
  mlxsw: spectrum: Use dedicated policer for VRRP packets
  mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
  net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
  net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
  net_sched: sch_fq: properly set sk->sk_pacing_status
  bnx2x: Fix accounting of vlan resources among the PFs
  bnx2x: Use appropriate define for vlan credit
  of: mdio: Add missing inline to of_mdiobus_child_is_phy() dummy
  net: phy: aquantia: add suspend / resume ops for AQR105
  dpaa_eth: fix DMA mapping leak
  ...

Merge tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull tomoyo fixes from Tetsuo Handa:
"Two bug fixes:

   - Suppress RCU warning at list_for_each_entry_rcu()

   - Don't use fancy names on sockets"

* tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
  tomoyo: Suppress RCU warning at list_for_each_entry_rcu().
  tomoyo: Don't use nifty names on sockets.

hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()

hsr slave interfaces don't have a debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when an hsr slave interface
name is changed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link set dummy0 name ap

Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666]  dump_stack+0x96/0xdb
[21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666]  __kasan_report+0x12a/0x16f
[21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666]  kasan_report+0xe/0x20
[21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666]  ? __module_text_address+0x13/0x140
[21071.945897][T22666]  notifier_call_chain+0x90/0x160
[21071.946743][T22666]  dev_change_name+0x419/0x840
[21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
[21071.951991][T22666]  do_setlink+0x811/0x2ef0
[21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
[ ... ]

Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dcd3 ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: add delete_empty() to filters and use it in cls_flower

Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantics of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic into a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.
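
The callback arrangement described above can be sketched in user space as
follows. This is a reduced model, not the kernel code: the struct and
function names (tp_model, tp_ops_model, tp_check_delete,
flower_like_delete_empty) are hypothetical, nr_handles merely stands in
for "IDRs have been allocated", and all locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct tp_model;

/* Hypothetical, reduced stand-in for a classifier ops table. */
struct tp_ops_model {
    /* Optional: returns true if the filter is empty and may be deleted. */
    bool (*delete_empty)(struct tp_model *tp);
};

struct tp_model {
    const struct tp_ops_model *ops;
    bool deleting;           /* set once the filter is marked for deletion */
    unsigned int nr_handles; /* stand-in for "IDRs allocated" */
};

/* Flower-like check: the filter is empty only if no handle is allocated. */
static bool flower_like_delete_empty(struct tp_model *tp)
{
    tp->deleting = (tp->nr_handles == 0);
    return tp->deleting;
}

/*
 * Generic delete path: filters that provide delete_empty() decide for
 * themselves; all others are marked as deleting without any walk.
 */
static bool tp_check_delete(struct tp_model *tp)
{
    if (tp->ops->delete_empty)
        return tp->ops->delete_empty(tp);

    tp->deleting = true;
    return true;
}

/* One ops table with the callback, one without (u32-like). */
static const struct tp_ops_model flower_like_ops = {
    .delete_empty = flower_like_delete_empty,
};

static const struct tp_ops_model u32_like_ops = {
    .delete_empty = NULL,
};
```

In this model a u32-like filter is always reported as deletable, while a
flower-like filter is reported as deletable only once nr_handles drops to
zero.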

This reverts commit 275c44aa194b7159d1191817b20e076f55f0e620.

Changes since v1:
- document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
  is used, thanks to Vlad Buslov
- implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
- squash revert and new fix in a single patch, to be nice with bisect
  tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa194b ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ncsi: Fix gma flag setting after response

gma_flag was set at the time of the GMA command request, but it should
only be set after getting a successful response. Move this flag
setting into the GMA response handler.

This flag is mainly used to avoid repeating the GMA command once the
MAC address has been received.
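
A minimal user-space sketch of the intended ordering is given here. The
names (ncsi_dev_model, gma_request, gma_response) are hypothetical and do
not correspond to the driver's real functions; only the position of the
flag assignment is being illustrated.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the per-device NCSI state. */
struct ncsi_dev_model {
    bool gma_flag;         /* set once a MAC address has been received */
    unsigned int gma_sent; /* number of GMA requests issued so far */
};

/*
 * Request side: issue GMA only if no MAC address was obtained yet.
 * The flag is deliberately NOT set here.
 */
static bool gma_request(struct ncsi_dev_model *nd)
{
    if (nd->gma_flag)
        return false;

    nd->gma_sent++;
    return true;
}

/* Response side: the flag is set only after a successful response. */
static void gma_response(struct ncsi_dev_model *nd, bool success)
{
    if (success)
        nd->gma_flag = true;
}
```

With the flag set in the response handler, a failed response leaves
gma_flag clear, so the next gma_request() call issues the command again.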

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add enabled check for path tracepoint loop.

sctp_outq_sack is the main function that handles SACKs, and it is called
very frequently. The commit "move trace_sctp_probe_path into sctp_outq_sack"
added the code below to this function. The sctp tracepoint is disabled most
of the time, but the transport list is always traversed even when the
tracepoint is disabled, which is unnecessary.

+       /* SCTP path tracepoint for congestion control debugging. */
+       list_for_each_entry(transport, transport_list, transports) {
+               trace_sctp_probe_path(transport, asoc);
+       }

This patch adds a tracepoint-enabled check outside the transport list
loop, so that the list is not traversed when tracing is disabled. It is
a small optimization.
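
The shape of the check can be modelled in user space as shown below. In
the kernel, each tracepoint comes with a generated
trace_<name>_enabled() helper. Here a plain boolean stands in for it, and
all names (probe_path_enabled, transport_model, sack_probe_paths, the two
counters) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the tracepoint's enabled state. */
static bool probe_path_enabled;

/* Counters used only to observe what the model does. */
static unsigned int transports_visited;
static unsigned int events_emitted;

/* Hypothetical, reduced transport list node. */
struct transport_model {
    struct transport_model *next;
};

/* Stand-in for the tracepoint call: emits only when enabled. */
static void trace_probe_path_model(const struct transport_model *t)
{
    (void)t;
    if (probe_path_enabled)
        events_emitted++;
}

/* Check once, outside the loop; skip the whole walk when disabled. */
static void sack_probe_paths(const struct transport_model *head)
{
    const struct transport_model *t;

    if (!probe_path_enabled)
        return;

    for (t = head; t; t = t->next) {
        transports_visited++;
        trace_probe_path_model(t);
    }
}
```

When probe_path_enabled is false, sack_probe_paths() returns before
touching the list, which is the saving the patch is after.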

Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Improvements-to-SJA1105-DSA-RX-timestamping'

Vladimir Oltean says:

====================
Improvements to SJA1105 DSA RX timestamping

This series makes the sja1105 DSA driver use a dedicated kernel thread
for RX timestamping, a process which is time-sensitive and otherwise a
bit fragile. This allows users to customize their system (probably an
embedded PTP switch) fully and allocate the CPU bandwidth for the driver
to expedite the RX timestamps as quickly as possible.

While doing this conversion, add a function to the PTP core for
cancelling this kernel thread (function which I found rather strange to
be missing).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>