git.openwrt.org Git - openwrt/staging/blogic.git/log

sock: do not set sk_err in sock_dequeue_err_skb

Do not set sk_err when dequeuing errors from the error queue.
Doing so results in:
a) Bugs: By overwriting existing sk_err values, it possibly
   hides legitimate errors. It is also incorrect when local
   errors are queued with ip_local_error. That happens in the
   context of a system call, which already returns the error
   code.
b) Inconsistent behavior: When there are pending errors on
   the error queue, sk_err is sometimes 0 (e.g., for
   the first timestamp on the error queue) and sometimes
   set to an error code (after dequeuing the first
   timestamp).
c) Suboptimality: Setting sk_err to ENOMSG on simple
   TX timestamps can abort parallel reads and writes.

Removing this line doesn't break userspace. This is because
userspace code cannot rely on sk_err for detecting whether
there is something on the error queue. Except for ICMP messages
received for UDP and RAW, sk_err is not set at enqueue time,
and as a result sk_err can be 0 while there are plenty of
errors on the error queue.

For ICMP packets in UDP and RAW, sk_err is set when they are
enqueued on the error queue, but that does not result in aborting
reads and writes. For such cases, sk_err is only readable via
getsockopt(SO_ERROR) which will reset the value of sk_err on
its own. More importantly, prior to this patch,
recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
sk_err is set by sock_dequeue_err_skb without atomic ops or
locks) which can store 0 in sk_err even when we have ICMP
messages pending. Removing this line from sock_dequeue_err_skb
eliminates that race.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'IFF_NO_QUEUE-semantics'

Jesper Dangaard Brouer says:

====================
qdisc and tx_queue_len cleanups for IFF_NO_QUEUE devices

This patchset is a cleanup for IFF_NO_QUEUE devices. It will
hopefully help userspace get a more consistent behavior when attaching
qdisc to such virtual devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device

It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration?  Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device.  The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device.  The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc.  The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc.  Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len

The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
default qdisc attached, given they will be backed by physical device,
that already have a qdisc attached for pushback.

It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
this can be useful for difference policy reasons (e.g. bandwidth
limiting containers).  For this to work, the tx_queue_len need to have
a sane value, because some qdiscs inherit/copy the tx_queue_len
(namely, pfifo, bfifo, gred, htb, plug and sfb).

Commit a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling
ether_setup()") caught situations where some drivers didn't initialize
tx_queue_len.  The problem with the commit was choosing 1 as the
fallback value.

A qdisc queue length of 1 causes more harm than good, because it
creates hard to debug situations for userspace. It gives userspace a
false sense of a working config after attaching a qdisc.  As low
volume traffic (that doesn't activate the qdisc policy) works,
like ping, while traffic that e.g. needs shaping cannot reach the
configured policy levels, given the queue length is too small.

This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
IFF_NO_QUEUE devices (that call ether_setup()) also use this value.

Fixes: a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: make default TX queue length a defined constant

The default TX queue length of Ethernet devices have been a magic
constant of 1000, ever since the initial git import.

Looking back in historical trees[1][2] the value used to be 100,
with the same comment "Ethernet wants good queues". The commit[3]
that changed this from 100 to 1000 didn't describe why, but from
conversations with Robert Olsson it seems that it was changed
when Ethernet devices went from 100Mbit/s to 1Gbit/s, because the
link speed increased x10 the queue size were also adjusted. This
value later caused much heartache for the bufferbloat community.

This patch merely moves the value into a defined constant.

[1] https://git.kernel.org/cgit/linux/kernel/git/davem/netdev-vger-cvs.git/
[2] https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/
[3] https://git.kernel.org/tglx/history/c/98921832c232

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'udp-fwd-mem-sched-on-dequeue'

Paolo Abeni says:

====================
udp: do fwd memory scheduling on dequeue

After commit 850cbaddb52d ("udp: use it's own memory accounting schema"),
the udp code needs to acquire twice the receive queue spinlock on dequeue.

This patch series remove the need for the second lock at skb free time,
moving the udp memory scheduling inside the dequeue operation; the skb
destructor field is not used anymore and an additional sk argument is added
to ip_cmsg_recv_offset() to cope with null skb->sk after dequeue.

Many thanks to Eric Dumazed for suggesting pretty all much the above.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

udp: do fwd memory scheduling on dequeue

A new argument is added to __skb_recv_datagram to provide
an explicit skb destructor, invoked under the receive queue
lock.
The UDP protocol uses such argument to perform memory
reclaiming on dequeue, so that the UDP protocol does not
set anymore skb->desctructor.
Instead explicit memory reclaiming is performed at close() time and
when skbs are removed from the receive queue.
The in kernel UDP protocol users now need to call a
skb_recv_udp() variant instead of skb_recv_datagram() to
properly perform memory accounting on dequeue.

Overall, this allows acquiring only once the receive queue
lock on dequeue.

Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr sinks vanilla patched
1 440 560
3 2150 2300
6 3650 3800
9 4450 4600
12 6250 6450

v1 -> v2:
- do rmem and allocated memory scheduling under the receive lock
- do bulk scheduling in first_packet_length() and in udp_destruct_sock()
- avoid the typdef for the dequeue callback

Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sock: add an explicit sk argument for ip_cmsg_recv_offset()

So that we can use it even after orphaining the skbuff.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Update raw socket bind to consider l3 domain

Binding a raw socket to a local address fails if the socket is bound
to an L3 domain:

$ vrf-test -s -l 10.100.1.2 -R -I red
error binding socket: 99: Cannot assign requested address

Update raw_bind to look consider if sk_bound_dev_if is bound to an L3
domain and use inet_addr_type_table to lookup the address.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ns2-amac'

Jon Mason says:

====================
add NS2 support to bgmac

Changes in v6:
* Use a common bgmac_phy_connect_direct (per Rafal Milecki)
* Rebased on latest net-next
* Added Reviewed-by to the relevant patches

Changes in v5:
* Change a pr_err to netdev_err (per Scott Branden)
* Reword the lane swap binding documentation (per Andrew Lunn)

Changes in v4:
* Actually send out the lane swap binding doc patch (Per Scott Branden)
* Remove unused #define (Per Andrew Lunn)

Changes in v3:
* Clean-up the bgmac DT binding doc (per Rob Herring)
* Document the lane swap binding and make it generic (Per Andrew Lunn)

Changes in v2:
* Remove the PHY power-on (per Andrew Lunn)
* Misc PHY clean-ups regarding comments and #defines (per Andrew Lunn)
  This results on none of the original PHY code from Vikas being
  present.  So, I'm removing him as an author and giving him
  "Inspired-by" credit.
* Move PHY lane swapping to PHY driver (per Andrew Lunn and Florian
  Fainelli)
* Remove bgmac sleep (per Florian Fainelli)
* Re-add bgmac chip reset (per Florian Fainelli and Ray Jui)
* Rebased on latest net-next
* Added patch for bcm54xx_auxctl_read, which is used in the BCM54810
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

arm64: dts: NS2: add AMAC ethernet support

Add support for the AMAC ethernet to the Broadcom Northstar2 SoC device
tree

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: bgmac: add NS2 support

Add support for the variant of amac hardware present in the Broadcom
Northstar2 based SoCs. Northstar2 requires an additional register to be
configured with the port speed/duplexity (NICPM). This can be added to
the link callback to hide it from the instances that do not use this.
Also, clearing of the pending interrupts on init is required due to
observed issues on some platforms.

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: bgmac: device tree phy enablement

Change the bgmac driver to allow for phy's defined by the device tree

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Acked-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: devicetree: net: add NS2 bindings to amac

Clean-up the documentation to the bgmac-amac driver, per suggestion by
Rob Herring, and add details for NS2 support.

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: Add BCM54810 PHY entry

The BCM54810 PHY requires some semi-unique configuration, which results
in some additional configuration in addition to the standard config.
Also, some users of the BCM54810 require the PHY lanes to be swapped.
Since there is no way to detect this, add a device tree query to see if
it is applicable.

Inspired-by: Vikas Soni <vsoni@broadcom.com>
Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: devicetree: add PHY lane swap binding

Add the documentation for PHY lane swapping. This is a boolean entry to
notify the phy device drivers that the TX/RX lanes need to be swapped.

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: add bcm54xx_auxctl_read

Add a helper function to read the AUXCTL register for the BCM54xx. This
mirrors the bcm54xx_auxctl_write function already present in the code.

Signed-off-by: Jon Mason <jon.mason@broadcom.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dwmac-sti-refactor-cleanup'

Joachim Eastwood says:

====================
stmmac: dwmac-sti refactor+cleanup

This patch set aims to remove the init/exit callbacks from the
dwmac-sti driver and instead use standard PM callbacks. Doing this
will also allow us to cleanup the driver.

Eventually the init/exit callbacks will be deprecated and removed
from all drivers dwmac-* except for dwmac-generic. Drivers will be
refactored to use standard PM and remove callbacks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: remove unused priv dev member

The dev member of struct sti_dwmac is not used anywhere in the driver
so lets just remove it.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: clean up and rename sti_dwmac_init

Rename sti_dwmac_init to sti_dwmac_set_mode which is a better
description for what it really does.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: move clk_prepare_enable out of init and add error handling

Add clock error handling to probe and in the process move clock enabling
out of sti_dwmac_init() to make this easier.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: move st, gmac_en parsing to sti_dwmac_parse_data

The sti_dwmac_init() function is called both from probe and resume.
Since DT properties doesn't change between suspend/resume cycles move
parsing of this parameter into sti_dwmac_parse_data() where it belongs.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: add PM ops and resume function

Implement PM callbacks and driver remove in the driver instead
of relying on the init/exit hooks in stmmac_platform. This gives
the driver more flexibility in how the code is organized.

Eventually the init/exit callbacks will be deprecated in favor
of the standard PM callbacks and driver remove function.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: remove clk NULL checks

Since sti_dwmac_parse_data() sets dwmac->clk to NULL if not clock was
provided in DT and NULL is a valid clock there is no need to check for
NULL before using this clock.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: dwmac-sti: remove useless of_node check

Since dwmac-sti is a DT only driver checking for OF node is not necessary.

Signed-off-by: Joachim Eastwood <manabian@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Tested-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '10GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2016-11-04

This series contains updates to ixgbe and ixgbevf only.

Don does cleanup and configuration for our X553 devices, related to LED,
auto-negotiation, flow control and SFP+ setup and config.  Adds the
(not secret) sauce for B0 hardware for X553 hardware.

Emil provides several fixes, first replaces the driver specific MDIO
defines for the more preferred equivalent kernel ones.  Provides a fix
for auto-negotiaion status, by reading a PHY register twice.  Introduces
ixgbe_link_operations structure to allow X550EM_a to override the
methods for MDIO access while X550EM_x provides methods to use I2C
combined access.

Mark fixes an issue where the driver was crashing when msix_entires
were not there because they were freed by a previous suspend or remove.

Sowmini Varadhan fixes an issue where an incorrect check for IPPROTO_UDP
in ixgbe_atr().  Then makes sure that the network and transport headers
in the paged data are available in the headlen bytes to calculate the
l4_proto.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Remove unused including <generated/utsrelease.h>

Remove including <generated/utsrelease.h> that don't need it.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbevf: Handle previously-freed msix_entries

The msix_entries memory can be freed by a previous suspend or
remove, so don't crash on close when it isn't there. Also only
clear the interrupts when the interface is up, because there
aren't any when it is not up.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: ixgbe_atr() compute l4_proto only if non-paged data has network/transport headers

For some Tx paths (e.g., tpacket_snd()), ixgbe_atr may be
passed down an sk_buff that has the network and transport
header in the paged data, so it needs to make sure these
headers are available in the headlen bytes to calculate the
l4_proto.

This patch expect that network and transport headers are
already available in the non-paged header dat. The assumption
is that the caller has set this up if l4_proto based Tx
steering is desired.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: ixgbe_atr() should access udp_hdr(skb) only for UDP packets

Commit 9f12df906cd8 ("ixgbe: Store VXLAN port number in network order")
incorrectly checks for hdr.ipv4->protocol != IPPROTO_UDP
in ixgbe_atr(). This check should be for "==" instead.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Correct X550 phy ID

We were using an old Alpha version of the X550 phy ID. This was leading
to unnecessary queries of the PHY. I removed the old ID (which shouldn't
be on any HW) and add the two that are.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add X553 FW ALEF support

This patch add X553 FW ALEF support for B0.  ALEF is the new unified
FW.  This contains updated register defines for ALEF speed
configuration.  Likewise it also removes the AN_CNTL_8 usage from
the native SFI flow as it is no longer supported by FW.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: set device if before calling get_invariants

Fix an issue where set_phy_power was NULL for X550 copper devices
because get_invariants was called before hw->device_id was set.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: use link instead of I2C combined abstraction

Introduce ixgbe_link_operations struct with the following changes:

read_i2c_combined => read_link
read_i2c_combined_unlocked => read_link_unlocked
write_i2c_combined => write_link
write_i2c_combined_unlocked => write_link_unlocked

This will allow X550EM_a to override these methods for MDIO access
while X550EM_x provides methods to use I2C combined access. This
also adds a new structure, ixgbe_link_info, to hold information
about the link. Initially this is just method pointers and a bus
address.

The functions involved in combined I2C accesses were moved from
ixgbe_phy.c to ixgbe_x550.c. The underlying functions that carry
out the combined I2C accesses were left in ixgbe_phy.c because
they share some functions with other I2C methods.

v2 - set hw->link.ops in probe.
v3 - check ii->link_ops before setting it since we don't have it
for all devices.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: remove SFP ixfi support

Remove SFP ixfi code since there is no HW that currently supports it.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Handle previously-freed msix_entries

The msix_entries memory can be freed by a previous suspend or
remove, so don't crash on close when it isn't there.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add X553 PHY FC autoneg support

This patch adds X553 flow control auto negotiation for fiber and
backplain. To enable this new function pointers were added as well
as creating a function to dynamically set function pointer we can't
define only on MAC type.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: fix link status check for copper X550em

Read the PHY register twice in order to get the correct value for
autoneg_status.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: do not use ixgbe specific mdio defines

Replace some ixgbe specific MDIO defines with their equivalent
from the kernel.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Update setup PHY link to unset all speeds

This patch updates ixgbe_setup_phy_link_generic to set/unset
auto-negotiation for all speeds. This ensures that unsupported
speeds are unset. This is necessary since the PHY NVM may
advertise unsupported speeds.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support to retrieve and store LED link active

This patch adds support to get the LED link active via the LEDCTL
register. If the LEDCTL register does not have LED link active
(LED mode field = 0x0100) set then default LED link active returned.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add X552 iXFI configuration helper function

X553 doesn't need all the initialization that X552 did for iXFI. This
patch will allow native SPI SFP+ to work with X553 devices. Future
patches will add additional configuration as needed.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'nfp-ring-reconfig-and-xdp-support'

Jakub Kicinski says:

====================
ring reconfiguration and XDP support

This set adds support for ethtool channel API and XDP.

I kick off with ethtool get_channels() implementation.
set_channels() needs some preparations to get right.  I follow
the prepare/commit paradigm and allocate all resources before
stopping the device.  It has already been done for ndo_change_mtu
and ethtool set_ringparam(), it makes sense now to consolidate all
the required logic in one place.

XDP support requires splitting TX rings into two classes -
for the stack and for XDP.  The ring structures are identical.
The differences are in how they are connected to IRQ vector
structs and how the completion/cleanup works.  When XDP is enabled
I switch from the frag allocator to page-per-packet and map buffers
BIDIRECTIONALly.

Last but not least XDP offload is added (the patch just takes
care of the small formal differences between cls_bpf and XDP).

There is a tiny & trivial DebugFS patch in the mix, I hope it can
be taken via net-next provided we have the right Acks.

Resending with improved commit message and CCing more people on patch 10.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add support for offload of XDP programs

Most infrastructure can be reused, provide separate handling
of context offsets and exit codes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove unnecessary parameters from nfp_net_bpf_offload()

nfp_net_bpf_offload() takes all .setup_tc() parameters but it
doesn't use them at the moment. Remove unnecessary ones to make
it possible for XDP to reuse this function.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add XDP support in the driver

Add XDP support. Separate stack's and XDP's TX rings logically.
Add functions for handling XDP_TX and cleanup of XDP's TX rings.
For XDP allocate all RX buffers as separate pages and map them
with DMA_BIDIRECTIONAL.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

debugfs: constify argument to debugfs_real_fops()

seq_file users can only access const version of file pointer,
because the ->file member of struct seq_operations is marked
as such. Make parameter to debugfs_real_fops() const.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Nicolai Stange <nicstange@gmail.com>
CC: Christian Lamparter <chunkeey@gmail.com>
CC: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: reorganize nfp_net_rx() to get packet offsets early

Calculate packet offsets early in nfp_net_rx() so that we will be
able to use them in upcoming XDP handler. While at it move relevant
variables into the loop scope.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add support for ethtool .set_channels

Allow changing the number of rings via ethtool .set_channels API.
Runtime reconfig needs to be extended to handle number of rings.
We need to be able to activate interrupt vectors before rings are
assigned to them.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: move RSS indirection table init into a separate function

We will need to rerun the initialization of the RSS indirection table
after the number of rings is changed. Move the code to a separate
function.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add helper to reassign rings to IRQ vectors

Instead of fixing ring -> vector relations up in ring swap functions
put the reassignment into a helper function which will reinit all
links.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: loosen relation between rings and IRQs vectors

Upcoming XDP support will break the assumption that one can iterate
over IRQ vectors to get to all the rings easily. Use nn->.x_ring
arrays directly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: reuse ring helpers on .ndo_open() path

Ring allocation helpers encapsulate all ring allocation and
initialization steps nicely. Reuse them on .ndo_open() path.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: rename ring allocation helpers

"Shadow" in ring helpers used to mean that the helper will allocate
rings without touching existing configuration, this was used for
reconfiguration while the device was running. We will soon use
the same helpers for .ndo_open() path, so replace "shadow" with
"ring_set".

No functional changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: centralize runtime reconfiguration logic

All functions which need to reallocate ring resources at runtime
look very similar. Centralize that logic into a separate function.
Encapsulate configuration parameters in a structure.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: add support for ethtool .get_channels

Report number of rings via ethtool .get_channels API.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'amd-xgbe-updates'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2016-11-03

This patch series is targeted at preparing the driver for a new PCI version
of the hardware. After this series is applied, a follow-on series will
introduce the support for the PCI version of the hardware.

The following updates and fixes are included in this driver update series:

- Fix formatting of PCS debug register dump
- Prepare for priority-based FIFO allocation
- Implement priority-based FIFO allocation
- Prepare for working with more than one type of PCS/PHY
- Prepare for the introduction of clause 37 auto-negotiation
- Add support for clause 37 auto-negotiation
- Prepare for supporting a new PCS register access method
- Add support for 64-bit management counter registers
- Update DMA channel status determination
- Prepare for supporting PCI devices in addition to platform devices

This patch series is based on net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Prepare for supporting PCI devices

Update the driver framework to separate out platform/ACPI specific code
from general code during device initialization. This will allow for the
introduction of PCI device support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Update how to determine DMA channel status

Tx and Rx DMA channel status determiniation is different depending on the
version of the hardware. Update the channel status processing code to
account for the change. Also, reduce the timeout value used when stopping
the channels.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Support for 64-bit management counter registers

Add support for reading all management counter registers as 64-bit
values. The indication of whether to read the high 32-bits to form
a 64-bit value is indicated in the version data.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Prepare for a new PCS register access method

Prepare the code to be able to support accessing of the PCS registers
in a new way, while maintaining the current access method. Provide a
version specific field that indicates the method to use.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Add support for clause 37 auto-negotiation

Add support to be able to use clause 37 auto-negotiation.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Prepare for introduction of clause 37 autoneg

Prepare for the future introduction of clause 37 auto-negotiation by
updating the current auto-negotiation related functions to identify
them as clause 73 functions. Move interrupt enablement to the
enable/disable auto-negotiation functions. Update what will be common
routines to check for the current type of AN and process accordingly.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Prepare for working with more than one type of phy

Prepare the code to be able to work with more than one type of phy by
adding additional callable functions into the phy interface and removing
phy specific settings/functions from non-phy related files.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Perform priority-based hardware FIFO allocation

Allocate the FIFO across the hardware Rx queues based on the priority
of the queues. Giving more FIFO resources to queues with a higher
priority. If PFC is active but not enabled for a queue, then less
resources can allocated to the queue.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Prepare for priority-based FIFO allocation

Currently, the Rx and Tx fifos are evenly allocated between the hardware
queues of the device. As more queues are instantiated, the fifo memory
needs to be able to be allocated based on queue priority. This allows for
higher priority queues to have more fifo memory than lower priority
queues. Prepare for this by modifying the current fifo calculation to
assign the fifo queue allocation in an array that is then used to program
the hardware.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Fix formatting of PCS register dump

Fix the length value used for the PCS register dump so that the full
value can be displayed.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'uid-routing'

Lorenzo Colitti says:

====================
net: inet: Support UID-based routing

This patchset adds support for per-UID routing. It allows the
administrator to configure rules such as:

  ip rule add uidrange 100-200 lookup 123

This functionality has been in use by all Android devices since
5.0. It is primarily used to impose per-app routing policies (on
Android, every app has its own UID) without having to resort to
rerouting packets in iptables, which breaks getsockname() and
MTU/MSS calculation, and generally disrupts end-to-end
connectivity.

This patch series is similar to the code currently used on
Android, but has better correctness and performance because
it stores the UID in the socket instead of calling sock_i_uid.
This avoids contention on sk->sk_callback_lock, and makes it
possible to correctly route a socket on which userspace has
called close(), for which sock_i_uid will return 0.

Changes from v1:
- Don't set the UID in sk_clone_lock, it's already set by
  sock_copy.
- For packets originated by kernel sockets, don't use the socket
  UID. This is the UID that created the namespace, but it might
  not be mapped in the namespace at all. Instead, use UID 0 in
  the namespace, which is less surprising and consistent with
  what happens in the root namespace.
- Fix UID routing of IPv4 and IPv6 SYN_RECV sockets.
- Fix UID routing of received IPv6 redirects.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: inet: Support UID-based routing in IP protocols.

- Use the UID in routing lookups made by protocol connect() and
  sendmsg() functions.
- Make sure that routing lookups triggered by incoming packets
  (e.g., Path MTU discovery) take the UID of the socket into
  account.
- For packets not associated with a userspace socket, (e.g., ping
  replies) use UID 0 inside the user namespace corresponding to
  the network namespace the socket belongs to. This allows
  all namespaces to apply routing and iptables rules to
  kernel-originated traffic in that namespaces by matching UID 0.
  This is better than using the UID of the kernel socket that is
  sending the traffic, because the UID of kernel sockets created
  at namespace creation time (e.g., the per-processor ICMP and
  TCP sockets) is the UID of the user that created the socket,
  which might not be mapped in the namespace.

Tested: compiles allnoconfig, allyesconfig, allmodconfig
Tested: https://android-review.googlesource.com/253302
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: add UID to flows, rules, and routes

- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: Add a UID field to struct sock.

Protocol sockets (struct sock) don't have UIDs, but most of the
time, they map 1:1 to userspace sockets (struct socket) which do.

Various operations such as the iptables xt_owner match need
access to the "UID of a socket", and do so by following the
backpointer to the struct socket. This involves taking
sk_callback_lock and doesn't work when there is no socket
because userspace has already called close().

Simplify this by adding a sk_uid field to struct sock whose value
matches the UID of the corresponding struct socket. The semantics
are as follows:

1. Whenever sk_socket is non-null: sk_uid is the same as the UID
   in sk_socket, i.e., matches the return value of sock_i_uid.
   Specifically, the UID is set when userspace calls socket(),
   fchown(), or accept().
2. When sk_socket is NULL, sk_uid is defined as follows:
   - For a socket that no longer has a sk_socket because
     userspace has called close(): the previous UID.
   - For a cloned socket (e.g., an incoming connection that is
     established but on which userspace has not yet called
     accept): the UID of the socket it was cloned from.
   - For a socket that has never had an sk_socket: UID 0 inside
     the user namespace corresponding to the network namespace
     the socket belongs to.

Kernel sockets created by sock_create_kern are a special case
of #1 and sk_uid is the user that created them. For kernel
sockets created at network namespace creation time, such as the
per-processor ICMP and TCP sockets, this is the user that created
the network namespace.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-mv88e6xxx-port-operation-refine'

Vivien Didelot says:

====================
net: dsa: mv88e6xxx: refine port operations

The Marvell chips have one internal SMI device per port, containing a
set of registers used to configure a port's link, STP state, default
VLAN or addresses database, etc.

This patchset creates port files to implement the port operations as
described in datasheets, and extend the chip ops structure with them.

Patches 1 to 6 implement accessors for port's STP state, port based VLAN
map, default FID, default VID, and 802.1Q mode.

Patches 7 to 11 implement the port's MAC setup of link state, duplex
mode, RGMII delay and speed, all accessed through port's register 0x01.

The new port's MAC setup code is used to re-implement the adjust_link
code and correctly force the link down before changing any of the MAC
settings, as requested by the datasheets.

The port's MAC accessors use values compatible with struct phy_device
(e.g. DUPLEX_FULL) and extend them when needed (e.g. SPEED_MAX).

Changes in v2:

- Strictly use new _UNFORCED values instead of re-using _UNKNOWN ones.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: setup port's MAC

Now that we have setters to configure the port's MAC, use them to
refactor the port setup and adjust_link code.

Note that port's MAC speed, duplex or RGMII delay must not be changed
unless the port's link is forced down. So wrap all that in a
mv88e6xxx_port_setup_mac function.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port's MAC speed setter

While the two bits for link, duplex or RGMII delays are used the same
way on chips supporting the said feature, the two bits for speed have
different meaning for most of the chips out there.

Speed value is stored in bits 1:0, 0x3 means unforce (normal detection).

Some chips reuse values for alternative speeds when bit 12 is set.

Newer chips with speed > 1Gbps reuse value 0x3 thus need a new bit 13.

Here are the values to write in register 0x1 to (un)force speed:

    | Speed   | 88E6065 | 88E6185 | 88E6352 | 88E6390 | 88E6390X |
    | ------- | ------- | ------- | ------- | ------- | -------- |
    | 10      | 0x0000  | 0x0000  | 0x0000  | 0x2000  | 0x2000   |
    | 100     | 0x0001  | 0x0001  | 0x0001  | 0x2001  | 0x2001   |
    | 200     | 0x0002  | NA      | 0x1001  | 0x3001  | 0x3001   |
    | 1000    | NA      | 0x0002  | 0x0002  | 0x2002  | 0x2002   |
    | 2500    | NA      | NA      | NA      | 0x3003  | 0x3003   |
    | 10000   | NA      | NA      | NA      | NA      | 0x2003   |
    | unforce | 0x0003  | 0x0003  | 0x0003  | 0x0000  | 0x0000   |

This patch implements a generic mv88e6xxx_port_set_speed() function used
by chip-specific wrappers to filter supported ports and speeds.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port's RGMII delay setter

Some chips such as 88E6352 and 88E6390 can be programmed to add delays
to RXCLK for IND inputs or to GTXCLK for OUTD outputs when port is in
RGMII mode.

Add a port function to program such delays according to the provided PHY
interface mode.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port duplex setter

Similarly to port's link, add setter to force port's half duplex, full
duplex or let normal duplex detection occurs.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port link setter

Most of the chips will have a port register control bits to force the
port's link up, down, or let normal link detection occurs.

Implement such operation to use it later when setting duplex, etc.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port 802.1Q mode setter

Add port functions to set the port 802.1Q mode.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port PVID accessors

Add port functions to access the ports default VID.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port FID accessors

Add functions to port files to access the ports default FID.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port vlan map setter

Add a port function to access the Port Based VLAN Map register.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port state setter

Add the port STP state setter to the port files.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add port files

The Marvell switches contains one internal SMI device per port, called
"Port Registers". Depending on the model, the addresses of these devices
start from 0x0, 0x8 or 0x10.

Start moving Port Registers specific code to their own files.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: cls_flower: Support matching on SCTP ports

Support matching on SCTP ports in the same way that matching
on TCP and UDP ports is already supported.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 ip_proto sctp dst_port 80 \
action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: pci: Fix the FW ready mask length

The system-status register is actually 16-bit wide and not 8 bit-wide.

Fixes: 233fa44bd67ae ("mlxsw: pci: Implement reset done check")
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ip-recvfragsize-cmsg'

Willem de Bruijn says:

====================
ip: add RECVFRAGSIZE cmsg

On IP datagrams and raw sockets, when packets arrive fragmented,
expose the largest received fragment size through a new cmsg.

Protocols implemented on top of these sockets may use this, for
instance, to inform peers to lower MSS on platforms that silently
allow send calls to exceed PMTU and cause fragmentation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: on reassembly, record frag_max_size

IP6CB and IPCB have a frag_max_size field. In IPv6 this field is
filled in when packets are reassembled by the connection tracking
code. Also fill in when reassembling in the input path, to expose
it through cmsg IPV6_RECVFRAGSIZE in all cases.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: add IPV6_RECVFRAGSIZE cmsg

When reading a datagram or raw packet that arrived fragmented, expose
the maximum fragment size if recorded to allow applications to
estimate receive path MTU.

At this point, the field is only recorded when ipv6 connection
tracking is enabled. A follow-up patch will record this field also
in the ipv6 input path.

Tested using the test for IP_RECVFRAGSIZE plus

  ip netns exec to ip addr add dev veth1 fc07::1/64
  ip netns exec from ip addr add dev veth0 fc07::2/64

  ip netns exec to ./recv_cmsg_recvfragsize -6 -u -p 6000 &
  ip netns exec from nc -q 1 -u fc07::1 6000 < payload

Both with and without enabling connection tracking

  ip6tables -A INPUT -m state --state NEW -p udp -j LOG

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: add IP_RECVFRAGSIZE cmsg

The IP stack records the largest fragment of a reassembled packet
in IPCB(skb)->frag_max_size. When reading a datagram or raw packet
that arrived fragmented, expose the value to allow applications to
estimate receive path MTU.

Tested:
  Sent data over a veth pair of which the source has a small mtu.
  Sent data using netcat, received using a dedicated process.

  Verified that the cmsg IP_RECVFRAGSIZE is returned only when
  data arrives fragmented, and in that cases matches the veth mtu.

    ip link add veth0 type veth peer name veth1

    ip netns add from
    ip netns add to

    ip link set dev veth1 netns to
    ip netns exec to ip addr add dev veth1 192.168.10.1/24
    ip netns exec to ip link set dev veth1 up

    ip link set dev veth0 netns from
    ip netns exec from ip addr add dev veth0 192.168.10.2/24
    ip netns exec from ip link set dev veth0 up
    ip netns exec from ip link set dev veth0 mtu 1300
    ip netns exec from ethtool -K veth0 ufo off

    dd if=/dev/zero bs=1 count=1400 2>/dev/null > payload

    ip netns exec to ./recv_cmsg_recvfragsize -4 -u -p 6000 &
    ip netns exec from nc -q 1 -u 192.168.10.1 6000 < payload

  using github.com/wdebruij/kerneltools/blob/master/tests/recvfragsize.c

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-OXNAS'

Neil Armstrong says:

====================
net: stmmac: Add OXNAS DWMAC Glue

This patchset add support for the Sysnopsys DWMAC Gigabit Ethernet
controller Glue layer of the Oxford Semiconductor OX820 SoC.

Changes since v2 at http://lkml.kernel.org/r/20161031105345.16711-1-narmstrong@baylibre.com :
- Disable/Unprepare clock if regmap read fails in oxnas_dwmac_init

Changes since v1 at https://patchwork.kernel.org/patch/9388231/ :
- Split dt-bindings in a separate patch
- Add IP version in the dt-bindings compatible
- Check return of clk_prepare_enable()
- use get_stmmac_bsp_priv() helper
- hardwire setup values in oxnas_dwmac_init()

Changes since RFC at https://patchwork.kernel.org/patch/9387257 :
- Drop init/exit callbacks
- Implement proper remove and PM callback
- Call init from probe
- Disable/Unprepare clock if stmmac probe fails
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: Add OXNAS DWMAC Bindings

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: Add OXNAS Glue Driver

Add Synopsys Designware MAC Glue layer for the Oxford Semiconductor OX820.

Acked-by: Joachim Eastwood <manabian@gmail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'diag-raw-fixes'

Cyrill Gorcunov says:

====================
net: Fixes for raw diag sockets handling

Hi! Here are a few fixes for raw-diag sockets handling: missing
sock_put call and jump for exiting from nested cycle. I made
patches for iproute2 as well so will send them out soon.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ip, raw_diag -- Use jump for exiting from nested loop

I managed to miss that sk_for_each is called under "for"
cycle so need to use goto here to return matching socket.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ip, raw_diag -- Fix socket leaking for destroy request

In raw_diag_destroy the helper raw_sock_get returns
with sock_hold call, so we have to put it then.

CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enic: set skb->hash type properly

Driver sets the skb l4/l3 hash based on NIC_CFG_RSS_HASH_TYPE_*,
which is bit mask. This is wrong. Hw actually provides us enum.
Use CQ_ENET_RQ_DESC_RSS_TYPE_* to set l3 and l4 hash type.

Fixes: bf751ba802fe ("driver/net: enic: record q_number and rss_hash for skb")
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: 3com: typhoon: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: David Dillow <dave@thedillows.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ila: Fix crash caused by rhashtable changes

commit ca26893f05e86 ("rhashtable: Add rhlist interface")
added a field to rhashtable_iter so that length became 56 bytes
and would exceed the size of args in netlink_callback (which is
48 bytes). The netlink diag dump function already has been
allocating a iter structure and storing the pointed to that
in the args of netlink_callback. ila_xlat also uses
rhahstable_iter but is still putting that directly in
the arg block. Now since rhashtable_iter size is increased
we are overwriting beyond the structure. The next field
happens to be cb_mutex pointer in netlink_sock and hence the crash.

Fix is to alloc the rhashtable_iter and save it as pointer
in arg.

Tested:

  modprobe ila
  ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
  ./ip ila list  # NO crash now

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnect

While being preparing patches for killing raw sockets via
diag netlink interface I noticed that my runs are stuck:

| [root@pcs7 ~]# cat /proc/`pidof ss`/stack
| [<ffffffff816d1a76>] __lock_sock+0x80/0xc4
| [<ffffffff816d206a>] lock_sock_nested+0x47/0x95
| [<ffffffff8179ded6>] udp_disconnect+0x19/0x33
| [<ffffffff8179b517>] raw_abort+0x33/0x42
| [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52

which has not been the case before. I narrowed it down to the commit

| commit 286c72deabaa240b7eebbd99496ed3324d69f3c0
| Author: Eric Dumazet <edumazet@google.com>
| Date: Thu Oct 20 09:39:40 2016 -0700
|
| udp: must lock the socket in udp_disconnect()

where we start locking the socket for different reason.

So the raw_abort escaped the renaming and we have to
fix this typo using __udp_disconnect instead.

Fixes: 286c72deabaa ("udp: must lock the socket in udp_disconnect()")
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lan78xx: Use irq_domain for phy interrupt from USB Int. EP

To utilize phylib with interrupt fully than handling some of phy stuff in the MAC driver,
create irq_domain for USB interrupt EP of phy interrupt and
pass the irq number to phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.

Idea comes from drivers/gpio/gpio-dl2.c

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>