git.openwrt.org Git - openwrt/staging/blogic.git/log

nfp: devlink add support for getting eswitch mode

Add app callback for reporting eswitch mode. Non-SRIOV apps
should not implement this callback, nfp_app code will then
respond with -EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: store port/representator id in metadata_dst

Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phy-internal'

Florian Fainelli says:

====================
net: phy: Support "internal" PHY interface

This makes the "internal" phy-mode property generally available and
documented and this allows us to remove some custom parsing code
we had for bcmgenet and bcm_sf2 which both used that specific value.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: Remove special handling of "internal" phy-mode

The PHY library now supports an "internal" phy-mode, thus making our
custom parsing code now unnecessary.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: Remove special handling of "internal" phy-mode

The PHY library now supports an "internal" phy-mode, thus making our
custom parsing code now unnecessary.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Support "internal" PHY interface

Now that the Device Tree binding has been updated, update the PHY
library phy_interface_t and phy_modes to support the "internal" PHY
interface type.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: Add "internal" as a valid 'phy-mode' property

A number of Ethernet MACs have internal Ethernet PHYs and the internal
wiring makes it so that this knowledge needs to be available using the
standard 'phy-mode' property.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2017-06-23' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2017-06-23

This series provides some updates to the mlx5 core and netdevice drivers.

Three patches from Tariq, Introduces page reuse mechanism in non-Striding
RQ RX datapath, we allow the the RX descriptor to reuse its allocated page
as much as it could, until the page is fully consumed. RX page reuse
reduces the stress on page allocator and improves RX performance especially
with high speeds (100Gb/s).

Next four patches of the series from Or allows to offload tc flower matching
on ttl/hoplimit and header re-write of hoplimit.

The rest of the series from Yotam and Or enhances mlx5 to support FW flashing
through the mlxfw module, in a similar manner done by the mlxsw driver.
Currently, only ethtool based flashing is implemented, where both Eth and IB ports
are supported.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Use Firmware params to get buffer-group map

Buffer group mappings can be obtained using FW_PARAMs cmd for newer FW.

Since some of the bg_maps are obtained in atomic context, created another
t4_query_params_ns(), that wont sleep when awaiting mbox cmd completion.

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Update T6 Buffer Group and Channel Mappings

We were using t4_get_mps_bg_map() for both t4_get_port_stats()
to determine which MPS Buffer Groups to report statistics on for a given
Port, and also for t4_sge_alloc_rxq() to provide a TP Ingress Channel
Congestion Map. For T4/T5 these are actually the same values (because they
are ~somewhat~ related), but for T6 they should return different values
(T6 has Port 0 associated with MPS Buffer Group 0 (with MPS Buffer Group 1
silently cascading off) and Port 1 is associated with MPS Buffer Group 2
(with 3 cascading off)).

Based on the original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: return -EFAULT if copy_to_user() fails

The copy_to_user() function returns the number of bytes remaining but we
want to return -EFAULT here.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-06-23

1) Use memdup_user to spmlify xfrm_user_policy.
   From Geliang Tang.

2) Make xfrm_dev_register static to silence a sparse warning.
   From Wei Yongjun.

3) Use crypto_memneq to check the ICV in the AH protocol.
   From Sabrina Dubroca.

4) Remove some unused variables in esp6.
   From Stephen Hemminger.

5) Extend XFRM MIGRATE to allow to change the UDP encapsulation port.
   From Antony Antony.

6) Include the UDP encapsulation port to km_migrate announcements.
   From Antony Antony.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ena-new-features-and-improvements'

Netanel Belgazal says:

====================
net: update ena ethernet driver to version 1.2.0

This patchset contains some new features/improvements that were added
to the ENA driver to increase its robustness and are based on
experience of wide ENA deployment.

Change log:

V2:
* Remove patch that add inline to C-file static function (contradict coding style).
* Remove patch that moves MTU parameter validation in ena_change_mtu() instead of
using the network stack.
* Use upper_32_bits()/lower_32_bits() instead of casting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: update ena driver to version 1.2.0

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: update driver's rx drop statistics

rx drop counter is reported by the device in the keep-alive
event.
update the driver's counter with the device counter.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: use lower_32_bits()/upper_32_bits() to split dma address

In ena_com_mem_addr_set(), use the above functions to split dma address
to the lower 32 bits and the higher 16 bits.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: separate skb allocation to dedicated function

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: use napi_schedule_irqoff when possible

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: allow the driver to work with small number of msix vectors

Current driver tries to allocate msix vectors as the number of the
negotiated io queues. (with another msix vector for management).
If pci_alloc_irq_vectors() fails, the driver aborts the probe
and the ENA network device is never brought up.

With this patch, the driver's logic will reduce the number of IO
queues to the number of allocated msix vectors (minus one for management)
instead of failing probe().

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: add support for out of order rx buffers refill

ENA driver post Rx buffers through the Rx submission queue
for the ENA device to fill them with receive packets.
Each Rx buffer is marked with req_id in the Rx descriptor.

Newer ENA devices could consume the posted Rx buffer in out of order,
and as result the corresponding Rx completion queue will have Rx
completion descriptors with non contiguous req_id(s)

In this change the driver holds two rings.
The first ring (called free_rx_ids) is a mapping ring.
It holds all the unused request ids.
The values in this ring are from 0 to ring_size -1.

When the driver wants to allocate a new Rx buffer it uses the head of
free_rx_ids and uses it's value as the index for rx_buffer_info ring.
The req_id is also written to the Rx descriptor

Upon Rx completion,
The driver took the req_id from the completion descriptor and uses it
as index in rx_buffer_info.
The req_id is then return to the free_rx_ids ring.

This patch also adds statistics to inform when the driver receive out
of range or unused req_id.

Note:
free_rx_ids is only accessible from the napi handler, so no locking is
required

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: add reset reason for each device FLR

For each device reset, log to the device what is the cause
the reset occur.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: change sizeof() argument to be the type pointer

Instead of using:
memset(ptr, 0x0, sizeof(struct ...))
use:
memset(ptr, 0x0, sizeor(*ptr))

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: add hardware hints capability to the driver

With this patch, ENA device can update the ena driver about
the desired timeout values:
These values are part of the "hardware hints" which are transmitted
to the driver as Asynchronous event through ENA async
event notification queue.

In case the ENA device does not support this capability,
the driver will use its own default values.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ena: change return value for unsupported features unsupported return value

return -EOPNOTSUPP instead of -EPERM.

Signed-off-by: Netanel Belgazal <netanel@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix out-of-bounds access in ULP sysctl

KASAN reports out-of-bound access in proc_dostring() coming from
proc_tcp_available_ulp() because in case TCP ULP list is empty
the buffer allocated for the response will not have anything
printed into it. Set the first byte to zero to avoid strlen()
going out-of-bounds.

Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: possibly avoid extra masking for narrower load in verifier

Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
context fields") permits narrower load for certain ctx fields.
The commit however will already generate a masking even if
the prog-specific ctx conversion produces the result with
narrower size.

For example, for __sk_buff->protocol, the ctx conversion
loads the data into register with 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
set the result as either 0 or 1, essentially a byte.
The narrower 2-byte or 1-byte load should not generate masking.

To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to verifier, which indicates
the valid data width after perceived future conversion.
Based on this information, verifier is able to avoid
unnecessary marking.

Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: make some functions static

The functions dwmac4_dma_init_rx_chan, dwmac4_dma_init_tx_chan and
dwmac4_dma_init_channel do not need to be in global scope, so them
static.

Cleans up sparse warnings:
"symbol 'dwmac4_dma_init_rx_chan' was not declared. Should it be static?"
"symbol 'dwmac4_dma_init_tx_chan' was not declared. Should it be static?"
"symbol 'dwmac4_dma_init_channel' was not declared. Should it be static?"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xdp-offload-mode'

Jakub Kicinski says:

====================
xdp: offload mode

While we discuss the representors.. :)

This set adds XDP flag for forcing offload and a attachment mode
for reporting to user space that program has been offloaded.  The
nfp driver is modified to make use of the new flags, but also to
adhere to the DRV_MODE flag which should disable the HW offload.

The intended driver behaviour is:
            DRV mode   offload
no flags      yes     attempted
DRV_MODE      yes        no
HW_MODE      no         yes

Where 'yes' means required, and error will be returned if setup fails.
'Attempted' means the offload will only happen automatically if HW is
capable and offloading the program will cause no change in system
behaviour (e.g. maps don't have to bound).

Thanks to loading the program both to the driver and HW by default we
can fallback to the driver mode without disruption in case user replaces
the program with one which cannot be offloaded later.

Note that the NFP driver currently claims XDP offload support but
lacks most basic features like direct packet access.

Only change compared to the RFC is fixing the double bpf_prog_put()
which Daniel has spotted (patch 5).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: xdp: report if program is offloaded

Make use of just added XDP_ATTACHED_HW.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: add reporting of offload mode

Extend the XDP_ATTACHED_* values to include offloaded mode.
Let drivers report whether program is installed in the driver
or the HW by changing the prog_attached field from bool to
u8 (type of the netlink attribute).

Exploit the fact that the value of XDP_ATTACHED_DRV is 1,
therefore since all drivers currently assign the mode with
double negation:
mode = !!xdp_prog;
no drivers have to be modified.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: add support for XDP_FLAGS_HW_MODE

Respect the XDP_FLAGS_HW_MODE. When it's set install the program
on the NIC and skip enabling XDP in the driver.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: release the reference on offloaded programs

The xdp_prog member of the adapter's data path structure is used
for XDP in driver mode.  In case a XDP program is loaded with in
HW-only mode, we need to store it somewhere else.  Add a new XDP
prog pointer in the main structure and use that when we need to
know whether any XDP program is loaded, not only a driver mode
one.  Only release our reference on adapter free instead of
immediately after netdev unregister to allow offload to be disabled
first.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: don't offload XDP programs in DRV_MODE

DRV_MODE means that user space wants the program to be run in
the driver. Do not try to offload. Only offload if no mode
flags have been specified.

Remember what the mode is when the program is installed and refuse
new setup requests if there is already a program loaded in a
different mode. This should leave it open for us to implement
simultaneous loading of two programs - one in the drv path and
another to the NIC later.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: xdp: move driver XDP setup into a separate function

In preparation of XDP offload flags move the driver setup into
a function. Otherwise the number of conditions in one function
would make it slightly hard to follow. The offload handler may
now be called with NULL prog, even if no offload is currently
active, but that's fine, offload code can handle that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: add HW offload mode flag for installing programs

Add an installation-time flag for requesting that the program
be installed only if it can be offloaded to HW.

Internally new command for ndo_xdp is added, this way we avoid
putting checks into drivers since they all return -EINVAL on
an unknown command.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: pass XDP flags into install handlers

Pass XDP flags to the xdp ndo. This will allow drivers to look
at the mode flags and make decisions about offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: fix poll()

Michael reported an UDP breakage caused by the commit b65ac44674dd
("udp: try to avoid 2 cache miss on dequeue").
The function __first_packet_length() can update the checksum bits
of the pending skb, making the scratched area out-of-sync, and
setting skb->csum, if the skb was previously in need of checksum
validation.

On later recvmsg() for such skb, checksum validation will be
invoked again - due to the wrong udp_skb_csum_unnecessary()
value - and will fail, causing the valid skb to be dropped.

This change addresses the issue refreshing the scratch area in
__first_packet_length() after the possible checksum update.

Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp/v6: prefetch rmem_alloc in udp6_queue_rcv_skb()

very similar to commit dd99e425be23 ("udp: prefetch
rmem_alloc in udp_queue_rcv_skb()"), this allows saving a cache
miss when the BH is bottle-neck for UDP over ipv6 packet
processing, e.g. for small packets when a single RX NIC ingress
queue is in use.

Performances under flood when multiple NIC RX queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~8% performance improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-mvpp2-misc-improvements'

Thomas Petazzoni says:

====================
net: mvpp2: misc improvements

Here are a few patches making various small improvements/refactoring
in the mvpp2 driver. They are based on today's net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: remove mvpp2_pool_refill()

When all a function does is calling another function with the exact same
arguments, in the exact same order, you know it's time to remove said
function. Which is exactly what this commit does.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: remove unused mvpp2_bm_cookie_pool_set() function

This function is not used in the driver, remove it.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: add comments about smp_processor_id() usage

A previous commit modified a number of smp_processor_id() used in
migration-enabled contexts into get_cpu/put_cpu sections. However, a few
smp_processor_id() calls remain in the driver, and this commit adds
comments explaining why they can be kept.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-pci-Refactor-DMI-probing'

Jan Kiszka says:

====================
stmmac: pci: Refactor DMI probing

Some cleanups of the way we probe DMI platforms in the driver. Reduces
a bit of open-coding and makes the logic easier reusable for any
potential DMI platform != Quark.

Tested on IOT2000 and Galileo Gen2.

Changes in v5:
- fixed a remaining issue in patch 5
- dropped patch 6 for now
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Use dmi_system_id table for retrieving PHY addresses

Avoids reimplementation of DMI matching in stmmac_pci_find_phy_addr.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Select quark_pci_dmi_data from quark_default_data

No need to carry this reference in stmmac_pci_info - the Quark-specific
setup handler knows that it needs to use the Quark-specific DMI table.
This also allows to drop the stmmac_pci_info reference from the setup
handler parameter list.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Make stmmac_pci_find_phy_addr truly generic

Move the special case for the early Galileo firmware into
quark_default_setup. This allows to use stmmac_pci_find_phy_addr for
non-quark cases.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Use stmmac_pci_info for all devices

Make stmmac_default_data compatible with stmmac_pci_info.setup and use
an info structure for all devices. This allows to make the probing more
regular.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: pci: Make stmmac_pci_info structure constant

By removing the PCI device reference from the structure and passing it
as parameters to the interested functions, we can make quark_pci_info
const.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Fix the carrier state error when data path is off

When the VF NIC is opened, the synthetic NIC's carrier state is set to
off. This tells the host to transitions data path to the VF device. But
if startup script or user manipulates the admin state of the netvsc
device directly for example:
# ifconfig eth0 down
# ifconfig eth0 up
Then the carrier state of the synthetic NIC would be on, even though the
data path was still over the VF NIC. This patch sets the carrier state
of synthetic NIC with consideration of the related VF state.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: Remove unnecessary var link_state from struct netvsc_device_info

We simply use rndis_device->link_state in the netdev_dbg. The variable,
link_state from struct netvsc_device_info, is not used anywhere else.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: fix a build problem

tracex5_kern.c build failed with the following error message:
../samples/bpf/tracex5_kern.c:12:10: fatal error: 'syscall_nrs.h' file not found
#include "syscall_nrs.h"
The generated file syscall_nrs.h is put in build/samples/bpf directory,
but this directory is not in include path, hence build failed.

The fix is to add $(obj) into the clang compilation path.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rds-tcp-fixes'

Sowmini Varadhan says:

====================
rds: tcp: fixes

Patch1 is a bug fix for correct reconnect when a connection
is restarted. Patch 2 accelerates cleanup by setting linger
to 1 and sending a RST to the peer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rds: tcp: set linger to 1 when unloading a rds-tcp

If we are unloading the rds_tcp module, we can set linger to 1
and drop pending packets to accelerate reconnect. The peer will
end up resetting the connection based on new generation numbers
of the new incarnation, so hanging on to unsent TCP packets via
linger is mostly pointless in this case.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rds: tcp: send handshake ping-probe from passive endpoint

The RDS handshake ping probe added by commit 5916e2c1554f
("RDS: TCP: Enable multipath RDS for TCP") is sent from rds_sendmsg()
before the first data packet is sent to a peer. If the conversation
is not bidirectional (i.e., one side is always passive and never
invokes rds_sendmsg()) and the passive side restarts its rds_tcp
module, a new HS ping probe needs to be sent, so that the number
of paths can be re-established.

This patch achieves that by sending a HS ping probe from
rds_tcp_accept_one() when c_npaths is 0 (i.e., we have not done
a handshake probe with this peer yet).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Correct return code checking for ibmvnic_init during probe

The update to ibmvnic_init to allow an EAGAIN return code broke
the calling of ibmvnic_init from ibmvnic_probe. The code now
will return from this point in the probe routine if anything
other than EAGAIN is returned. The check should be to see if rc
is non-zero and not equal to EAGAIN.

Without this fix, the vNIC driver can return 0 (success) from
its probe routine due to ibmvnic_init returning zero, but before
completing the probe process and registering with the netdev layer.

Fixes: 6a2fb0e99f9c (ibmvnic: driver initialization for kdump/kexec)
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ibmvnic-Correct-long-term-mapped-buffer-error-handling'

Thomas Falcon says:

====================
ibmvnic: Correct long-term-mapped buffer error handling

This patch set fixes the error-handling of long-term-mapped buffers
during adapter initialization and reset. The first patch fixes a bug
in an incorrectly defined descriptor that was keeping the return
codes from the VIO server from being properly checked. The second patch
fixes and cleans up the error-handling implementation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Fix error handling when registering long-term-mapped buffers

The patch stores the return code of the REQUEST_MAP_RSP sub-CRQ command
in the private data structure, where it can be later checked during
device open or a reset.

In the case of a reset, the mapping request to the vNIC Server may fail,
especially in the case of a partition migration. The driver attempts to
handle this by re-allocating the buffer and re-sending the mapping request.

The original error handling implementation was removed. The separate
function handling the REQUEST_MAP response message was also removed,
since it is now simple enough to be handled in the ibmvnic_handle_crq
function.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ibmvnic: Fix incorrectly defined ibmvnic_request_map_rsp structure

This reserved area should be eight bytes in length instead of four.
As a result, the return codes in the REQUEST_MAP_RSP descriptors
were not being properly handled.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Add a tcp_filter hook before handle ack packet

Currently in both ipv4 and ipv6 code path, the ack packet received when
sk at TCP_NEW_SYN_RECV state is not filtered by socket filter or cgroup
filter since it is handled from tcp_child_process and never reaches the
tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hooks
here can make sure all the ingress tcp packet can be correctly filtered.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: smsc: fix buffer overflow in memcpy

The memcpy annotation triggers for a fixed-length buffer copy:

In file included from /git/arm-soc/arch/arm64/include/asm/processor.h:30:0,
                 from /git/arm-soc/arch/arm64/include/asm/spinlock.h:21,
                 from /git/arm-soc/include/linux/spinlock.h:87,
                 from /git/arm-soc/include/linux/seqlock.h:35,
                 from /git/arm-soc/include/linux/time.h:5,
                 from /git/arm-soc/include/linux/stat.h:21,
                 from /git/arm-soc/include/linux/module.h:10,
                 from /git/arm-soc/drivers/net/phy/smsc.c:20:
In function 'memcpy',
    inlined from 'smsc_get_strings' at /git/arm-soc/drivers/net/phy/smsc.c:166:3:
/git/arm-soc/include/linux/string.h:309:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter

Using strncpy instead of memcpy should do the right thing here.

Fixes: 030a89028db0 ("net: phy: smsc: Implement PHY statistics")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Use device ID defines

Use Mellanox device ID definitions in the driver's mlx5 ID table so tools
such as 'grep' and 'cscope' can be used to help find correlated material
(such as INTx Masking quirks: d76d2fe05fd PCI: Convert Mellanox broken
INTx quirks to be for listed devices only).

No functional change intended.

Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: stop using huge static buffer, save 4096k in .data

Only compile-tested - I don't have the hardware.

>From code inspection, octeon_pci_write_core_mem() appears to be safe wrt
unaligned source. In any case, u8 fbuf[] was not guaranteed to be aligned
anyway.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Felix Manlunas <felix.manlunas@cavium.com>
CC: Prasad Kanneganti <prasad.kanneganti@cavium.com>
CC: Derek Chickles <derek.chickles@cavium.com>
CC: David Miller <davem@davemloft.net>
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Fix offset of hca cap reserved field

The offending commit pushed fwd the field by two bits but
didn't increment the offset, fix that. Currently, no damage
was done b/c this is just a field name, but lets have it right.

Fixes: f32f5bd2eb7e ('net/mlx5: Configure cache line size for start and end padding')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: IPoIB, Support the flash device ethtool callback

This callback further invokes the mlxfw module to flash the new
firmware file to the device.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Support the flash device ethtool callback

This callback further invokes the mlxfw module to flash the new
firmware file to the device.

As the firmware flash process takes about 20 seconds and ethtool
takes the rtnl lock during the flash_device callback, we release
the rtnl lock at the beginning of the flash process and take it
again before leaving the callback.

This way, rtnl is not held during the process. To make sure the
device does not get deleted while being flashed, we take a
reference to it before releasing rtnl lock.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add mlxfw callbacks

Add mlx5 implementation for the ones defined by the mlxfw
shared module to be used while flashing the device firmware.

The callbacks do their job through the MCQI, MCC and MCDA registers.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add helper functions to set/query MCC/MCDA/MCQI registers

To be used by the mlx5 callbacks exposed to the mlxfw module.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Enhance MCAM reg to allow query on access reg support

Enhance MCAM to allow the driver to query which access regs are
supported. For now, expose the regs needed for FW flashing.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Add MCC (Management Component Control) register definitions

MCC (Management Component Control) allows to control a firmware
component update.

MCDA (Management Component Data Access) allows to read and write
a firmware component.

MCQI (Management Component Query Information) allows to query
information about firmware components.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

mlxfw: Make the module selectable

There are upcoming NIC (mlx5) use-cases where people want to avoid
building the mlxfw module, allow for that. The mlxsw module is
untouched and keeps selecting mlxfw.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Add header re-write offloading of IPv6 hop-limit

For environments where flow-based ipv6 router is offloaded.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Use macro for TC header re-write offload field mapping

Use a macro for the static mapping between the enumeration of field
supported by the firmware for header re-write to the corresponding
network header field. This improves the readability of the code and
doesn't change any functionality.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Offload TC matching on ip ttl

Enable offloading of TC matching on ip ttl / hop-limit

As matching on ttl is supported only by newer HW brands (ConnectX-5),
we should do capability check before attempting to offload that.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Relocate the TC match on ip tos offload code section

The code section for offloading matches on ip tos (L3) should come
before and not after the one that deals with tcp/udp (L4) matches.

Otherwise, we might come up with wrong min-inline requirement, when
one attempts to match on both L3 and L4.

Fixes: fd7da28b280d ('net/mlx5e: Offload TC matching on ip tos / traffic-class')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Introduce RX Page-Reuse

Introduce a Page-Reuse mechanism in non-Striding RQ RX datapath.

A WQE (RX descriptor) buffer is a page, that in most cases was fully
wasted on a packet that is much smaller, requiring a new page for
the next round.

In this patch, we implement a page-reuse mechanism, that resembles a
`SW Striding RQ`.
We allow the WQE to reuse its allocated page as much as it could,
until the page is fully consumed.  In each round, the WQE is capable
of receiving packet of maximal size (MTU). Yet, upon the reception of
a packet, the WQE knows the actual packet size, and consumes the exact
amount of memory needed to build a linear SKB. Then, it updates the
buffer pointer within the page accordingly, for the next round.

Feature is mutually exclusive with XDP (packet-per-page)
and LRO (session size is a power of two, needs unused page).

Performance tests:
iperf tcp tests show huge gain:

--------------------------------------------
num streams | BW before | BW after | ratio |
          1 |      22.2 |     30.9 | 1.39x |
          8 |      64.2 |     93.6 | 1.46x |
         64 |      56.7 |     91.4 | 1.61x |
--------------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Enhance RX SKB headroom logic

In the RX memory scheme of non Striding RQ, we use linear SKBs.
Keeping NET_IP_ALIGN in headroom can improve performance on some archs.
In addition, take this headroom into account when calculating the
LRO WQE size.

These are not needed in Striding RQ as they're done implicitly
within the non-linear SKB allocation.

Fixes: 1bfecfca565c ("net/mlx5e: Build RX SKB on demand")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Build SKB with exact frag_size

Build the SKB over the receive packet instead of the
whole page. Getting the SKB's linear data and shared_info
closer improves locality.
In addition, this opens up the possibility to make use of
other parts of the page in the downstream page-reuse patch.

Fixes: 1bfecfca565c ("net/mlx5e: Build RX SKB on demand")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Merge git://git./linux/kernel/git/davem/net

Two entries being added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapping with the dst gc removal
in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix refcounting wrt timers which hold onto inet6 address objects,
    from Xin Long.

2) Fix an ancient bug in wireless wext ioctls, from Johannes Berg.

3) Firmware handling fixes in brcm80211 driver, from Arend Van Spriel.

4) Several mlx5 driver fixes (firmware readiness, timestamp cap
    reporting, devlink command validity checking, tc offloading, etc.)
    From Eli Cohen, Maor Dickman, Chris Mi, and Or Gerlitz.

5) Fix dst leak in IP/IP6 tunnels, from Haishuang Yan.

6) Fix dst refcount bug in decnet, from Wei Wang.

7) Netdev can be double freed in register_vlan_device(). Fix from Gao
    Feng.

8) Don't allow object to be destroyed while it is being dumped in SCTP,
    from Xin Long.

9) Fix dpaa_eth build when modular, from Madalin Bucur.

10) Fix throw route leaks, from Serhey Popovych.

11) IFLA_GROUP missing from if_nlmsg_size() and ifla_policy[] table,
    also from Serhey Popovych.

12) Fix premature TX SKB free in stmmac, from Niklas Cassel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits)
  igmp: add a missing spin_lock_init()
  net: stmmac: free an skb first when there are no longer any descriptors using it
  sfc: remove duplicate up_write on VF filter_sem
  rtnetlink: add IFLA_GROUP to ifla_policy
  ipv6: Do not leak throw route references
  dt-bindings: net: sms911x: Add missing optional VDD regulators
  dpaa_eth: reuse the dma_ops provided by the FMan MAC device
  fsl/fman: propagate dma_ops
  net/core: remove explicit do_softirq() from busy_poll_stop()
  fib_rules: Resolve goto rules target on delete
  sctp: ensure ep is not destroyed before doing the dump
  net/hns:bugfix of ethtool -t phy self_test
  net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
  cxgb4: notify uP to route ctrlq compl to rdma rspq
  ip6_tunnel: Correct tos value in collect_md mode
  decnet: always not take dst->__refcnt when inserting dst into hash table
  ip6_tunnel: fix potential issue in __ip6_tnl_rcv
  ip_tunnel: fix potential issue in ip_tunnel_rcv
  brcmfmac: fix uninitialized warning in brcmf_usb_probe_phase2()
  net/mlx5e: Avoid doing a cleanup call if the profile doesn't have it
  ...

Merge branch 'qed-File-split-and-rename-towards-iWARP-support'

Michal Kalderon says:

====================
qed*: File split and rename towards iWARP support

This patch series makes a few more infrastructure changes towards adding
iWARP support. Hopefully this is the last infrastructure change
prior to the iWARP RFC.

Patch #1-3 take care of taking all the common iWARP/RoCE code out of
qed_roce.[ch] and placing it in qed_rdma.[ch]
Patch #4 renames the roce interface file as it is common for
RoCE and iWARP. This patch touches qedr as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qed*: Rename qed_roce_if.h to qed_rdma_if.h

Rename the qed_roce_if file to qed_rdma_if as it
represents a common interface for RoCE and iWARP.

this commit affects RDMA/qedr as well.

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Split rdma content between qed_rdma and qed_roce

This patch places common iWARP / RoCE code in qed_rdma
and roce specific code in qed_roce

There is one new function ( qed_roce_setup ) added, the rest
of the patch removes content from the files and removes some
static definitions.

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Duplicate qed_roce.[ch] to qed_rdma.[ch]

This patch adds files that will contain common code for RoCE/iWARP.
The files are currently identical to qed_roce.c / qed_roce.h and
intentionally not added to the makefile. The next patch in the series
will modify the files so that roce specific code is left in qed_roce
and common roce/iwarp code will be placed in qed_rdma

This patch is the result of a simple
cp qed_rdma.c qed_roce.c
cp qed_rdma.h qed_roce.h

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Cleanup qed_roce before duplicating it

The next patch in the series will duplicate qed_roce as part
of code preprations for iWARP support. Do some cleanup before
duplicating

Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'pinctrl-v4.12-3' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull more pin control fixes from Linus Walleij:
"Some late arriving fixes. I should have sent earlier, just swamped
  with work as usual. Thomas patch makes AMD systems usable despite
  firmware bugs so it is fairly important.

   - Make the AMD driver use a regular interrupt rather than a chained
     one, so the system does not lock up.

   - Fix a function call error deep inside the STM32 driver"

* tag 'pinctrl-v4.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: stm32: Fix bad function call
  pinctrl/amd: Use regular interrupt instead of chained

bpf: expose prog id for cls_bpf and act_bpf

In order to be able to retrieve the attached programs from cls_bpf
and act_bpf, we need to expose the prog ids via netlink so that
an application can later on get an fd based on the id through the
BPF_PROG_GET_FD_BY_ID command, and dump related prog info via
BPF_OBJ_GET_INFO_BY_FD command for bpf(2).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid

Pull HID fixes from Jiri Kosina:

- revert of a commit to magicmouse driver that regressess certain
   devices, from Daniel Stone

- quirk for a specific Dell mouse, from Sebastian Parschauer

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
  Revert "HID: magicmouse: Set multi-touch keybits for Magic Mouse"
  HID: Add quirk for Dell PIXART OEM mouse

Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/livepatching

Pull livepatching fix from Jiri Kosina:
"Fix the way how livepatches are being stacked with respect to RCU,
from Petr Mladek"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Fix stacking of patches with respect to RCU

Merge branch 'ufs-fixes' of git://git./linux/kernel/git/viro/vfs

Pull more ufs fixes from Al Viro:
"More UFS fixes, unfortunately including build regression fix for the
  64-bit s_dsize commit. Fixed in this pile:

   - trivial bug in signedness of 32bit timestamps on ufs1

   - ESTALE instead of ufs_error() when doing open-by-fhandle on
     something deleted

   - build regression on 32bit in ufs_new_fragments() - calculating that
     many percents of u64 pulls libgcc stuff on some of those. Mea
     culpa.

   - fix hysteresis loop broken by typo in 2.4.14.7 (right next to the
     location of previous bug).

   - fix the insane limits of said hysteresis loop on filesystems with
     very low percentage of reserved blocks. If it's 5% or less, just
     use the OPTSPACE policy.

   - calculate those limits once and mount time.

  This tree does pass xfstests clean (both ufs1 and ufs2) and it _does_
  survive cross-builds.

  Again, my apologies for missing that, especially since I have noticed
  a related percentage-of-64bit issue in earlier patches (when dealing
  with amount of reserved blocks). Self-LART applied..."

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs: fix the logics for tail relocation
  ufs_iget(): fail with -ESTALE on deleted inode
  fix signedness of timestamps on ufs1

Allow stack to grow up to address space limit

Fix expand_upwards() on architectures with an upward-growing stack (parisc,
metag and partly IA-64) to allow the stack to reliably grow exactly up to
the address space limit given by TASK_SIZE.

Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix new crash in unmapped_area_topdown()

Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
mmap testing.  That's the VM_BUG_ON(gap_end < gap_start) at the
end of unmapped_area_topdown().  Linus points out how MAP_FIXED
(which does not have to respect our stack guard gap intentions)
could result in gap_end below gap_start there.  Fix that, and
the similar case in its alternative, unmapped_area().

Cc: stable@vger.kernel.org
Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Debugged-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sock: avoid dirtying incoming_cpu if not needed

for connected socket, the incoming_cpu field in the sock struct
is not going to change frequently, but we are setting it
unconditionally for each packet.

Since sk_incoming_cpu and sk_flags share the same cacheline,
and the latter is access by udp_recvmsg(), this cause a cache
miss for each packet for UDP connected socket.

With this patch, we set the incoming cpu field only when the
ingress cpu really changes.

This gives a small but measurable performance improvement for
connected UDP socket.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce SO_PEERGROUPS getsockopt

This adds the new getsockopt(2) option SO_PEERGROUPS on SOL_SOCKET to
retrieve the auxiliary groups of the remote peer. It is designed to
naturally extend SO_PEERCRED. That is, the underlying data is from the
same credentials. Regarding its syntax, it is based on SO_PEERSEC. That
is, if the provided buffer is too small, ERANGE is returned and @optlen
is updated. Otherwise, the information is copied, @optlen is set to the
actual size, and 0 is returned.

While SO_PEERCRED (and thus `struct ucred') already returns the primary
group, it lacks the auxiliary group vector. However, nearly all access
controls (including kernel side VFS and SYSVIPC, but also user-space
polkit, DBus, ...) consider the entire set of groups, rather than just
the primary group. But this is currently not possible with pure
SO_PEERCRED. Instead, user-space has to work around this and query the
system database for the auxiliary groups of a UID retrieved via
SO_PEERCRED.

Unfortunately, there is no race-free way to query the auxiliary groups
of the PID/UID retrieved via SO_PEERCRED. Hence, the current user-space
solution is to use getgrouplist(3p), which itself falls back to NSS and
whatever is configured in nsswitch.conf(3). This effectively checks
which groups we *would* assign to the user if it logged in *now*. On
normal systems it is as easy as reading /etc/group, but with NSS it can
resort to quering network databases (eg., LDAP), using IPC or network
communication.

Long story short: Whenever we want to use auxiliary groups for access
checks on IPC, we need further IPC to talk to the user/group databases,
rather than just relying on SO_PEERCRED and the incoming socket. This
is unfortunate, and might even result in dead-locks if the database
query uses the same IPC as the original request.

So far, those recursions / dead-locks have been avoided by using
primitive IPC for all crucial NSS modules. However, we want to avoid
re-inventing the wheel for each NSS module that might be involved in
user/group queries. Hence, we would preferably make DBus (and other IPC
that supports access-management based on groups) work without resorting
to the user/group database. This new SO_PEERGROUPS ioctl would allow us
to make dbus-daemon work without ever calling into NSS.

Cc: Michal Sekletar <msekleta@redhat.com>
Cc: Simon McVittie <simon.mcvittie@collabora.co.uk>
Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: prefetch rmem_alloc in udp_queue_rcv_skb()

On UDP packets processing, if the BH is the bottle-neck, it
always sees a cache miss while updating rmem_alloc; try to
avoid it prefetching the value as soon as we have the socket
available.

Performances under flood with multiple NIC rx queues used are
unaffected, but when a single NIC rx queue is in use, this
gives ~10% performance improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Fix compilation without QED_RDMA

When CONFIG_QED_RDMA isn't defined, we'd hit the following:

/include/linux/qed/qede_rdma.h:84:19:
warning: ‘qede_rdma_dev_add’ used but never defined [enabled by default]
static inline int qede_rdma_dev_add(struct qede_dev *dev);

Fixes: bbfcd1e8e167 ("qed*: Set rdma generic functions prefix")
Signed-off-by: Chad Dupuis <chad.dupuis@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: correct the definition

Replace VLAN_HLEN and CRC_SIZE with ETH_FCS_LEN.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-06-20

This series contains updates to i40e and i40evf only.

Björn adds additional XDP support for i40e, by adding pass and drop actions
and XDP_TX action support.

Jake fixes a possible NULL pointer dereference in
i40evf_get_ethtool_stats() which could occur if the VF fails to recover
from a reset, and then a user requests statistics.  Changed the use of
dev_info() to dev_dbg() for vf_capability client routine so that the
standard log is not spammed with this information which "might" cause
administrators to worry.  Also added more code comments to help explain
why udp_port has be in host byte order and to avoid future changes which
may cause this to break.  Fixed the holding of the RTNL lock for the
entire reset routine, reduced the scope so that the reset function will
handle its own lock, so that we do not have to wrap every reference
to i40e_do_reset() with RTNL lock/unlock.

Alice updates flags related to firmware interactions for WoL and admin
queue command address with the correct value.

Sudheer makes a fix to ensure that the array is not accessed past the
size of the array.

Greg fixes the parsing of firmware 4.33 admin queue commmand "Get CEE
DCBX PER CFG" because the firmware now creates the oper_prio_tc nibbles
reversed from those in the CDD Priority Group sub-TLV.

Carolyn adds a check and message to let users know that when in MFP mode,
changing RSS hash input set is not supported.

Shannon makes the partition bandwidth control more generic since it is not
in just one form of multi-function partitioning (MFP).  Also fixes a bug
which was causing the firmware confusion in some reset sequences, when
we were disabling interrupts and we were clearing the whole register.
Instead we should only be clearing the CAUSE_ENA bit when disabling
interrupts.

Filip adds support for OEM firmware version, so that if a OEM specific
adapter is detected, ethtool reports the OEM product version in the
firmware version string instead of etrack id.

Alan fixes a bug where the driver was not correctly exiting overflow
promiscuous mode, which can happen if "too many" MAC filters are added,
putting the driver into overflow promiscuous mode, and the filters are
then removed.  The bug occurs because the conditional for toggling
promiscuous mode was only be executed when enabled and not when it was
disabled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipmr-ip6mr-add-Netlink-notifications-on-cache-reports'

Julien Gomes says:

====================
ipmr/ip6mr: add Netlink notifications on cache reports

Currently, all ipmr/ip6mr cache reports are sent through the
mroute/mroute6 socket only.
This forces the use of a single socket for mroute programming, cache
reports and, regarding ipmr, IGMP messages without Router Alert option
reception.

The present patches are aiming to send Netlink notifications in addition
to the existing igmpmsg/mrt6msg to give user programs a way to handle
cache reports in parallel with multiple sockets other than the
mroute/mroute6 socket.

Changes in v2:
- Changed attributes naming from {IPMRA,IP6MRA}_CACHEREPORTA_* to
  {IPMRA,IP6MRA}_CREPORT_*
- Improved packet data copy to handle non-linear packets in
  ipmr/ip6mr cache report Netlink notification creation
- Added two rtnetlink groups with restricted-binding
- Changed cache report notified groups from RTNL_{IPV4,IPV6}_MROUTE to
  the new restricted groups in ipmr/ip6mr

Changes in v3:
- Put message size calculation for {igmp,mrt6}msg_netlink_event in separate
  functions
- Increased vif id attributes size from u8 to u32
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ip6mr: add netlink notifications on mrt6msg cache reports

Add Netlink notifications on cache reports in ip6mr, in addition to the
existing mrt6msg sent to mroute6_sk.
Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV6_MROUTE_R.

MSGTYPE, MIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
same data as their equivalent fields in the mrt6msg header.
PKT attribute is the packet sent to mroute6_sk, without the added
mrt6msg header.

Suggested-by: Ryan Halbrook <halbrook@arista.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipmr: add netlink notifications on igmpmsg cache reports

Add Netlink notifications on cache reports in ipmr, in addition to the
existing igmpmsg sent to mroute_sk.
Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.

MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
same data as their equivalent fields in the igmpmsg header.
PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
header.

Suggested-by: Ryan Halbrook <halbrook@arista.com>
Signed-off-by: Julien Gomes <julien@arista.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>