git.openwrt.org Git - openwrt/staging/blogic.git/log

dpaa2-eth: Fix ndo_stop routine

In the current implementation, on interface down we disabled NAPI and
then manually drained any remaining ingress frames. This could lead
to a situation when, under heavy traffic, the data availability
notification for some of the channels would not get rearmed correctly.

Change the implementation such that we let all remaining ingress frames
be processed as usual and only disable NAPI once the hardware queues
are empty.

We also add a wait on the Tx side, to allow hardware time to process
all in-flight Tx frames before issueing the disable command.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

wan: dscc4: fix various indentation issues

There are some lines that have indentation issues, fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vxlan-FDB-veto'

Petr Machata says:

====================
vxlan: Allow vetoing FDB operations

mlxsw does not implement handling of the more advanced types of VXLAN
FDB entries. In order to provide visibility to users, it is important to
be able to reject such FDB entries, ideally with an explanation passed
in extended ack. This patch set implements this.

In patches #1-#4, vxlan is gradually transformed to support vetoing of
FDB entries added (or modified) through vxlan_fdb_update(), and the
default FDB entry added in __vxlan_dev_create().

Patches #5-#7 deal with vxlan_changelink(). The existing code recognizes
that vxlan_fdb_update() may fail, but doesn't attempt to keep things
intact if it does. These patches change the function in several steps to
gracefully handle vetoes (or other failures).

Then in patches #8-#11, extack arguments are added, respectively, to
ndo_fdb_add(), mlxsw's mlxsw_sp_nve_ops.fdb_replay, the functions that
connect to the VXLAN vetoing code, and call_switchdev_notifiers(). Note
that call_switchdev_blocking_notifiers() already does support extack.

Finally in patch #12, mlxsw is extended to add extack messages to
rejected FDB entries. In patch #13, the functionality is tested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Test veto of unsupported VXLAN FDBs

mlxsw doesn't implement offloading of all types of FDB entries that the
VXLAN driver supports. Test that such FDB entries are rejected. That
makes sure that the decision made by the existing validation code in
mlxsw propagates up the stack. It also exercises rollback functionality
in VXLAN, and tests that extack is returned.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Add extack messages to VXLAN FDB rejection

Annotate the rejections in mlxsw_sp_switchdev_vxlan_work_prepare() with
textual reasons.

Because this code ends up being invoked for FDB replay as well, drop the
default message from there, so that the more accurate error message is
not overwritten.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

switchdev: Add extack argument to call_switchdev_notifiers()

A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Add extack to switchdev operations

There are four sources of VXLAN switchdev notifier calls:

- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().

Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.

To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Add extack to mlxsw_sp_nve_ops.fdb_replay

A follow-up patch will extend vxlan_fdb_replay() with an extack
argument. Extend the fdb_replay callback in mlxsw likewise so that the
argument is ready for the vxlan conversion.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add extack argument to ndo_fdb_add()

Drivers may not be able to support certain FDB entries, and an error
code is insufficient to give clear hints as to the reasons of rejection.

In order to make it possible to communicate the rejection reason, extend
ndo_fdb_add() with an extack argument. Adapt the existing
implementations of ndo_fdb_add() to take the parameter (and ignore it).
Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: changelink: Delete remote after update

If a change in remote address prompts a change in a default FDB entry,
that change might be vetoed. If that happens, it would then be necessary
to reinstate the already-removed default FDB entry corresponding to the
previous remote address.

Instead, arrange to have the previous address removed only after the
FDB is successfully vetted.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: changelink: Postpone vxlan_config_apply()

When an FDB entry is vetoed, it is necessary to unroll the changes that
have already been done. To avoid having to unroll vxlan_config_apply(),
postpone the call after the point where the vetoing takes place. Since
the call can't fail, it doesn't necessitate any cleanups in the
preceding FDB update logic.

Correspondingly, move down the mod_timer() call as well.

References to *dst need to be replaced with references to conf.
Additionally, old_dst and old_age_interval are not necessary anymore,
and therefore drop them.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: changelink: Inline vxlan_dev_configure()

The changelink operation may cause change in remote address, and
therefore an FDB update, which can be vetoed. To properly handle
vetoing, vxlan_changelink() needs to be gradually updated.

In this patch simply replace vxlan_dev_configure() with the two
constituent calls.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Allow vetoing of FDB notifications

Change vxlan_fdb_switchdev_call_notifiers() to return the result from
calling switchdev notifiers. Propagate the error number up the stack.

In vxlan_fdb_update_existing() and vxlan_fdb_update_create() add
rollbacks to clean up the work that was done before the veto.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Have vxlan_fdb_replace() save original rdst value

To enable rollbacks after vetoed FDB updates, extend vxlan_fdb_replace()
to take an additional argument where it should store the original values
of a modified rdst. Update the sole caller.

The following patch will make use of the saved value.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Split vxlan_fdb_update() in two

In order to make it easier to implement rollbacks after FDB update
vetoing, separate the FDB update code to two parts: one that deals with
updates of existing FDB entries, and one that creates new entries.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Move up vxlan_fdb_free(), vxlan_fdb_destroy()

These functions will be needed for rollbacks of vetoed FDB entries. Move
them up so that they are visible at their intended point of use.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'improving-TCP-behavior-on-host-congestion'

Yuchung Cheng says:

====================
improving TCP behavior on host congestion

This patch set aims to improve how TCP handle local qdisc congestion
by simplifying the previous implementation. Previously when an
skb fails to (re)transmit due to local qdisc congestion or other
resource issue, TCP refrains from setting the skb timestamp or the
recovery starting time.

This design makes determining when to abort a stalling socket more
complicated, as the timestamps of these tranmission attempts were
missing. The stack needs to sort of infer when the original attempt
happens. A by-product is a socket may disregard the system timeout
limit (i.e. sysctl net.ipv4.tcp_retries2 or USER_TIMEOUT option),
and continue to retry until the transmission is successful.

In data-center environment when TCP RTO is small, this could cause
the socket to retry frequently for long during qdisc congestion.

The solution is to first unconditionally timestamp skb and recovery
attempt. Then retry more conservatively (twice a second) on local
qdisc congestion but abort the sockets according to the system limit.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: less aggressive window probing on local congestion

Previously when the sender fails to send (original) data packet or
window probes due to congestion in the local host (e.g. throttling
in qdisc), it'll retry within an RTO or two up to 500ms.

In low-RTT networks such as data-centers, RTO is often far below
the default minimum 200ms. Then local host congestion could trigger
a retry storm pouring gas to the fire. Worse yet, the probe counter
(icsk_probes_out) is not properly updated so the aggressive retry
may exceed the system limit (15 rounds) until the packet finally
slips through.

On such rare events, it's wise to retry more conservatively
(500ms) and update the stats properly to reflect these incidents
and follow the system limit. Note that this is consistent with
the behaviors when a keep-alive probe or RTO retry is dropped
due to local congestion.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: retry more conservatively on local congestion

Previously when the sender fails to retransmit a data packet on
timeout due to congestion in the local host (e.g. throttling in
qdisc), it'll retry within an RTO up to 500ms.

In low-RTT networks such as data-centers, RTO is often far
below the default minimum 200ms (and the cap 500ms). Then local
host congestion could trigger a retry storm pouring gas to the
fire. Worse yet, the retry counter (icsk_retransmits) is not
properly updated so the aggressive retry may exceed the system
limit (15 rounds) until the packet finally slips through.

On such rare events, it's wise to retry more conservatively (500ms)
and update the stats properly to reflect these incidents and follow
the system limit. Note that this is consistent with the behavior
when a keep-alive probe is dropped due to local congestion.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: simplify window probe aborting on USER_TIMEOUT

Previously we use the next unsent skb's timestamp to determine
when to abort a socket stalling on window probes. This no longer
works as skb timestamp reflects the last instead of the first
transmission.

Instead we can estimate how long the socket has been stalling
with the probe count and the exponential backoff behavior.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: create a helper to model exponential backoff

Create a helper to model TCP exponential backoff for the next patch.
This is pure refactor w no behavior change.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: properly track retry time on passive Fast Open

This patch addresses a corner issue on timeout behavior of a
passive Fast Open socket. A passive Fast Open server may write
and close the socket when it is re-trying SYN-ACK to complete
the handshake. After the handshake is completely, the server does
not properly stamp the recovery start time (tp->retrans_stamp is
0), and the socket may abort immediately on the very first FIN
timeout, instead of retying until it passes the system or user
specified limit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: always set retrans_stamp on recovery

Previously TCP socket's retrans_stamp is not set if the
retransmission has failed to send. As a result if a socket is
experiencing local issues to retransmit packets, determining when
to abort a socket is complicated w/o knowning the starting time of
the recovery since retrans_stamp may remain zero.

This complication causes sub-optimal behavior that TCP may use the
latest, instead of the first, retransmission time to compute the
elapsed time of a stalling connection due to local issues. Then TCP
may disrecard TCP retries settings and keep retrying until it finally
succeed: not a good idea when the local host is already strained.

The simple fix is to always timestamp the start of a recovery.
It's worth noting that retrans_stamp is also used to compare echo
timestamp values to detect spurious recovery. This patch does
not break that because retrans_stamp is still later than when the
original packet was sent.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: always timestamp on every skb transmission

Previously TCP skbs are not always timestamped if the transmission
failed due to memory or other local issues. This makes deciding
when to abort a socket tricky and complicated because the first
unacknowledged skb's timestamp may be 0 on TCP timeout.

The straight-forward fix is to always timestamp skb on every
transmission attempt. Also every skb retransmission needs to be
flagged properly to avoid RTT under-estimation. This can happen
upon receiving an ACK for the original packet and the a previous
(spurious) retransmission has failed.

It's worth noting that this reverts to the old time-stamping
style before commit 8c72c65b426b ("tcp: update skb->skb_mstamp more
carefully") which addresses a problem in computing the elapsed time
of a stalled window-probing socket. The problem will be addressed
differently in the next patches with a simpler approach.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: exit if nothing to retransmit on RTO timeout

Previously TCP only warns if its RTO timer fires and the
retransmission queue is empty, but it'll cause null pointer
reference later on. It's better to avoid such catastrophic failure
and simply exit with a warning.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: micrel: use phy_read_mmd and phy_write_mmd

This driver implements open-coded versions of phy_read_mmd() and
phy_write_mmd() for KSZ9031. That's not needed, let's use the
phylib functions directly.

This is compile-tested only because I have no such hardware.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

davicom: Annotate implicit fall through in dm9000_set_io

There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).

This commit removes the following warning:

  include/linux/device.h:1480:5: warning: this statement may fall through [-Wimplicit-fallthrough=]
  drivers/net/ethernet/davicom/dm9000.c:397:3: note: in expansion of macro 'dev_dbg'
  drivers/net/ethernet/davicom/dm9000.c:398:2: note: here

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv6/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE

The udp-tunnel setup allows binding sockets to a network device. Prefer
the new SO_BINDTOIFINDEX to avoid temporarily resolving the device-name
just to look it up in the ioctl again.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv4/udp_tunnel: prefer SO_BINDTOIFINDEX over SO_BINDTODEVICE

The udp-tunnel setup allows binding sockets to a network device. Prefer
the new SO_BINDTOIFINDEX to avoid temporarily resolving the device-name
just to look it up in the ioctl again.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: introduce SO_BINDTOIFINDEX sockopt

This introduces a new generic SOL_SOCKET-level socket option called
SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a
network interface index as argument, rather than the network interface
name.

User-space often refers to network-interfaces via their index, but has
to temporarily resolve it to a name for a call into SO_BINDTODEVICE.
This might pose problems when the network-device is renamed
asynchronously by other parts of the system. When this happens, the
SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong
device.

In most cases user-space only ever operates on devices which they
either manage themselves, or otherwise have a guarantee that the device
name will not change (e.g., devices that are UP cannot be renamed).
However, particularly in libraries this guarantee is non-obvious and it
would be nice if that race-condition would simply not exist. It would
make it easier for those libraries to operate even in situations where
the device-name might change under the hood.

A real use-case that we recently hit is trying to start the network
stack early in the initrd but make it survive into the real system.
Existing distributions rename network-interfaces during the transition
from initrd into the real system. This, obviously, cannot affect
devices that are up and running (unless you also consider moving them
between network-namespaces). However, the network manager now has to
make sure its management engine for dormant devices will not run in
parallel to these renames. Particularly, when you offload operations
like DHCP into separate processes, these might setup their sockets
early, and thus have to resolve the device-name possibly running into
this race-condition.

By avoiding a call to resolve the device-name, we no longer depend on
the name and can run network setup of dormant devices in parallel to
the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this
race.

Reviewed-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: Fix recvmsg() to be able to peek across multiple records

This fixes recvmsg() to be able to peek across multiple tls records.
Without this patch, the tls's selftests test case
'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
maintains a 'rx_list' to retain incoming skb carrying tls records. If a
tls record needs to be retained e.g. for peek case or for the case when
the buffer passed to recvmsg() has a length smaller than decrypted
record length, then it is added to 'rx_list'. Additionally, records are
added in 'rx_list' if the crypto operation runs in async mode. The
records are dequeued from 'rx_list' after the decrypted data is consumed
by copying into the buffer passed to recvmsg(). In case, the MSG_PEEK
flag is used in recvmsg(), then records are not consumed or removed
from the 'rx_list'.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-lantiq_gswip-probe-fixes-and-remove-cleanup'

Johan Hovold says:

====================
net: dsa: lantiq_gswip: probe fixes and remove cleanup

This series fix a few issues found through inspection when fixing up new
bad uses of of_find_compatible_node() that have crept in since 4.19.

Note that these have only been compile tested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lantiq_gswip: drop bogus drvdata check

The platform-device driver data is set on successful probe and will
never be NULL on remove (or we have much bigger problems).

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lantiq_gswip: fix OF child-node lookups

Use the new of_get_compatible_child() helper to look up child nodes to
avoid ever matching non-child nodes elsewhere in the tree.

Also fix up the related struct device_node leaks.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Cc: stable <stable@vger.kernel.org> # 4.20
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lantiq_gswip: fix use-after-free on failed probe

Make sure to disable and deregister the switch on late probe errors to
avoid use-after-free when the device-resource-managed switch is freed.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Cc: stable <stable@vger.kernel.org> # 4.20
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: extend MTD support for newer hardware

The X2 family of NICs (based on the SFC9250) have additional
MTD partitions for firmware and configuration. This includes
partitions that are read-only.

The NICs also have extended versions of the NVRAM interface,
allowing more detailed status information to be returned.

Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests/tls: Fix recv partial/large_buff test cases

TLS test cases recv_partial & recv_peek_large_buf_mult_recs expect to
receive a certain amount of data and then compare it against known
strings using memcmp. To prevent recvmsg() from returning lesser than
expected number of bytes (compared in memcmp), MSG_WAITALL needs to be
passed in recvmsg().

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: check return code when requesting PHY driver module

When requesting the PHY driver module fails we'll bind the genphy
driver later. This isn't obvious to the user and may cause, depending
on the PHY, different types of issues. Therefore check the return code
of request_module(). Note that we only check for failures in loading
the module, not whether a module exists for the respective PHY ID.

v2:
- add comment explaining what is checked and what is not
- return error from phy_device_create() if loading module fails

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tls: Make function tls_sw_do_sendpage static

Fixes the following sparse warning:

net/tls/tls_sw.c:1023:5: warning:
symbol 'tls_sw_do_sendpage' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/tls: remove unused function tls_sw_sendpage_locked

There are no in-tree callers.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Optimize sk_msg_clone() by data merge to end dst sg entry

Function sk_msg_clone has been modified to merge the data from source sg
entry to destination sg entry if the cloned data resides in same page
and is contiguous to the end entry of destination sk_msg. This improves
kernel tls throughput to the tune of 10%.

When the user space tls application calls sendmsg() with MSG_MORE, it leads
to calling sk_msg_clone() with new data being cloned placed continuous to
previously cloned data. Without this optimization, a new SG entry in
the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg()
gets used. This leads to exhaustion of sg entries in rec->msg_plaintext
even before a full 16K of allowable record data is accumulated. Hence we
lose oppurtunity to encrypt and send a full 16K record.

With this patch, the kernel tls can accumulate full 16K of record data
irrespective of the size of data passed in sendmsg() with MSG_MORE.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: Use struct_size() in devm_kzalloc()

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = devm_kzalloc(dev, sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = devm_kzalloc(dev, struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add helpers to determine if PHY driver is generic

We are already checking in phy_detach() that the PHY driver is of
generic kind (1G or 10G) and we are going to make use of that in the SFP
layer as well for 1000BaseT SFP modules, so expose helper functions to
return that information.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-Split-platform-data-to-header-file'

Florian Fainelli says:

====================
net: dsa: Split platform data to header file

This patch series decouples the DSA platform data structures from
net/dsa.h which was getting used for all sorts of DSA related
structures.

It would probably make sense for this series to go via David's net-next
tree to avoid conflicts on the ARM part, since we cannot obviously
include a header that does not yet exist.

No functional changes intended.
====================

Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Include platform_data header file

b53 and mv88e6xxx support passing platform_data, and now that we have
split the platform_data portion from the main net/dsa.h header file,
include only the relevant parts.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: orion5x: Include platform_data/dsa.h

Now that we have split the DSA platform data structures from the main
net/dsa.h header file, include only the relevant header file.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Split platform data to header file

Instead of having net/dsa.h contain both the internal switch tree/driver
structures, split the relevant platform_data parts into
include/linux/platform_data/dsa.h and make that header be included by
net/dsa.h in order not to break any setup. A subsequent set of patches
will update code including net/dsa.h to include only the platform_data
header.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-next/hinic: replace disable_irq_nosync/enable_irq

In order to avoid frequent system interrupts when sending and
receiving packets. we replace disable_irq_nosync/enable_irq
with hinic_set_msix_state(), hinic_set_msix_state is used to
access memory mapped hinic devices.

Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add ndo_get_phys_port_name() for CPU port

There is not currently way to infer the port number through sysfs that
is being used as the CPU port number. Overlay a ndo_get_phys_port_name()
operation onto the DSA master network device in order to retrieve that
information.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: networking: dsa: Update documentation

Since 83c0afaec7b7 ("net: dsa: Add new binding implementation"), DSA is
no longer a platform device exclusively and can support registering DSA
switches from other bus drivers (PCI, USB, I2C, etc.).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4/l2t: Use struct_size() in kvzalloc()

One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = kvzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: meter: Use struct_size() in kzalloc()

One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: don't include asm/irq.h directly

There's no need to and one shouldn't include asm/irq.h directly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: improve logging in phylib

Some time ago phydev_info() and friends have been added. They allow to
improve and simplify logging.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: remove preliminary workaround for not loading PHY driver

This workaround attempt helped for some but not all affected users.
With commit 11287b693d03 ("r8169: load Realtek PHY driver module
before r8169") we have a better workaround now, so we an remove
the first attempt.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-flower-improve-flower-resilience'

Jakub Kicinski says:

====================
nfp: flower: improve flower resilience

This series contains mostly changes which improve nfp flower
offload's resilience, but are too large or risky to push into net.

Fred makes the driver waits for flower FW responses uninterruptible,
and a little longer (~40ms).

Pieter adds support for cards with multiple rule memories.

John reworks the MAC offloads. He says:
> When potential tunnel end-point MACs are offloaded, they are assigned an
> index. This index may be associated with a port number meaning that if a
> packet matches an offloaded MAC address on the card, then the ingress
> port for that MAC can also be verified. In the case of shared MACs (e.g.
> on a linux bond) there may be situations where this index maps to only
> one of the ports that share the MAC.
>
> The idea of 'global' MAC indexes are supported that bypass the check on
> ingress port on the NFP. The patchset tracks shared MACs and assigns
> global indexes to these. It also ensures that port based indexes are
> re-applied if a single port becomes the only user of an offloaded MAC.
>
> Other patches in the set aim to tidy code without changing functionality.
> There is also a delete offload message introduced to ensure that MACs no
> longer in use in kernel space are removed from the firmware lookup tables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: enable MAC address sharing for offloadable devs

A MAC address is not necessarily a unique identifier for a netdev. Drivers
such as Linux bonds, for example, can apply the same MAC address to the
upper layer device and all lower layer devices.

NFP MAC offload for tunnel decap includes port verification for reprs but
also supports the offload of non-repr MAC addresses by assigning 'global'
indexes to these. This means that the FW will not verify the incoming port
of a packet matching this destination MAC.

Modify the MAC offload logic to assign global indexes based on MAC address
instead of net device (as it currently does). Use this to allow multiple
devices to share the same MAC. In other words, if a repr shares its MAC
address with another device then give the offloaded MAC a global index
rather than associate it with an ingress port. Track this so that changes
can be reverted as MACs stop being shared.

Implement this by removing the current list based assignment of global
indexes and replacing it with an rhashtable that maps an offloaded MAC
address to the number of devices sharing it, distributing global indexes
based on this.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: ensure MAC cleanup on address change

It is possible to receive a MAC address change notification without the
net device being down (e.g. when an OvS bridge is assigned the same MAC as
a port added to it). This means that an offloaded MAC address may not be
removed if its device gets a new address.

Maintain a record of the offloaded MAC addresses for each repr and netdev
assigned a MAC offload index. Use this to delete the (now expired) MAC if
a change of address event occurs. Only handle change address events if the
device is already up - if not then the netdev up event will handle it.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: add infastructure for non-repr priv data

NFP repr netdevs contain private data that can store per port information.
In certain cases, the NFP driver offloads information from non-repr ports
(e.g. tunnel ports). As the driver does not have control over non-repr
netdevs, it cannot add/track private data directly to the netdev struct.

Add infastructure to store private information on any non-repr netdev that
is offloaded at a given time. This is used in a following patch to track
offloaded MAC addresses for non-reprs and enable correct house keeping on
address changes.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: ensure deletion of old offloaded MACs

When a potential tunnel end point goes down then its MAC address should
not be matchable on the NFP.

Implement a delete message for offloaded MACs and call this on net device
down. While at it, remove the actions on register and unregister netdev
events. A MAC should only be offloaded if the device is up. Note that the
netdev notifier will replay any notifications for UP devices on
registration so NFP can still offload ports that exist before the driver
is loaded. Similarly, devices need to go down before they can be
unregistered so removal of offloaded MACs is only required on down events.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: remove list infastructure from MAC offload

Potential MAC destination addresses for tunnel end-points are offloaded to
firmware. This was done by building a list of such MACs and writing to
firmware as blocks of addresses.

Simplify this code by removing the list format and sending a new message
for each offloaded MAC.

This is in preparation for delete MAC messages. There will be one delete
flag per message so we cannot assume that this applies to all addresses
in a list.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: ignore offload of VF and PF repr MAC addresses

Currently MAC addresses of all repr netdevs, along with selected non-NFP
controlled netdevs, are offloaded to FW as potential tunnel end-points.
However, the addresses of VF and PF reprs are meaningless outside of
internal communication and it is only those of physical port reprs
required.

Modify the MAC address offload selection code to ignore VF/PF repr devs.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: tidy tunnel related private data

Recent additions to the flower app private data have grouped the variables
of a given feature into a struct and added that struct to the main private
data struct.

In keeping with this, move all tunnel related private data to their own
struct. This has no affect on functionality but improves readability and
maintenance of the code.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: support multiple memory units for filter offloads

Adds support for multiple memory units which are used for filter
offloads. Each filter is assigned a stats id, the MSBs of the id are
used to determine which memory unit the filter should be offloaded
to. The number of available memory units that could be used for filter
offload is obtained from HW. A simple round robin technique is used to
allocate and distribute the ids across memory units.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: increase cmesg reply timeout

QA tests report occasional timeouts on REIFY message replies. Profiling
of the two cmesg reply types under burst conditions, with a 12-core host
under heavy cpu and io load (stress --cpu 12 --io 12), show both PHY MTU
change and REIFY replies can exceed the 10ms timeout. The maximum MTU
reply wait under burst is 16ms, while the maximum REIFY wait under 40 VF
burst is 12ms. Using a 4 VF REIFY burst results in an 8ms maximum wait.
A larger VF burst does increase the delay, but not in a linear enough
way to justify a scaled REIFY delay. The worse case values between
MTU and REIFY appears close enough to justify a common timeout. Pick a
conservative 40ms to make a safer future proof common reply timeout. The
delay only effects the failure case.

Change the REIFY timeout mechanism to use wait_event_timeout() instead
of wait_event_interruptible_timeout(), to match the MTU code. In the
current implementation, theoretically, a signal could interrupt the
REIFY waiting period, with a return code of ERESTARTSYS. However, this is
caught under the general timeout error code EIO. I cannot see the benefit
of exposing the REIFY waiting period to signals with such a short delay
(40ms), while the MTU mechnism does not use the same logic. In the absence
of any reply (wakeup() call), both reply types will wake up the task after
the timeout period. The REIFY timeout applies to the entire representor
group being instantiated (e.g. VFs), while the MTU timeout apples to a
single PHY MTU change.

Signed-off-by: Fred Lotter <frederik.lotter@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sungem: fix indentation, remove a tab

The declaration of variable 'found' is one level too deep, fix this by
removing a tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: atp: fix various indentation issues

There are various lines that have indentation issues, fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: fix various indentation issues

There are lines that have indentation issues, fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

networking: Documentation: fix snmp_counters.rst Sphinx warnings

Fix over 100 documentation warnings in snmp_counter.rst by
extending the underline string lengths and inserting a blank line
after bullet items.

Examples:

Documentation/networking/snmp_counter.rst:1: WARNING: Title overline too short.
Documentation/networking/snmp_counter.rst:14: WARNING: Bullet list ends without a blank line; unexpected unindent.

Fixes: 2b96547223e3 ("add document for TCP OFO, PAWS and skip ACK counters")
Fixes: 8e2ea53a83df ("add snmp counters document")
Fixes: 712ee16c230f ("add documents for snmp counters")
Fixes: 80cc49507ba4 ("net: Add part of TCP counts explanations in snmp_counters.rst")
Fixes: b08794a922c4 ("documentation of some IP/ICMP snmp counters")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net, decnet: use struct_size() in kzalloc()

One of the more common cases of allocation size calculations is finding the
size of a structure that has a zero-sized array at the end, along with memory
for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_nve: Use struct_size() in kzalloc()

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
int stuff;
struct boo entry[];
};

instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_acl_bloom_filter: use struct_size() in kzalloc()

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
int stuff;
void *entry[];
};

instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);

This issue was detected with the help of Coccinelle.

Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: dsa: ksz9477: fix indentation for switch spi bindings

Switch bindings for spi managed mode are using spaces instead of tabs.
Fix them to get a file with a proper kernel indentation style.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Sergio Paracuellos <sergio.paracuellos@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Fix ERROR:do not initialise statics to 0 in af_vsock.c

Found by scripts/checkpatch.pl
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2019-01-15

This series contains updates to the ice driver only.

Bruce fixes an unused variable build warning, which was introduced with
the commit 2fd527b72bb6 ("net: ndo_bridge_setlink: Add extack").  Added
ethtool support for get_eeprom and get_eeprom_len operations.  Added
support for bringing down the PHY link optional when the interface is
administratively downed.

Anirudh refactors the transmit scheduler functions, which results in
reduced code duplication and adds a helper function, which all the
scheduler functions call instead.  Added an LED blinking handler to
ethtool.  Reworked the queue management code to allow for reuse in
future XDP feature support.  Updates the driver to be able to preserve
the aggregator list after reset by moving it out of port_info and into
ice_hw.  Added the ability to offload SCTP checksum calculation to the
hardware.  Added support for new PHY types, which support higher link
speeds.

Md Fahad makes sure that RSS lookup table and hash key get configured
during the rebuild path after a reset.

Brett updates the driver to set the physical link state according to the
netdev state (up/down).  Added support for adaptive/dynamic interrupt
moderation in the ice driver, along with the ethtool operations needed.

Tony adds software timestamping support by using
ethtool_op_get_ts_info().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ice: add const qualifier to mac_addr parameter

The function ice_aq_manage_mac_write takes a pointer to a MAC address.
The parameter is not marked const, even though the function doesn't need
to modify it. This prevents passing a parameter that is already marked
const. Update the function prototype to take a const pointer, to allow
passing constant pointers to this function.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add support for new PHY types

This patch adds code for the detection and operation of several
additional PHY types that support higher link speeds.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Offload SCTP checksum

This patch adds the ability to offload SCTP checksum calculations to the
NIC.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Allow for software timestamping

Use ethtool_op_get_ts_info to provide software timestamping.

Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Implement getting and setting ethtool coalesce

This patch includes the following ethtool operations:

1. get_coalesce
2. set_coalesce
3. get_per_q_coalesce
4. set_per_q_coalesce

Each ITR value (current_itr/target_itr) are stored on a per
ice_ring_container basis. This is because each valid ice_ring_container
can have 1 or more rings that are tied to the same q_vector ITR index.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add support for adaptive interrupt moderation

Currently the driver does not support adaptive/dynamic interrupt
moderation. This patch adds support for this. Also, adaptive/dynamic
interrupt moderation is turned on by default upon driver load.

In order to support adaptive interrupt moderation, two functions were
added, ice_update_itr() and ice_itr_divisor(). These are used to
determine the current packet load and to determine a divisor based
on link speed respectively.

This patch also adds the ICE_ITR_GRAN_S define that is used in the
hot-path when setting a new ITR value. The shift is used to pet two
birds with one hand, set the ITR value while re-enabling the
interrupt. Also, the ICE_ITR_GRAN_S is defined as 1 because the device
has a ITR granularity of 2usecs.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Move aggregator list into ice_hw instance

The aggregator list needs to be preserved for use after a reset. This
patch moves it out of the port_info instance and into the ice_hw instance.

Signed-off-by: Tarun Singh <tarun.k.singh@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Rework queue management code for reuse

This patch reworks the queue management code to allow for reuse with the
XDP feature (to be added in a future patch).

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add ethtool private flag to make forcing link down optional

Add new infrastructure for implementing ethtool private flags using the
existing pf->flags bitmap to store them, and add the link-down-on-close
ethtool private flag to optionally bring down the PHY link when the
interface is administratively downed.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Set physical link up/down when an interface is set up/down

When a netdev is set up/down we need to set the phsyical link state
accordingly. This patch adds that functionality by calling
ice_force_phys_link_state(vsi, link_up) in both the ice_stop() and
ice_open() paths.

In order to force link, ice_force_phys_link_state(vsi, link_up) will
first determine the current phy capabilities. If link has not changed
there is nothing to do. If link has changed, previous PHY capabilities
are saved and the "Enable Automatic Link Update" and "Link Establishment
State Machine (LESM)" enable bits are set. Then the new PHY config is
saved. The "Enable Automatic Link Update" will force the FW to execute
Setup link and restart auto-negotiation. This *should* then result in a
"Link Status Event (LSE)" which will cause the driver to get the current
link status.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Implement support for normal get_eeprom[_len] ethtool ops

Add support for get_eeprom and get_eeprom_len ethtool ops

Specification states that PF software accesses NVM (shadow-ram) via AQ
commands (e.g. NVM Read, NVM Write) in the range 0x000000-0x00FFFF (64KB),
so the get_eeprom_len op should return 64KB. If additional regions of the
16MB NVM must be read, another access method must be used.

The ethtool kernel code, by default, will ask for multiple page-size hunks
of the NVM not to exceed the value returned by ice_get_eeprom_len().
ice_read_sr_buf() deals with arch page sizes different than 4KB.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Add ethtool set_phys_id handler

Add led blinking handler to ethtool. Since led blinking is
controlled by FW/HW only ETHTOOL_ID_ACTIVE and ETHTOOL_ID_INACTIVE
are really needed.

Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Configure RSS LUT and HASH KEY in rebuild path

This patch configures the RSS lookup table and hash key post reset.

Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ice: Refactor a few Tx scheduler functions

The following functions were refactored to call a new common function,
ice_aqc_send_sched_elem_cmd():

- ice_aq_add_sched_elems()
- ice_aq_delete_sched_elems()
- ice_aq_move_sched_elems()
- ice_aq_query_sched_elems()
- ice_aq_cfg_sched_elems()
- ice_aq_suspend_sched_elems()
- ice_aq_resume_sched_elems()

Signed-off-by: Greg Priest <greg.priest@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge tag 'trace-v5.0-rc1' of git://git./linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
"Andrea Righi fixed a NULL pointer dereference in trace_kprobe_create()

  It is possible to trigger a NULL pointer dereference by writing an
  incorrectly formatted string to the krpobe_events file"

* tag 'trace-v5.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create()

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
    Gautier.

2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.

3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
    Brivio.

4) Use after free in hns driver, from Yonglong Liu.

5) icmp6_send() needs to handle the case of NULL skb, from Eric
    Dumazet.

6) Missing rcu read lock in __inet6_bind() when operating on mapped
    addresses, from David Ahern.

7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva.

8) Fix PHY vs r8169 module loading ordering issues, from Heiner
    Kallweit.

9) Fix bridge vlan memory leak, from Ido Schimmel.

10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.

11) Infoleak in ipv6_local_error(), flow label isn't completely
    initialized. From Eric Dumazet.

12) Handle mv88e6390 errata, from Andrew Lunn.

13) Making vhost/vsock CID hashing consistent, from Zha Bin.

14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.

15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  bnxt_en: Fix context memory allocation.
  bnxt_en: Fix ring checking logic on 57500 chips.
  mISDN: hfcsusb: Use struct_size() in kzalloc()
  net: clear skb->tstamp in bridge forwarding path
  net: bpfilter: disallow to remove bpfilter module while being used
  net: bpfilter: restart bpfilter_umh when error occurred
  net: bpfilter: use cleanup callback to release umh_info
  umh: add exit routine for UMH process
  isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
  vhost/vsock: fix vhost vsock cid hashing inconsistent
  net: stmmac: Prevent RX starvation in stmmac_napi_poll()
  net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
  net: stmmac: Check if CBS is supported before configuring
  net: stmmac: dwxgmac2: Only clear interrupts that are active
  net: stmmac: Fix PCI module removal leak
  tools/bpf: fix bpftool map dump with bitfields
  tools/bpf: test btf bitfield with >=256 struct member offset
  bpf: fix bpffs bitfield pretty print
  net: ethernet: mediatek: fix warning in phy_start_aneg
  tcp: change txhash on SYN-data timeout
  ...

ice: Fix unused variable build warning

Commit 2fd527b72bb6 ("net: ndo_bridge_setlink: Add extack") added a new
parameter "extack" to ice_bridge_setlink but this parameter isn't used
by the function. This results in a warning: unused parameter ‘extack’
[-Wunused-parameter]. Fix that by adding an "__always_unused" qualifier.

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create()

It is possible to trigger a NULL pointer dereference by writing an
incorrectly formatted string to krpobe_events (trying to create a
kretprobe omitting the symbol).

Example:

echo "r:event_1 " >> /sys/kernel/debug/tracing/kprobe_events

That triggers this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 6 PID: 1757 Comm: bash Not tainted 5.0.0-rc1+ #125
Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.5.1 08/09/2018
RIP: 0010:kstrtoull+0x2/0x20
Code: 28 00 00 00 75 17 48 83 c4 18 5b 41 5c 5d c3 b8 ea ff ff ff eb e1 b8 de ff ff ff eb da e8 d6 36 bb ff 66 0f 1f 44 00 00 31 c0 <80> 3f 2b 55 48 89 e5 0f 94 c0 48 01 c7 e8 5c ff ff ff 5d c3 66 2e
RSP: 0018:ffffb5d482e57cb8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff82b12720
RDX: ffffb5d482e57cf8 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb5d482e57d70 R08: ffffa0c05e5a7080 R09: ffffa0c05e003980
R10: 0000000000000000 R11: 0000000040000000 R12: ffffa0c04fe87b08
R13: 0000000000000001 R14: 000000000000000b R15: ffffa0c058d749e1
FS:  00007f137c7f7740(0000) GS:ffffa0c05e580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000497d46004 CR4: 00000000003606e0
Call Trace:
  ? trace_kprobe_create+0xb6/0x840
  ? _cond_resched+0x19/0x40
  ? _cond_resched+0x19/0x40
  ? __kmalloc+0x62/0x210
  ? argv_split+0x8f/0x140
  ? trace_kprobe_create+0x840/0x840
  ? trace_kprobe_create+0x840/0x840
  create_or_delete_trace_kprobe+0x11/0x30
  trace_run_command+0x50/0x90
  trace_parse_run_command+0xc1/0x160
  probes_write+0x10/0x20
  __vfs_write+0x3a/0x1b0
  ? apparmor_file_permission+0x1a/0x20
  ? security_file_permission+0x31/0xf0
  ? _cond_resched+0x19/0x40
  vfs_write+0xb1/0x1a0
  ksys_write+0x55/0xc0
  __x64_sys_write+0x1a/0x20
  do_syscall_64+0x5a/0x120
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix by doing the proper argument checks in trace_kprobe_create().

Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/lkml/20190111095108.b79a2ee026185cbd62365977@kernel.org
Link: http://lkml.kernel.org/r/20190111060113.GA22841@xps-13
Fixes: 6212dd29683e ("tracing/kprobes: Use dyn_event framework for kprobe events")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

sbitmap: Protect swap_lock from hardirq

Because we may call blk_mq_get_driver_tag() directly from
blk_mq_dispatch_rq_list() without holding any lock, then HARDIRQ may
come and the above DEADLOCK is triggered.

Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to
fix this issue by using 'spin_lock_bh', which isn't enough because we
complete request from hardirq context direclty in case of multiqueue.

Cc: Clark Williams <williams@redhat.com>
Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sbitmap: Protect swap_lock from softirqs

The swap_lock used by sbitmap has a chain with locks taken from softirq,
but the swap_lock is not protected from being preempted by softirqs.

A chain exists of:

sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock

Where the sbq->ws[i].wait lock can be taken from softirq context, which
means all locks below it in the chain must also be protected from
softirqs.

Reported-by: Clark Williams <williams@redhat.com>
Fixes: 58ab5e32e6fd ("sbitmap: silence bogus lockdep IRQ warning")
Fixes: ea86ea2cdced ("sbitmap: amortize cost of clearing bits")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'gpio-v5.0-2' of git://git./linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
"The patch hitting the MMC/SD subsystem is fixing up my own mess when
  moving semantics from MMC/SD over to gpiolib. Ulf is on vacation but I
  managed to reach him on chat and obtain his ACK.

  The other two are early-rc fixes that are not super serious but pretty
  annoying so I'd like to get rid of them.

  Summary:

   - Get rid of some WARN_ON() from the ACPI code

   - Staticize a symbol

   - Fix MMC polarity detection"

* tag 'gpio-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  mmc: core: don't override the CD GPIO level when "cd-inverted" is set
  gpio: pca953x: Make symbol 'pca953x_i2c_regmap' static
  gpiolib-acpi: Remove unnecessary WARN_ON from acpi_gpiochip_free_interrupts

Merge tag 'mfd-next-4.21' of git://git./linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"New Device Support
   - Add support for Power Supply to AXP813
   - Add support for GPIO, ADC, AC and Battery Power Supply to AXP803
   - Add support for UART to Exynos LPASS

  Fix-ups:
   - Use supplied MACROS; ti_am335x_tscadc
   - Trivial spelling/whitespace/alignment; tmio, axp20x, rave-sp
   - Regmap changes; bd9571mwv, wm5110-tables
   - Kconfig dependencies; MFD_AT91_USART
   - Supply shared data for child-devices; madera-core
   - Use new of_node_name_eq() API call; max77620, stmpe
   - Use managed resources (devm_*); tps65218
   - Comment descriptions; ingenic-tcu
   - Coding style; madera-core

  Bug Fixes:
   - Fix section mismatches; twl-core, db8500-prcmu
   - Correct error path related issues; mt6397-core, ab8500-core, mc13xxx-core
   - IRQ related fixes; tps6586x
   - Ensure proper initialisation sequence; qcom_rpm
   - Repair potential memory leak; cros_ec_dev"

* tag 'mfd-next-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (25 commits)
  mfd: exynos-lpass: Enable UART module support
  mfd: mc13xxx: Fix a missing check of a register-read failure
  mfd: cros_ec: Add commands to control codec
  mfd: madera: Remove spurious semicolon in while loop
  mfd: rave-sp: Fix typo in rave_sp_checksum comment
  mfd: ingenic-tcu: Fix bit field description in header
  mfd: tps65218: Use devm_regmap_add_irq_chip and clean up error path in probe()
  mfd: Use of_node_name_eq() for node name comparisons
  mfd: cros_ec_dev: Add missing mfd_remove_devices() call in remove
  mfd: axp20x: Add supported cells for AXP803
  mfd: axp20x: Re-align MFD cell entries
  mfd: axp20x: Add AC power supply cell for AXP813
  mfd: wm5110: Add missing ASRC rate register
  mfd: qcom_rpm: write fw_version to CTRL_REG
  mfd: tps6586x: Handle interrupts on suspend
  mfd: madera: Add shared data for accessory detection
  mfd: at91-usart: Add platform dependency
  mfd: bd9571mwv: Add volatile register to make DVFS work
  mfd: ab8500-core: Return zero in get_register_interruptible()
  mfd: tmio: Typo s/use use/use/
  ...

Merge tag 'backlight-next-4.21' of git://git./linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
"Fix-ups:
   - Use new of_node_name_eq() API call

  Bug Fixes:
   - Internally track 'enabled' state in pwm_bl
   - Fix auto-generated pwm_bl brightness tables parsed by DT

* tag 'backlight-next-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: 88pm860x_bl: Use of_node_name_eq for node name comparisons
  backlight: pwm_bl: Fix devicetree parsing with auto-generated brightness tables
  backlight: pwm_bl: Re-add driver internal enabled tracking

Linux 5.0-rc2

kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore

UNAME26 is a mechanism to report Linux's version as 2.6.x, for
compatibility with old/broken software.  Due to the way it is
implemented, it would have to be updated after 5.0, to keep the
resulting versions unique.  Linus Torvalds argued:

"Do we actually need this?

  I'd rather let it bitrot, and just let it return random versions. It
  will just start again at 2.4.60, won't it?

  Anybody who uses UNAME26 for a 5.x kernel might as well think it's
  still 4.x. The user space is so old that it can't possibly care about
  differences between 4.x and 5.x, can it?

  The only thing that matters is that it shows "2.4.<largeenough>",
  which it will do regardless"

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>