git.openwrt.org Git - openwrt/staging/blogic.git/log

Merge branch 'devlink-Add-configuration-parameters-support'

Moshe Shemesh says:

====================
Add configuration parameters support

Add configuration parameters setting through devlink.
Each device registers supported configuration parameters table.
Each parameter can be either generic or driver specific.
The user can retrieve data on these parameters by "devlink param show"
command and can set new value to a parameter by "devlink param set"
command.
The parameters can be set in different configuration modes:
  runtime - set while driver is running, no reset required.
  driverinit - applied while driver initializes, requires restart
               driver by devlink reload command.
  permanent - written to device's non-volatile memory, hard reset required.

The patches at the end of the patchset introduce few params that are using
the introduced infrastructure on mlx4 and bnxt.

Command examples and output:

pci/0000:81:00.0:
  name internal_error_reset type generic
    values:
      cmode runtime value true
      cmode driverinit value true
  name max_macs type generic
    values:
      cmode driverinit value 128
  name enable_64b_cqe_eqe type driver-specific
    values:
      cmode driverinit value true
  name enable_4k_uar type driver-specific
    values:
      cmode driverinit value false

pci/0000:81:00.0:
  name internal_error_reset type generic
    values:
      cmode runtime value false
      cmode driverinit value true
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add bnxt_en initial params table and register it.

Create initial devlink parameters table for bnxt_en.
Table consists of a permanent generic parameter.

enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV)
characteristic of the device.

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add enable_sriov boolean generic parameter

enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV)
characteristic of the device.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: Add support for devlink reload and load driverinit values

Add mlx4_devlink_reload() to support devlink reload operation.
Add mlx4_devlink_param_load_driverinit_values() to load values which
were set using driverinit configuration mode.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: Add mlx4 initial parameters table and register it

Create initial parameters table for mlx4.
The table consists of two generic parameters and two driver-specific
parameters.
Generic:
  internal_err_reset - Enable reset device on internal errors. This
  parameter can be configured on mlx4 either on runtime or during driver
  initialization.
  max_macs - Max number of MACs per ETH port. For mlx4 this parameter
  value range is between 1 and 128. This parameter can be configured on
  mlx4 only during driver initialization.
Driver specific:
  enable_64b_cqe_eqe - Enable 64 byte CQEs/EQEs when the FW supports it.
  This parameter can be configured on mlx4 only during driver
  initialization.
  enable_4k_uar - Enable using 4K UAR. This parameter can be configured on
  mlx4 only during driver initialization.

Register the parameters table on mlx4_init_one() and unregister on
mlx4_remove_one().

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add generic parameters internal_err_reset and max_macs

Add 2 first generic parameters to devlink configuration parameters set:
internal_err_reset - When set enables reset device on internal errors.
max_macs - max number of MACs per ETH port.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add devlink notifications support for params

Add devlink_param_notify() function to support devlink param notifications.
Add notification call to devlink param set, register and unregister
functions.
Add devlink_param_value_changed() function to enable the driver notify
devlink on value change. Driver should use this function after value was
changed on any configuration mode part to driverinit.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add support for get/set driverinit value

"driverinit" configuration mode value is held by devlink to enable
the driver query the value after reload. Two additional functions
added to help the driver get/set the value from/to devlink:
devlink_param_driverinit_value_set() and
devlink_param_driverinit_value_get().

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add param set command

Add param set command to set value for a parameter.
Value can be set to any of the supported configuration modes.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add param get command

Add param get command which gets data per parameter.
Option to dump the parameters data per device.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add devlink_param register and unregister

Define configuration parameters data structure.
Add functions to register and unregister the driver supported
configuration parameters table.
For each parameter registered, the driver should fill all the parameter's
fields. In case the only supported configuration mode is "driverinit"
the parameter's get()/set() functions are not required and should be set
to NULL, for any other configuration mode, these functions are required
and should be set by the driver.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/hamradio/6pack: remove redundant variable channel

Variable channel is being assigned but is never used hence it is
redundant and can be removed.

Cleans up two clang warnings:
warning: variable 'channel' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fjes: use currently unused variable my_epid and max_epid

Variables my_epid and max_epid are currently assigned and not being
used - however, I suspect they were intended to be used in the for-loops
to reduce the dereferencing of hw. Replace hw->my_epid and hw->max_epid
with these variables.

Cleans up clang warnings:
warning: variable 'my_epid' set but not used [-Wunused-but-set-variable]
variable 'max_epid' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tehuti: remove redundant pointer skb

Pointer skb is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'skb' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: socionext: remove redundant pointer ndev

Pointer ndev is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'ndev' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Make some functions static

Fixes the following sparse warnings:

drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:525:5: warning:
symbol 'hw_atl_utils_mpi_set_speed' was not declared. Should it be static?
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:536:5: warning:
symbol 'hw_atl_utils_mpi_set_state' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: vsc73xx: Make some functions static

Fixes the following sparse warnings:

drivers/net/dsa/vitesse-vsc73xx.c:1054:6: warning:
symbol 'vsc73xx_get_strings' was not declared. Should it be static?
drivers/net/dsa/vitesse-vsc73xx.c:1113:5: warning:
symbol 'vsc73xx_get_sset_count' was not declared. Should it be static?
drivers/net/dsa/vitesse-vsc73xx.c:1122:6: warning:
symbol 'vsc73xx_get_ethtool_stats' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: fix spelling mistake "waitting" -> "waiting"

Trivial fix to spelling mistake in dev_err error message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: limit each hash list length to MAX_GRO_SKBS

After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash
table.")' there is 8 hash buckets, which allows more flows to be held for
merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still,
limit the hash table performance.

keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not
the total 8 skb

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: fix runtime suspend

When runtime-suspending we configure WoL w/o touching saved_wolopts.
If saved_wolopts == 0 we would power down the PHY in this case what's
wrong. Therefore we have to check the actual chip WoL settings here.

Fixes: 433f9d0ddcc6 ("r8169: improve saved_wolopts handling")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()

Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
the skb, we can't use the list_cut_before() method; we can't even do a
list_del(&skb->list) in the drop case, because skb might have already been
freed and reused.
So instead, take each skb off the source list before processing, and add it
to the sublist afterwards if it wasn't freed or stolen.

Fixes: 5fa12739a53d net: ipv4: listify ip_rcv_finish
Fixes: 17266ee93984 net: ipv4: listified version of ip_rcv
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support to read actual provisioned resources

In highly constrained resources environments (like the 124VF
T5 and 248VF T6 configurations), PF4 may not have very many
resources at all and we need to adapt to whatever we've been
allocated, this patch adds support to get the provisioned
resources.

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

epic100: remove redundant variable 'irq'

Variable 'irq' is being assigned but is never used hence it is
and can be removed.

Cleans up clang warning:
warning: variable 'irq' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: remove redundant variable old_vlan

Variable old_vlan is being assigned but is never used hence it is
and can be removed.

Cleans up clang warning:
warning: variable 'old_vlan' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: remove redundant pointer 'name'

Pointer 'name' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'name' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: micrel: remove redundant pointer 'info'

Pointer 'info' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'info' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hinic: remove redundant pointer pfhwdev

Pointer pfhwdev is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'pfhwdev' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: remove redundant variable 'protocol'

Variable 'protocol' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'protocol' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: gianfar_ethtool: remove redundant variable last_rule_idx

Variable last_rule_idx is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'last_rule_idx' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: remove redundant variable 'inc'

Variable 'inc' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'inc' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cnic: remove redundant pointer req and variable func

Pointer req and variable func are being assigned but are never used
hence they are redundant and can be removed.

Cleans up clang warnings:
warning: variable 'req' set but not used [-Wunused-but-set-variable]
warning: variable 'func' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bgmac: remove redundant variable 'freed'

Variable 'freed' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'freed' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: nb8800: remove redundant pointer rxd

Pointer rxd is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'rxb' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: alx: remove redundant variable old_duplex

Variable old_duplex is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'old_duplex' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: alteon: acenic: remove redundant pointer rxdesc

Pointer rxdesc is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'rxdesc' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: remove redundant variable off

Variable 'off' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
warning: variable 'off' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Scheduled-packet-Transmission-ETF'

Jesus Sanchez-Palencia says:

====================
Scheduled packet Transmission: ETF

Changes since v1:
  - moved struct sock_txtime from socket.h to uapi net_tstamp.h;
  - sk_clockid was changed from u16 to u8;
  - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock;
  - the socket option flags are now validated in sock_setsockopt();
  - added SO_EE_ORIGIN_TXTIME;
  - sockc.transmit_time is now initialized from all IPv4 Tx paths;
  - added support for the IPv6 Tx path;

Overview
========

This work consists of a set of kernel interfaces that can be used by
applications that require (time-based) Scheduled Tx of packets.
It is comprised by 3 new components to the kernel:

  - SO_TXTIME: socket option + cmsg programming interfaces.

  - etf: the "earliest txtime first" qdisc, that provides per-queue
TxTime-based scheduling. This has been renamed from 'tbs' to
'etf' to better describe its functionality.

  - taprio: the "time-aware priority scheduler" qdisc, that provides
    per-port Time-Aware scheduling;

This patchset is providing the first 2 components, which have been
developed for longer. The taprio qdisc will be shared as an RFC separately
(shortly).

Note that this series is a follow up of the "Time based packet
transmission" RFCv3 [1].

etf (formerly known as 'tbs')
=============================

For applications/systems that the concept of time slices isn't precise
enough, the etf qdisc allows applications to control the instant when
a packet should leave the network controller. When used in conjunction
with taprio, it can also be used in case the application needs to
control with greater guarantee the offset into each time slice a packet
will be sent. Another use case of etf, is when only a small number of
applications on a system are time sensitive, so it can then be used
with a more traditional root qdisc (like mqprio).

The etf qdisc is designed so it buffers packets until a configurable
time before their deadline (Tx time). The qdisc uses a rbtree internally
so the buffered packets are always 'ordered' by their txtime (deadline)
and will be dequeued following the earliest txtime first.

It relies on the SO_TXTIME API set for receiving the per-packet timestamp
(txtime) as well as the config flags for each socket: the clockid to be
used as a reference, if the expected mode of txtime for that socket is
deadline or strict mode, and if packet drops should be reported on the
socket's error queue or not.

The qdisc will drop any packets with a Tx time in the past, or if a
packet expires while waiting for being dequeued. Drops can be reported
as errors back to userspace through the socket's error queue.

Example configuration:

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 200000 \
            clockid CLOCK_TAI

Here, the Qdisc will use HW offload for the txtime control.
Packets will be dequeued by the qdisc "delta" (200000) nanoseconds before
their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by hrtimers, the system clock
and the PHC clock must be synchronized for this mode to behave as expected.

A more complete example can be found here, with instructions of how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2]

Note that we haven't modified the qdisc so it uses a timerqueue because
the modification needed was increasing the number of cachelines of a sk_buff.

This series is also hosted on github and can be found at [3].
The companion iproute2 patches can be found at [4].

[1] https://patchwork.ozlabs.org/cover/882342/

[2] github doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f scheduled-tx-tests

[3] https://github.com/jeez/linux/tree/etf-v2

[4] https://github.com/jeez/iproute2/tree/etf-v2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Make etf report drops on error_queue

Use the socket error queue for reporting dropped packets if the
socket has enabled that feature through the SO_TXTIME API.

Packets are dropped either on enqueue() if they aren't accepted by the
qdisc or on dequeue() if the system misses their deadline. Those are
reported as different errors so applications can react accordingly.

Userspace can retrieve the errors through the socket error queue and the
corresponding cmsg interfaces. A struct sock_extended_err* is used for
returning the error data, and the packet's timestamp can be retrieved by
adding both ee_data and ee_info fields as e.g.:

((__u64) serr->ee_data << 32) + serr->ee_info

This feature is disabled by default and must be explicitly enabled by
applications. Enabling it can bring some overhead for the Tx cycles
of the application.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Add support for ETF offload

Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_ETF and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it setup on.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the ETF qdisc already.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Only call skb_tx_timestamp after descriptors are ready

Currently, skb_tx_timestamp() is being called before the Tx
descriptors are prepared in igb_xmit_frame_ring(), which happens
during either the igb_tso() or igb_tx_csum() calls.

Given that now the skb->tstamp might be used to carry the timestamp
for SO_TXTIME, we must only call skb_tx_timestamp() after the
information has been copied into the Tx descriptors.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Refactor igb_offload_cbs()

Split code into a separate function (igb_offload_apply()) that will be
used by ETF offload implementation.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Only change Tx arbitration when CBS is on

Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS we must check if it has been disabled
for all queues, and clear the DataTranARB accordingly.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Refactor igb_configure_cbs()

Make this function retrieve what it needs from the Tx ring being
addressed since it already relies on what had been saved on it before.
Also, since this function will be used by the upcoming Launchtime
patches rename it to better reflect its intention. Note that
Launchtime is not part of what 802.1Qav specifies, but the i210
datasheet refers to this set of functionality as "Qav Transmission
Mode".

Here we also perform a tiny refactor at is_any_cbs_enabled(), and add
further documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Add HW offloading capability to ETF

Add infra so etf qdisc supports HW offload of time-based transmission.

For hw offload, the time sorted list is still used, so packets are
dequeued always in order of txtime.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \
clockid CLOCK_REALTIME

In this example, the Qdisc will use HW offload for the control of the
transmission time through the network adapter. The hrtimer used for
packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME
as reference and packets leave the Qdisc "delta" (100000) nanoseconds
before their transmission time. Because this will be using HW offload and
since dynamic clocks are not supported by the hrtimer, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Introduce the ETF Qdisc

The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.

For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
clockid CLOCK_TAI

In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (100000) nanoseconds before its transmission
time.

The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if it expires while being enqueued.

The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().

Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Allow creating a Qdisc watchdog with other clocks

This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be
passed, this allows other time references to be used when scheduling
the Qdisc to run.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: packet: Hook into time based transmission.

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv6: Hook into time based transmission

Add a struct sockcm_cookie parameter to ip6_setup_cork() so
we can easily re-use the transmit_time field from struct inet_cork
for most paths, by copying the timestamp from the CMSG cookie.
This is later copied into the skb during __ip6_make_skb().

For the raw fast path, also pass the sockcm_cookie as a parameter
so we can just perform the copy at rawv6_send_hdrinc() directly.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Hook into time based transmission

Add a transmit_time field to struct inet_cork, then copy the
timestamp from the CMSG cookie at ip_setup_cork() so we can
safely copy it into the skb later during __ip_make_skb().

For the raw fast path, just perform the copy at raw_send_hdrinc().

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add a new socket option for a future transmit time.

This patch introduces SO_TXTIME. User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2). The argument to this socket option is a 8-bytes long struct
provided by the uapi header net_tstamp.h defined as:

struct sock_txtime {
clockid_t clockid;
u32 flags;
};

Note that new fields were added to struct sock by filling a 2-bytes
hole found in the struct. For that reason, neither the struct size or
number of cachelines were altered.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Clear skb->tstamp only on the forwarding path

This is done in preparation for the upcoming time based transmission
patchset. Now that skb->tstamp will be used to hold packet's txtime,
we must ensure that it is being cleared when traversing namespaces.
Also, doing that from skb_scrub_packet() before the early return would
break our feature when tunnels are used.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Warning level 2 was used: -Wimplicit-fallthrough=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: usb: asix: allow optionally getting mac address from device tree

For Embedded use where e.g. AX88772B chips may be used without external
EEPROMs the boot loader may choose to pass the MAC address to be used
via device tree. Therefore, allow for optionally getting the MAC
address from device tree data e.g. as follows (excerpt from a T30 based
board, local-mac-address to be filled in by boot loader):

/* EHCI instance 1: USB2_DP/N -> AX88772B */
usb@7d004000 {
status = "okay";
#address-cells = <1>;
#size-cells = <0>;
asix@1 {
reg = <1>;
local-mac-address = [00 00 00 00 00 00];
};
};

Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: act_pedit: fix possible memory leak in tcf_pedit_init()

'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init()
but not all of the error handle path free it, this may cause memory
leak. This patch fix it.

Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-iproute2-isolated-port-and-selftests'

Nikolay Aleksandrov says:

====================
bridge: iproute2 isolated port and selftests

Add support to iproute2 for port isolation config and selftests for it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: test for bridge port isolation

This test checks if the bridge port isolation feature works as expected
by performing ping/ping6 tests between hosts that are isolated (should
not work) and between an isolated and non-isolated hosts (should work).
Same test is performed for flooding from and to isolated and
non-isolated ports.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: lib: extract ping and ping6 so they can be reused

Extract ping and ping6 command execution so the return value can be
checked by the caller, this is needed for port isolation tests that are
intended to fail.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'vhost_net-Avoid-vq-kicks-during-busyloop'

Toshiaki Makita says:

====================
vhost_net: Avoid vq kicks during busyloop

Under heavy load vhost tx busypoll tend not to suppress vq kicks, which
causes poor guest tx performance. The detailed scenario is described in
commitlog of patch 2.
Rx seems not to have that serious problem, but for consistency I made a
similar change on rx to avoid rx wakeups (patch 3).
Additionary patch 4 is to avoid rx kicks under heavy load during
busypoll.

Tx performance is greatly improved by this change. I don't see notable
performance change on rx with this series though.

Performance numbers (tx):

- Bulk transfer from guest to external physical server.
    [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
- Set 10us busypoll.
- Guest disables checksum and TSO because of host XDP.
- Measured single flow Mbps by netperf, and kicks by perf kvm stat
  (EPT_MISCONFIG event).

                            Before              After
                          Mbps  kicks/s      Mbps  kicks/s
UDP_STREAM 1472byte              247758                 27
                Send   3645.37            6958.10
                Recv   3588.56            6958.10
              1byte                9865                 37
                Send      4.34               5.43
                Recv      4.17               5.26
TCP_STREAM             8801.03    45794   9592.77     2884

v2:
- Split patches into 3 parts (renaming variables, tx-kick fix, rx-wakeup
  fix).
- Avoid rx-kicks too (patch 4).
- Don't memorize endtime as it is not needed for now.
====================

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_net: Avoid rx vring kicks during busyloop

We may run out of avail rx ring descriptor under heavy load but busypoll
did not detect it so busypoll may have exited prematurely. Avoid this by
checking rx ring full during busypoll.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_net: Avoid rx queue wake-ups during busypoll

We may run handle_rx() while rx work is queued. For example a packet can
push the rx work during the window before handle_rx calls
vhost_net_disable_vq().
In that case busypoll immediately exits due to vhost_has_work()
condition and enables vq again. This can lead to another unnecessary rx
wake-ups, so poll rx work instead of enabling the vq.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_net: Avoid tx vring kicks during busyloop

Under heavy load vhost busypoll may run without suppressing
notification. For example tx zerocopy callback can push tx work while
handle_tx() is running, then busyloop exits due to vhost_has_work()
condition and enables notification but immediately reenters handle_tx()
because the pushed work was tx. In this case handle_tx() tries to
disable notification again, but when using event_idx it by design
cannot. Then busyloop will run without suppressing notification.
Another example is the case where handle_tx() tries to enable
notification but avail idx is advanced so disables it again. This case
also leads to the same situation with event_idx.

The problem is that once we enter this situation busyloop does not work
under heavy load for considerable amount of time, because notification
is likely to happen during busyloop and handle_tx() immediately enables
notification after notification happens. Specifically busyloop detects
notification by vhost_has_work() and then handle_tx() calls
vhost_enable_notify(). Because the detected work was the tx work, it
enters handle_tx(), and enters busyloop without suppression again.
This is likely to be repeated, so with event_idx we are almost not able
to suppress notification in this case.

To fix this, poll the work instead of enabling notification when
busypoll is interrupted by something. IMHO vhost_has_work() is kind of
interruption rather than a signal to completely cancel the busypoll, so
let's run busypoll after the necessary work is done.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vhost_net: Rename local variables in vhost_net_rx_peek_head_len

So we can easily see which variable is for which, tx or rx.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:sched: add action inheritdsfield to skbedit

The new action inheritdsfield copies the field DS of
IPv4 and IPv6 packets into skb->priority. This enables
later classification of packets based on the DS field.

v5:
*Update the drop counter for TC_ACT_SHOT

v4:
*Not allow setting flags other than the expected ones.

*Allow dumping the pure flags.

v3:
*Use optional flags, so that it won't break old versions of tc.

*Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

v2:
*Fix the style issue

*Move the code from skbmod to skbedit

Original idea by Jamal Hadi Salim <jhs@mojatatu.com>

Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
Reviewed-by: Michel Machado <michel@digirati.com.br>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'More-mirror-to-gretap-tests-with-bridge-in-UL'

Petr Machata says:

====================
More mirror-to-gretap tests with bridge in UL

This patchset adds two more tests where the mirror-to-gretap has a
bridge in underlay packet path, without a VLAN above or below that
bridge.

In patch #1, a non-VLAN-filtering bridge is tested.

In patch #2, a VLAN-filtering bridge is tested.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q

Test for "tc action mirred egress mirror" that mirrors to gretap when
the underlay route points at a VLAN-aware bridge (802.1q).

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d

Test for "tc action mirred egress mirror" that mirrors to gretap when
the underlay route points at a VLAN-unaware bridge (802.1d).

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Handle-multiple-received-packets-at-each-stage'

Edward Cree says:

====================
Handle multiple received packets at each stage

This patch series adds the capability for the network stack to receive a
list of packets and process them as a unit, rather than handling each
packet singly in sequence.  This is done by factoring out the existing
datapath code at each layer and wrapping it in list handling code.

The motivation for this change is twofold:
* Instruction cache locality.  Currently, running the entire network
  stack receive path on a packet involves more code than will fit in the
  lowest-level icache, meaning that when the next packet is handled, the
  code has to be reloaded from more distant caches.  By handling packets
  in "row-major order", we ensure that the code at each layer is hot for
  most of the list.  (There is a corresponding downside in _data_ cache
  locality, since we are now touching every packet at every layer, but in
  practice there is easily enough room in dcache to hold one cacheline of
  each of the 64 packets in a NAPI poll.)
* Reduction of indirect calls.  Owing to Spectre mitigations, indirect
  function calls are now more expensive than ever; they are also heavily
  used in the network stack's architecture (see [1]).  By replacing 64
  indirect calls to the next-layer per-packet function with a single
  indirect call to the next-layer list function, we can save CPU cycles.

Drivers pass an SKB list to the stack at the end of the NAPI poll; this
gives a natural batch size (the NAPI poll weight) and avoids waiting at
the software level for further packets to make a larger batch (which
would add latency).  It also means that the batch size is automatically
tuned by the existing interrupt moderation mechanism.
The stack then runs each layer of processing over all the packets in the
list before proceeding to the next layer.  Where the 'next layer' (or
the context in which it must run) differs among the packets, the stack
splits the list; this 'late demux' means that packets which differ only
in later headers (e.g. same L2/L3 but different L4) can traverse the
early part of the stack together.
Also, where the next layer is not (yet) list-aware, the stack can revert
to calling the rest of the stack in a loop; this allows gradual/creeping
listification, with no 'flag day' patch needed to listify everything.

Patches 1-2 simply place received packets on a list during the event
processing loop on the sfc EF10 architecture, then call the normal stack
for each packet singly at the end of the NAPI poll.  (Analogues of patch
#2 for other NIC drivers should be fairly straightforward.)
Patches 3-9 extend the list processing as far as the IP receive handler.

Patches 1-2 alone give about a 10% improvement in packet rate in the
baseline test; adding patches 3-9 raises this to around 25%.

Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
packets and a single core to handle interrupts on the RX side; this was
in order to measure as simply as possible the packet rate handled by a
single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
setup was tuned for maximum reproducibility, rather than raw performance.
Full details and more results (both with and without retpolines) from a
previous version of the patch series are presented in [2].

The baseline test uses four streams, and multiple RXQs all bound to a
single CPU (the netperf binary is bound to a neighbouring CPU).  These
tests were run with retpolines.
net-next: 6.91 Mb/s (datum)
after 9: 8.46 Mb/s (+22.5%)
Note however that these results are not robust; changes in the parameters
of the test sometimes shrink the gain to single-digit percentages.  For
instance, when using only a single RXQ, only a 4% gain was seen.

One test variation was the use of software filtering/firewall rules.
Adding a single iptables rule (UDP port drop on a port range not matching
the test traffic), thus making the netfilter hook have work to do,
reduced baseline performance but showed a similar gain from the patches:
net-next: 5.02 Mb/s (datum)
after 9: 6.78 Mb/s (+35.1%)

Similarly, testing with a set of TC flower filters (kindly supplied by
Cong Wang) gave the following:
net-next: 6.83 Mb/s (datum)
after 9: 8.86 Mb/s (+29.7%)

These data suggest that the batching approach remains effective in the
presence of software switching rules, and perhaps even improves the
performance of those rules by allowing them and their codepaths to stay
in cache between packets.

Changes from v3:
* Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).

Changes from v2:
* Used standard list handling (and skb->list) instead of the skb-queue
  functions (that use skb->next, skb->prev).
  - As part of this, changed from a "dequeue, process, enqueue" model to
    using list_for_each_safe, list_del, and (new) list_cut_before.
* Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
  Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
* Removed patches to Generic XDP, since they were producing no benefit.
  I may revisit them later.
* Removed RFC tags.

Changes from v1:
* Rebased across 2 years' net-next movement (surprisingly straightforward).
  - Added Generic XDP handling to netif_receive_skb_list_internal()
  - Dealt with changes to PFMEMALLOC setting APIs
* General cleanup of code and comments.
* Skipped function calls for empty lists at various points in the stack
  (patch #9).
* Added listified Generic XDP handling (patches 10-12), though it doesn't
  seem to help (see above).
* Extended testing to cover software firewalls / netfilter etc.

[1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
[2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: don't bother calling list RX functions on empty lists

Generally the check should be very cheap, as the sk_buff_head is in cache.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: listify ip_rcv_finish

ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
demux or route lookup. The last step, calling dst_input(skb), is left to
the caller; in the listified case, we split to form sublists with a common
dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
to struct dst_entry.

Early demux is an indirect call based on iph->protocol; this is another
opportunity for listification which is not taken here (it would require
slicing up ip_rcv_finish_core() to allow splitting on protocol changes).

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: listified version of ip_rcv

Also involved adding a way to run a netfilter hook over a list of packets.
Rather than attempting to make netfilter know about lists (which would be
a major project in itself) we just let it call the regular okfn (in this
case ip_rcv_finish()) for any packets it steals, and have it give us back
a list of packets it's synchronously accepted (which normally NF_HOOK
would automatically call okfn() on, but we want to be able to potentially
pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
packet (see nf_hook_entry_hookfn()), but again, changing that can be left
for future work.

There is potential for out-of-order receives if the netfilter hook ends up
synchronously stealing packets, as they will be processed before any
accepts earlier in the list. However, it was already possible for an
asynchronous accept to cause out-of-order receives, so presumably this is
considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: propagate SKB lists through packet_type lookup

__netif_receive_skb_core() does a depressingly large amount of per-packet
work that can't easily be listified, because the another_round looping
makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
* Hardware devices generally don't have an rx_handler
* Unless you're tcpdumping or something, there is usually only one ptype
* VLAN processing comes before the protocol ptype lookup, so doesn't force
a pt_prev deliver
so normally, __netif_receive_skb_core() will run straight through and pass
back the one ptype found in ptype_base[hash of skb->protocol].

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: another layer of lists, around PF_MEMALLOC skb handling

First example of a layer splitting the list (rather than merely taking
individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
but cuts on the other side of the given entry.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: Another step of skb receive list processing

netif_receive_skb_list_internal() now processes a list and hands it
on to the next function.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: unwrap skb list receive slightly further

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: batch up RX delivery

Improves packet rate of 1-byte UDP receives by up to 10%.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: trivial netif_receive_skb_list() entry point

Just calls netif_receive_skb() in a loop.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sctp-fully-support-for-dscp-and-flowlabel-per-transport'

Xin Long says:

====================
sctp: fully support for dscp and flowlabel per transport

Now dscp and flowlabel are set from sock when sending the packets,
but being multi-homing, sctp also supports for dscp and flowlabel
per transport, which is described in section 8.1.12 in RFC6458.

v1->v2:
  - define ip_queue_xmit as inline in net/ip.h, instead of exporting
    it in Patch 1/5 according to David's suggestion.
  - fix the param len check in sctp_s/getsockopt_peer_addr_params()
    in Patch 3/5 to guarantee that an old app built with old kernel
    headers could work on the newer kernel per Marcelo's point.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: check for ipv6_pinfo legal sndflow with flowlabel in sctp_v6_get_dst

The transport with illegal flowlabel should not be allowed to send
packets. Other transport protocols already denies this.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add support for setting flowlabel when adding a transport

Struct sockaddr_in6 has the member sin6_flowinfo that includes the
ipv6 flowlabel, it should also support for setting flowlabel when
adding a transport whose ipaddr is from userspace.

Note that addrinfo in sctp_sendmsg is using struct in6_addr for
the secondary addrs, which doesn't contain sin6_flowinfo, and
it needs to copy sin6_flowinfo from the primary addr.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams

spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in
this patch so that users could set sctp_sock/asoc/transport dscp
and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by
SCTP_PEER_ADDR_PARAMS , as described section 8.1.12 in RFC6458.

As said in last patch, it uses '| 0x100000' or '|0x1' to mark
flowlabel or dscp is set, so that their values could be set
to 0.

Note that to guarantee that an old app built with old kernel
headers could work on the newer kernel, the param's check in
sctp_g/setsockopt_peer_addr_params() is also improved, which
follows the way that sctp_g/setsockopt_delayed_ack() or some
other sockopts' process that accept two types of params does.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: add support for dscp and flowlabel per transport

Like some other per transport params, flowlabel and dscp are added
in transport, asoc and sctp_sock. By default, transport sets its
value from asoc's, and asoc does it from sctp_sock. flowlabel
only works for ipv6 transport.

Other than that they need to be passed down in sctp_xmit, flow4/6
also needs to set them before looking up route in get_dst.

Note that it uses '& 0x100000' to check if flowlabel is set and
'& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
so that they could be set to 0 by sockopt in next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: add __ip_queue_xmit() that supports tos param

This patch introduces __ip_queue_xmit(), through which the callers
can pass tos param into it without having to set inet->tos. For
ipv6, ip6_xmit() already allows passing tclass parameter.

It's needed when some transport protocol doesn't use inet->tos,
like sctp's per transport dscp, which will be added in next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add Vitesse VSC73xx DSA router driver

This adds a DSA driver for:

Vitesse VSC7385 SparX-G5 5-port Integrated Gigabit Ethernet Switch
Vitesse VSC7388 SparX-G8 8-port Integrated Gigabit Ethernet Switch
Vitesse VSC7395 SparX-G5e 5+1-port Integrated Gigabit Ethernet Switch
Vitesse VSC7398 SparX-G8e 8-port Integrated Gigabit Ethernet Switch

These switches have a built-in 8051 CPU and can download and execute
firmware in this CPU. They can also be configured to use an external
CPU handling the switch in a memory-mapped manner by connecting to
that external CPU's memory bus.

This driver (currently) only takes control of the switch chip over
SPI and configures it to route packages around when connected to a
CPU port. The chip has embedded PHYs and VLAN support so we model it
using DSA as a best fit so we can easily add VLAN support and maybe
later also exploit the internal frame header to get more direct
control over the switch.

The four built-in GPIO lines are exposed using a standard GPIO chip.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: vitesse: Add support for VSC73xx

The VSC7385, VSC7388, VSC7395 and VSC7398 are integrated
switch/router chips for 5+1 or 8-port switches/routers. When
managed directly by Linux using DSA we need to do a special
set-up "dance" on the PHY. Unfortunately these sequences
switches the PHY to undocumented pages named 2a30 and 52b6
and does undocumented things. It is described by these opaque
sequences also in the reference manual. This is a best
effort to integrate it anyways.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add DT bindings for Vitesse VSC73xx switches

This adds the device tree bindings for the Vitesse VSC73xx
switches. We also add the vendor name for Vitesse.

Cc: devicetree@vger.kernel.org
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-07-03

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various improvements to bpftool and libbpf, that is, bpftool build
   speed improvements, missing BPF program types added for detection
   by section name, ability to load programs from '.text' section is
   made to work again, and better bash completion handling, from Jakub.

2) Improvements to nfp JIT's map read handling which allows for optimizing
   memcpy from map to packet, from Jiong.

3) New BPF sample is added which demonstrates XDP in combination with
   bpf_perf_event_output() helper to sample packets on all CPUs, from Toke.

4) Add a new BPF kselftest case for tracking connect(2) BPF hooks
   infrastructure in combination with TFO, from Andrey.

5) Extend the XDP/BPF xdp_rxq_info sample code with a cmdline option to
   read payload from packet data in order to use it for benchmarking.
   Also for '--action XDP_TX' option implement swapping of MAC addresses
   to avoid drops on some hardware seen during testing, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'aquantia-various-ethtool-ops-implementation'

Igor Russkikh says:

====================
net: aquantia: various ethtool ops implementation

In this patchset Anton Mikaev and I added some useful ethtool operations:
- ring size changes
- link renegotioation
- flow control management

The patch also improves init/deinit sequence.

V3 changes:
- After review and analysis it is clear that rtnl lock (which is
captured by default on ethtool ops) is enough to secure possible
overlapping of dev open/close. Thus, just dropping internal mutex.

V2 changes:
- using mutex to secure simultaneous dev close/open
- using state var to store/restore dev state
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: bump driver version

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Add renegotiate ethtool operation support

Adds ethtool -r|--negotiate operation support. It triggers special
control bit on FW interface causing FW to restart link negotiation.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Anton Mikaev <amikaev@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Implement rx/tx flow control ethtools callback

Runtime change of pause frame configuration (rx/tx flow control)
via ethtool.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Improve adapter init/deinit logic

We now pass link drop status to FW on init/deinit. This is required
to inform FW that driver took/released a control on link.
FW then will manage its own state and device power profile based
on this information. To improve management we remove mpi_set
function which ambiguously took both state and speed parameters.

Deinit callback is now a part of FW ops, as it actually manages the FW.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: aquantia: Ethtool based ring size configuration

Implemented ring size setup, min/max validation and reconfiguration in
runtime.

Signed-off-by: Anton Mikaev <amikaev@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac_tc: use 64-bit arithmetic instead of 32-bit

Add suffix UL to constant 1024 in order to give the compiler complete
information about the proper arithmetic to use. Notice that this
constant is used in a context that expects an expression of type
u64 (64 bits, unsigned) and following expressions are currently
being evaluated using 32-bit arithmetic:

qopt->idleslope * 1024 * ptr
qopt->hicredit * 1024 * 8
qopt->locredit * 1024 * 8

Addresses-Coverity-ID: 1470246 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1470248 ("Unintentional integer overflow")
Addresses-Coverity-ID: 1470249 ("Unintentional integer overflow")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: DP83TC811: Fix SGMII enable/disable

If SGMII was selected in the DT then the device should
write the SGMII enable bit.

If SGMII is not selected in the DT then the SGMII bit
should be disabled.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: DP83TC811: Add INT_STAT3

Add INT_STAT3 interrupt setting and clearing
support.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge ra./pub/scm/linux/kernel/git/davem/net

Simple overlapping changes in stmmac driver.

Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-next' of git://git./linux/kernel/git/shli/md

Pull MD fixes from Shaohua Li:
"Two small fixes for MD:

   - an error handling fix from me

   - a recover bug fix for raid10 from BingJing"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md/raid10: fix that replacement cannot complete recovery after reassemble
  MD: cleanup resources in failure

Merge tag 'for-linus' of git://github.com/stffrdhrn/linux

Pull OpenRISC fixes from Stafford Horne:
"Two fixes for issues which were breaking OpenRISC boot:

   - Fix bug in __pte_free_tlb() exposed in 4.18 by Matthew Wilcox's
     page table flag addition.

   - Fix issue booting on real hardware if delay slot detection
     emulation is disabled"

* tag 'for-linus' of git://github.com/stffrdhrn/linux:
  openrisc: entry: Fix delay slot exception detection
  openrisc: Call destructor during __pte_free_tlb

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Verify netlink attributes properly in nf_queue, from Eric Dumazet.

2) Need to bump memory lock rlimit for test_sockmap bpf test, from
    Yonghong Song.

3) Fix VLAN handling in lan78xx driver, from Dave Stevenson.

4) Fix uninitialized read in nf_log, from Jann Horn.

5) Fix raw command length parsing in mlx5, from Alex Vesker.

6) Cleanup loopback RDS connections upon netns deletion, from Sowmini
    Varadhan.

7) Fix regressions in FIB rule matching during create, from Jason A.
    Donenfeld and Roopa Prabhu.

8) Fix mpls ether type detection in nfp, from Pieter Jansen van Vuuren.

9) More bpfilter build fixes/adjustments from Masahiro Yamada.

10) Fix XDP_{TX,REDIRECT} flushing in various drivers, from Jesper
    Dangaard Brouer.

11) fib_tests.sh file permissions were broken, from Shuah Khan.

12) Make sure BH/preemption is disabled in data path of mac80211, from
    Denis Kenzior.

13) Don't ignore nla_parse_nested() return values in nl80211, from
    Johannes berg.

14) Properly account sock objects ot kmemcg, from Shakeel Butt.

15) Adjustments to setting bpf program permissions to read-only, from
    Daniel Borkmann.

16) TCP Fast Open key endianness was broken, it always took on the host
    endiannness. Whoops. Explicitly make it little endian. From Yuching
    Cheng.

17) Fix prefix route setting for link local addresses in ipv6, from
    David Ahern.

18) Potential Spectre v1 in zatm driver, from Gustavo A. R. Silva.

19) Various bpf sockmap fixes, from John Fastabend.

20) Use after free for GRO with ESP, from Sabrina Dubroca.

21) Passing bogus flags to crypto_alloc_shash() in ipv6 SR code, from
    Eric Biggers.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
  qede: Adverstise software timestamp caps when PHC is not available.
  qed: Fix use of incorrect size in memcpy call.
  qed: Fix setting of incorrect eswitch mode.
  qed: Limit msix vectors in kdump kernel to the minimum required count.
  ipvlan: call dev_change_flags when ipvlan mode is reset
  ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
  net: fix use-after-free in GRO with ESP
  tcp: prevent bogus FRTO undos with non-SACK flows
  bpf: sockhash, add release routine
  bpf: sockhash fix omitted bucket lock in sock_close
  bpf: sockmap, fix smap_list_map_remove when psock is in many maps
  bpf: sockmap, fix crash when ipv6 sock is added
  net: fib_rules: bring back rule_exists to match rule during add
  hv_netvsc: split sub-channel setup into async and sync
  net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
  atm: zatm: Fix potential Spectre v1
  s390/qeth: consistently re-enable device features
  s390/qeth: don't clobber buffer on async TX completion
  s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]
  s390/qeth: fix race when setting MAC address
  ...

Merge branch 'hns3-a-few-code-improvements'

Peng Li says:

====================
net: hns3: a few code improvements

This patchset removes some redundant code and fixes a few code
stylistic issues from internal concentrated review,
no functional changes introduced.

---
Change log:
V1 -> V2:
1, remove a patch according to the comment reported by David Miller.
---
====================

Signed-off-by: David S. Miller <davem@davemloft.net>