openwrt/staging/blogic.git
5 years agonet: hns3: cleanup some print format warning
Guojia Liao [Thu, 31 Oct 2019 11:23:23 +0000 (19:23 +0800)]
net: hns3: cleanup some print format warning

Using '%d' for printing type unsigned int or '%u' for
type int would cause static tools to give false warnings,
so this patch cleanups this warning by using the suitable
format specifier of the type of variable.

BTW, modifies the type of some variables and macro to
synchronize with their usage.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add or modify some comments
Guangbin Huang [Thu, 31 Oct 2019 11:23:22 +0000 (19:23 +0800)]
net: hns3: add or modify some comments

This patch makes the comment for macro HCLGE_MBX_GET_VF_FLR_STATUS
more correct, and adds comments in some place to make the code more
readable.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: optimize local variable initialization
Guangbin Huang [Thu, 31 Oct 2019 11:23:21 +0000 (19:23 +0800)]
net: hns3: optimize local variable initialization

The variable tx_ring is unnecessary to be initialized as it will be set
before used, and the variable rst_cnt is better to be initialized when
declaration for simplification.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: cleanup a format-truncation warning
Guojia Liao [Thu, 31 Oct 2019 11:23:20 +0000 (19:23 +0800)]
net: hns3: cleanup a format-truncation warning

In hns3_nic_init_irq(), when '*_int_idx' has more than 9 digits
and the length of netdev's name is IFNAMSIZ, the total length
of final name will be bigger the HNAE3_INT_NAME_LEN - 1, even
though '*_int_idx' will never have such large value, but the
compiler gives a format-truncation warning for this case.

So this patch just enlarges the length to avoid this warning.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: cleanup some coding style issues
Guangbin Huang [Thu, 31 Oct 2019 11:23:19 +0000 (19:23 +0800)]
net: hns3: cleanup some coding style issues

To unify code style and make code simpler, this patch modifies
some code, deletes unnecessary blank lines and {}, changes
location of code, and so on.

No functional change.

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: cleanup some magic numbers
Guojia Liao [Thu, 31 Oct 2019 11:23:18 +0000 (19:23 +0800)]
net: hns3: cleanup some magic numbers

To make the code more readable, this patch replaces
some magic numbers with macro or sizeof operation.

Also uses macro lower_32_bits and upper_32_bits to
get bits 0-31 and 32-63 of a number, instead of
using type conversion and '>>' operation.

No functional change.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: add struct netdev_queue debug info for TX timeout
Yunsheng Lin [Thu, 31 Oct 2019 11:23:17 +0000 (19:23 +0800)]
net: hns3: add struct netdev_queue debug info for TX timeout

When there is a TX timeout, we can tell if the driver or stack
has stopped the queue by looking at state field, and when has
the last packet transmited by looking at trans_start field.

So this patch prints these two field in the
hns3_get_tx_timeo_queue_info().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns3: dump some debug information when reset fail
Huazhong Tan [Thu, 31 Oct 2019 11:23:16 +0000 (19:23 +0800)]
net: hns3: dump some debug information when reset fail

When reset fails, there is some information that will help for
finding out why does reset fail. and removes an unused
core_rst_cnt field in struct hclge_rst_stats.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'bnxt_en-Add-OP-TEE-based-bnxt-f-w-manager'
David S. Miller [Thu, 31 Oct 2019 18:00:45 +0000 (11:00 -0700)]
Merge branch 'bnxt_en-Add-OP-TEE-based-bnxt-f-w-manager'

Sheetal Tigadoli says:

====================
bnxt_en: Add OP-TEE based bnxt f/w manager

This patch series adds support for TEE based BNXT firmware
management module and the driver changes to invoke OP-TEE
APIs to fastboot firmware and to collect crash dump.

Changes from v4:
 - update Kconfig to reflect dependency on TEE driver
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobnxt_en: Add support to collect crash dump via ethtool
Vasundhara Volam [Thu, 31 Oct 2019 10:08:52 +0000 (15:38 +0530)]
bnxt_en: Add support to collect crash dump via ethtool

Driver supports 2 types of core dumps.

1. Live dump - Firmware dump when system is up and running.
2. Crash dump - Dump which is collected during firmware crash
                that can be retrieved after recovery.
Crash dump is currently supported only on specific 58800 chips
which can be retrieved using OP-TEE API only, as firmware cannot
access this region directly.

User needs to set the dump flag using following command before
initiating the dump collection:

    $ ethtool -W|--set-dump eth0 N

Where N is "0" for live dump and "1" for crash dump

Command to collect the dump after setting the flag:

    $ ethtool -w eth0 data Filename

v3: Modify set_dump to support even when CONFIG_TEE_BNXT_FW=n.
Also change log message to netdev_info().

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobnxt_en: Add support to invoke OP-TEE API to reset firmware
Vasundhara Volam [Thu, 31 Oct 2019 10:08:51 +0000 (15:38 +0530)]
bnxt_en: Add support to invoke OP-TEE API to reset firmware

In error recovery process when firmware indicates that it is
completely down, initiate a firmware reset by calling OP-TEE API.

Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agofirmware: broadcom: add OP-TEE based BNXT f/w manager
Vikas Gupta [Thu, 31 Oct 2019 10:08:50 +0000 (15:38 +0530)]
firmware: broadcom: add OP-TEE based BNXT f/w manager

This driver registers on TEE bus to interact with OP-TEE based
BNXT firmware management modules

Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Sheetal Tigadoli <sheetal.tigadoli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'mlxsw-Make-port-split-code-more-generic'
David S. Miller [Thu, 31 Oct 2019 17:54:47 +0000 (10:54 -0700)]
Merge branch 'mlxsw-Make-port-split-code-more-generic'

Ido Schimmel says:

====================
mlxsw: Make port split code more generic

Jiri says:

Currently, we assume some limitations and constant values which are not
applicable for Spectrum-3 which has 8 lanes ports (instead of previous 4
lanes).

This patch does 2 things:

1) Generalizes the code to not use constants so it can work for 4, 8 and
   possibly 16 lanes.

2) Enforces some assumptions we had in the code but did not check.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Generalize split count check
Jiri Pirko [Thu, 31 Oct 2019 09:42:21 +0000 (11:42 +0200)]
mlxsw: spectrum: Generalize split count check

Make the check generic for any possible value, not only 2 and 4.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Iterate over all ports in gap during unsplit create
Jiri Pirko [Thu, 31 Oct 2019 09:42:20 +0000 (11:42 +0200)]
mlxsw: spectrum: Iterate over all ports in gap during unsplit create

During recreation of original unsplit ports, just simply iterate over
the whole gap and recreate whatever originally existed.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Fix base port get for split count 4 and 8
Jiri Pirko [Thu, 31 Oct 2019 09:42:19 +0000 (11:42 +0200)]
mlxsw: spectrum: Fix base port get for split count 4 and 8

The current code considers only split by 2 or 4. Make the base port
getting generic and allow split by 8 to be handled correctly. Generalize
the used port checks as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Use port_module_max_width to compute base port index
Jiri Pirko [Thu, 31 Oct 2019 09:42:18 +0000 (11:42 +0200)]
mlxsw: spectrum: Use port_module_max_width to compute base port index

Instead of using constant value, use port_module_max_width which is
aligned with the cluster size.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Remember split base local port and use it in unsplit
Jiri Pirko [Thu, 31 Oct 2019 09:42:17 +0000 (11:42 +0200)]
mlxsw: spectrum: Remember split base local port and use it in unsplit

Don't compute the original base local port during unsplit, rather
remember it in mlxsw_sp_port structure during split port creation.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port
Jiri Pirko [Thu, 31 Oct 2019 09:42:16 +0000 (11:42 +0200)]
mlxsw: spectrum: Introduce resource for getting offset of 4 lanes split port

In Spectrum-3 the modules have 8 lanes, so split by count 2 results in
two split ports each of 4 lanes. Add a resource that can be used to
obtain local port offset in that case.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Push getting offsets of split ports into a helper
Jiri Pirko [Thu, 31 Oct 2019 09:42:15 +0000 (11:42 +0200)]
mlxsw: spectrum: Push getting offsets of split ports into a helper

Get local port offsets of split port in a separate helper function and
use it in both split and unsplit function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Add sanity checks into module info get
Jiri Pirko [Thu, 31 Oct 2019 09:42:14 +0000 (11:42 +0200)]
mlxsw: spectrum: Add sanity checks into module info get

Driver assumes certain values in the PMLP register. Add checks that
verify that PMLP register provides fitting values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Pass mapping values in port mapping structure
Jiri Pirko [Thu, 31 Oct 2019 09:42:13 +0000 (11:42 +0200)]
mlxsw: spectrum: Pass mapping values in port mapping structure

Pass the port mapping structure down to create, module_map and other
function instead of individual values.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Use mapping of port being split for creating split ports
Jiri Pirko [Thu, 31 Oct 2019 09:42:12 +0000 (11:42 +0200)]
mlxsw: spectrum: Use mapping of port being split for creating split ports

Don't use constant max width value and instead of that, use the actual
width of the port. Also don't pass module value and use the value
stored in the same structure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Replace port_to_module array with array of structs
Jiri Pirko [Thu, 31 Oct 2019 09:42:11 +0000 (11:42 +0200)]
mlxsw: spectrum: Replace port_to_module array with array of structs

Store the initial PMLP register configuration into array of structures
instead of just simple array of module numbers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Distinguish between unsplittable and split port
Jiri Pirko [Thu, 31 Oct 2019 09:42:10 +0000 (11:42 +0200)]
mlxsw: spectrum: Distinguish between unsplittable and split port

Currently when user does split, he is not able to distinguish if the
port cannot be split because it is already split, or because it cannot
be split at all. Add another check for split flag to distinguish this.
Also add check forbidding split when maximal width is 1.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Move max_width check up before count check
Jiri Pirko [Thu, 31 Oct 2019 09:42:09 +0000 (11:42 +0200)]
mlxsw: spectrum: Move max_width check up before count check

The fact that the port cannot be split further should be checked before
checking the count, so move it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Use PMTM register to get max module width
Jiri Pirko [Thu, 31 Oct 2019 09:42:08 +0000 (11:42 +0200)]
mlxsw: spectrum: Use PMTM register to get max module width

Currently the max module width is hard-coded according to ASIC type.
That is not entirely correct, as the max module width might differ
per-board. Use PMTM register to query FW for maximal width of a module.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: reg: Add Port Module Type Mapping Register
Jiri Pirko [Thu, 31 Oct 2019 09:42:07 +0000 (11:42 +0200)]
mlxsw: reg: Add Port Module Type Mapping Register

The PMTM allows query or configuration of module types.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits
Jiri Pirko [Thu, 31 Oct 2019 09:42:06 +0000 (11:42 +0200)]
mlxsw: reg: Extend PMLP tx/rx lane value size to 4 bits

The tx/rx lane fields got extended to 4 bits, update the reg field
description accordingly.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agocxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()'
Christophe JAILLET [Thu, 31 Oct 2019 05:53:45 +0000 (06:53 +0100)]
cxgb4/l2t: Simplify 't4_l2e_free()' and '_t4_l2e_free()'

Use '__skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'
David S. Miller [Thu, 31 Oct 2019 01:07:51 +0000 (18:07 -0700)]
Merge branch 'Control-action-percpu-counters-allocation-by-netlink-flag'

Vlad Buslov says:

====================
Control action percpu counters allocation by netlink flag

Currently, significant fraction of CPU time during TC filter allocation
is spent in percpu allocator. Moreover, percpu allocator is protected
with single global mutex which negates any potential to improve its
performance by means of recent developments in TC filter update API that
removed rtnl lock for some Qdiscs and classifiers. In order to
significantly improve filter update rate and reduce memory usage we
would like to allow users to skip percpu counters allocation for
specific action if they don't expect high traffic rate hitting the
action, which is a reasonable expectation for hardware-offloaded setup.
In that case any potential gains to software fast-path performance
gained by usage of percpu-allocated counters compared to regular integer
counters protected by spinlock are not important, but amount of
additional CPU and memory consumed by them is significant.

In order to allow configuring action counters allocation type at
runtime, implement following changes:

- Implement helper functions to update the action counters and use them
  in affected actions instead of updating counters directly. This steps
  abstracts actions implementation from counter types that are being
  used for particular action instance at runtime.

- Modify the new helpers to use percpu counters if they were allocated
  during action initialization and use regular counters otherwise.

- Extend action UAPI TCA_ACT space with TCA_ACT_FLAGS field. Add
  TCA_ACT_FLAGS_NO_PERCPU_STATS action flag and update
  hardware-offloaded actions to not allocate percpu counters when the
  flag is set.

With this changes users that prefer action update slow-path speed over
software fast-path speed can dynamically request actions to skip percpu
counters allocation without affecting other users.

Now, lets look at actual performance gains provided by this change.
Simple test is used to measure insertion rate - iproute2 TC is executed
in parallel by xargs in batch mode, its total execution time is measured
by shell builtin "time" command. The command runs 20 concurrent tc
instances, each with its own batch file with 100k rules:

$ time ls add* | xargs -n 1 -P 20 sudo tc -b

Two main rule profiles are tested. First is simple L2 flower classifier
with single gact drop action. The configuration is chosen as worst case
scenario because with single-action rules pressure on percpu allocator
is minimized. Example rule:

filter add dev ens1f0 protocol ip ingress prio 1 handle 1 flower skip_hw
    src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 action drop

Second profile is typical real-world scenario that uses flower
classifier with some L2-4 fields and two actions (tunnel_key+mirred).
Example rule:

filter add dev ens1f0_0 protocol ip ingress prio 1 handle 1 flower
    skip_hw src_mac e4:11:0:0:0:0 dst_mac e4:12:0:0:0:0 src_ip
    192.168.111.1 dst_ip 192.168.111.2 ip_proto udp dst_port 1 src_port
    1 action tunnel_key set id 1 src_ip 2.2.2.2 dst_ip 2.2.2.3 dst_port
    4789 action mirred egress redirect dev vxlan1

 Profile           |        percpu |     no_percpu | X improvement
                   | (k rules/sec) | (k rules/sec) |
-------------------+---------------+---------------+---------------
 Gact drop         |           203 |           259 |          1.28
 tunnel_key+mirred |            92 |           204 |          2.22

For simple drop action removing percpu allocation leads to ~25%
insertion rate improvement. Perf profiles highlights the bottlenecks.

Perf profile of run with percpu allocation (gact drop):

+ 89.11% 0.48% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 88.58% 0.04% tc [kernel.vmlinux] [k] do_syscall_64
+ 87.50% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 86.96% 0.04% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 86.85% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 86.60% 0.05% tc [kernel.vmlinux] [k] sock_sendmsg
+ 86.55% 0.12% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 86.04% 0.13% tc [kernel.vmlinux] [k] netlink_unicast
+ 85.42% 0.03% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 84.68% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 84.56% 0.24% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.73% 0.65% tc [cls_flower] [k] fl_change
+ 71.30% 0.03% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 71.27% 0.13% tc [kernel.vmlinux] [k] tcf_action_init
+ 71.06% 0.01% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 70.41% 0.04% tc [act_gact] [k] tcf_gact_init
+ 53.59% 1.21% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 52.34% 0.34% tc [kernel.vmlinux] [k] tcf_idr_create
- 51.23% 2.17% tc [kernel.vmlinux] [k] pcpu_alloc
  - 49.05% pcpu_alloc
    + 39.35% __mutex_lock.isra.0 4.99% memset_erms
    + 2.16% pcpu_alloc_area
  + 2.17% __libc_sendmsg
+ 45.89% 44.33% tc [kernel.vmlinux] [k] osq_lock
+ 9.94% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 7.76% 0.00% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 6.50% 0.03% tc [kernel.vmlinux] [k] tfilter_notify
+ 6.24% 6.11% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.73% 5.32% tc [kernel.vmlinux] [k] memset_erms
+ 5.31% 0.18% tc [kernel.vmlinux] [k] tcf_fill_node

Here bottleneck is clearly in pcpu_alloc() function that takes more than
half CPU time, which is mostly wasted busy-waiting for internal percpu
allocator global lock.

With percpu allocation removed (gact drop):

+ 87.50% 0.51% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 86.94% 0.07% tc [kernel.vmlinux] [k] do_syscall_64
+ 85.75% 0.04% tc libc-2.29.so [.] __libc_sendmsg
+ 85.00% 0.07% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 84.84% 0.07% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 84.59% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 84.58% 0.14% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.95% 0.12% tc [kernel.vmlinux] [k] netlink_unicast
+ 83.34% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 82.39% 0.12% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 82.16% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 75.13% 0.84% tc [cls_flower] [k] fl_change
+ 69.92% 0.05% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.87% 0.11% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.61% 0.02% tc [kernel.vmlinux] [k] tcf_action_init_1
- 68.80% 0.10% tc [act_gact] [k] tcf_gact_init
  - 68.70% tcf_gact_init
    + 36.08% tcf_idr_check_alloc
    + 31.88% tcf_idr_insert
+ 63.72% 0.58% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 58.80% 56.68% tc [kernel.vmlinux] [k] osq_lock
+ 36.08% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 31.88% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert

The gact actions (like all other actions types) are inserted in single
idr instance protected by global (per namespace) lock that becomes new
bottleneck with such simple rule profile and prevents achieving 2x+
performance increase that can be expected by looking at profiling data
for insertion action with percpu counter.

Perf profile of run with percpu allocation (tunnel_key+mirred):

+ 91.95% 0.21% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 91.74% 0.06% tc [kernel.vmlinux] [k] do_syscall_64
+ 90.74% 0.01% tc libc-2.29.so [.] __libc_sendmsg
+ 90.52% 0.01% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 90.50% 0.04% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 90.41% 0.02% tc [kernel.vmlinux] [k] sock_sendmsg
+ 90.38% 0.04% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 90.10% 0.06% tc [kernel.vmlinux] [k] netlink_unicast
+ 89.76% 0.01% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 89.28% 0.04% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 89.15% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 83.41% 0.33% tc [cls_flower] [k] fl_change
+ 81.17% 0.04% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 81.13% 0.06% tc [kernel.vmlinux] [k] tcf_action_init
+ 81.04% 0.04% tc [kernel.vmlinux] [k] tcf_action_init_1
- 73.59% 2.16% tc [kernel.vmlinux] [k] pcpu_alloc
  - 71.42% pcpu_alloc
    + 61.41% __mutex_lock.isra.0 5.02% memset_erms
    + 2.93% pcpu_alloc_area
  + 2.16% __libc_sendmsg
+ 63.58% 0.17% tc [kernel.vmlinux] [k] tcf_idr_create
+ 63.40% 0.60% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 57.85% 56.38% tc [kernel.vmlinux] [k] osq_lock
+ 46.27% 0.13% tc [act_tunnel_key] [k] tunnel_key_init
+ 34.26% 0.02% tc [act_mirred] [k] tcf_mirred_init
+ 10.99% 0.00% tc [kernel.vmlinux] [k] dst_cache_init
+ 5.32% 5.11% tc [kernel.vmlinux] [k] memset_erms

With two times more actions pressure on percpu allocator doubles, so now
it takes ~74% of CPU execution time.

With percpu allocation removed (tunnel_key+mirred):

+ 86.02% 0.50% tc [kernel.vmlinux] [k] entry_SYSCALL_64
+ 85.51% 0.12% tc [kernel.vmlinux] [k] do_syscall_64
+ 84.40% 0.03% tc libc-2.29.so [.] __libc_sendmsg
+ 83.84% 0.03% tc [kernel.vmlinux] [k] __sys_sendmsg
+ 83.72% 0.01% tc [kernel.vmlinux] [k] ___sys_sendmsg
+ 83.56% 0.01% tc [kernel.vmlinux] [k] sock_sendmsg
+ 83.50% 0.08% tc [kernel.vmlinux] [k] netlink_sendmsg
+ 83.02% 0.17% tc [kernel.vmlinux] [k] netlink_unicast
+ 82.48% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
+ 81.89% 0.11% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
+ 81.71% 0.25% tc [kernel.vmlinux] [k] tc_new_tfilter
+ 73.99% 0.63% tc [cls_flower] [k] fl_change
+ 69.72% 0.00% tc [kernel.vmlinux] [k] tcf_exts_validate
+ 69.72% 0.09% tc [kernel.vmlinux] [k] tcf_action_init
+ 69.53% 0.05% tc [kernel.vmlinux] [k] tcf_action_init_1
+ 53.08% 0.91% tc [kernel.vmlinux] [k] __mutex_lock.isra.0
+ 45.52% 43.99% tc [kernel.vmlinux] [k] osq_lock
- 36.02% 0.21% tc [act_tunnel_key] [k] tunnel_key_init
  - 35.81% tunnel_key_init
    + 15.95% tcf_idr_check_alloc
    + 13.91% tcf_idr_insert
    - 4.70% dst_cache_init
      + 4.68% pcpu_alloc
+ 33.22% 0.04% tc [kernel.vmlinux] [k] tcf_idr_check_alloc
+ 32.34% 0.05% tc [act_mirred] [k] tcf_mirred_init
+ 28.24% 0.01% tc [kernel.vmlinux] [k] tcf_idr_insert
+ 7.79% 0.05% tc [kernel.vmlinux] [k] idr_alloc_u32
+ 7.67% 7.35% tc [kernel.vmlinux] [k] idr_get_free
+ 6.46% 6.22% tc [kernel.vmlinux] [k] mutex_spin_on_owner
+ 5.11% 0.05% tc [kernel.vmlinux] [k] tfilter_notify

With percpu allocation removed insertion rate is increased by ~120%.
Such rule profile scales much better than simple single action because
both types of actions were competing for single lock in percpu
allocator, but not for action idr lock, which is per-action. Note that
percpu allocator is still used by dst_cache in tunnel_key actions and
consumes 4.68% CPU time. Dst_cache seems like good opportunity for
further insertion rate optimization but is not addressed by this change.

Another improvement provided by this change is significantly reduced
memory usage. The test is implemented by sampling "used memory" value
from "vmstat -s" command output. Following table includes memory usage
measurements for same two configurations that were used for measuring
insertion rate:

 Profile           | Mem per rule | Mem per rule no_percpu | Less memory used
                   |         (KB) |                   (KB) |             (KB)
-------------------+--------------+------------------------+------------------
 Gact drop         |         3.91 |                   2.51 |              1.4
 tunnel_key+mirred |         6.73 |                   3.91 |              2.8

Results indicate that memory usage of percpu allocator per action is
~1.4 KB. Note that any measurements of percpu allocator memory usage is
inherently tied to particular setup since memory usage is linear to
number of cores in system. It is to be expected that on current top of
the line servers percpu allocator memory usage will be 2-5x more than on
24 CPUs setup that was used for testing.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Patches applied on top of net-next branch:

commit 2203cbf2c8b58a1e3bef98c47531d431d11639a0 (net-next) Author:
Russell King <rmk+kernel@armlinux.org.uk> Date: Tue Oct 15 11:38:39 2019
+0100

net: sfp: move fwnode parsing into sfp-bus layer

Changes V1 -> V2:

- Include memory measurements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotc-testing: implement tests for new fast_init action flag
Vlad Buslov [Wed, 30 Oct 2019 14:09:07 +0000 (16:09 +0200)]
tc-testing: implement tests for new fast_init action flag

Add basic tests to verify action creation with new fast_init flag for all
actions that support the flag.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: update action implementations to support flags
Vlad Buslov [Wed, 30 Oct 2019 14:09:06 +0000 (16:09 +0200)]
net: sched: update action implementations to support flags

Extend struct tc_action with new "tcfa_flags" field. Set the field in
tcf_idr_create() function and provide new helper
tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
value. Update individual hardware-offloaded actions init() to pass their
"flags" argument to new helper in order to skip percpu stats allocation
when user requested it through flags.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: extend TCA_ACT space with TCA_ACT_FLAGS
Vlad Buslov [Wed, 30 Oct 2019 14:09:05 +0000 (16:09 +0200)]
net: sched: extend TCA_ACT space with TCA_ACT_FLAGS

Extend TCA_ACT space with nla_bitfield32 flags. Add
TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
tcf_action_init_1() and pass resulting value as additional argument to
a_o->init().

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: modify stats helper functions to support regular stats
Vlad Buslov [Wed, 30 Oct 2019 14:09:04 +0000 (16:09 +0200)]
net: sched: modify stats helper functions to support regular stats

Modify stats update helper functions introduced in previous patches in this
series to fallback to regular tc_action->tcfa_{b|q}stats if cpu stats are
not allocated for the action argument. If regular non-percpu allocated
counters are in use, then obtain action tcfa_lock while modifying them.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: don't expose action qstats to skb_tc_reinsert()
Vlad Buslov [Wed, 30 Oct 2019 14:09:03 +0000 (16:09 +0200)]
net: sched: don't expose action qstats to skb_tc_reinsert()

Previous commit introduced helper function for updating qstats and
refactored set of actions to use the helpers, instead of modifying qstats
directly. However, one of the affected action exposes its qstats to
skb_tc_reinsert(), which then modifies it.

Refactor skb_tc_reinsert() to return integer error code and don't increment
overlimit qstats in case of error, and use the returned error code in
tcf_mirred_act() to manually increment the overlimit counter with new
helper function.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: extract qstats update code into functions
Vlad Buslov [Wed, 30 Oct 2019 14:09:02 +0000 (16:09 +0200)]
net: sched: extract qstats update code into functions

Extract common code that increments cpu_qstats counters into standalone act
API functions. Change hardware offloaded actions that use percpu counter
allocation to use the new functions instead of accessing cpu_qstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: extract bstats update code into function
Vlad Buslov [Wed, 30 Oct 2019 14:09:01 +0000 (16:09 +0200)]
net: sched: extract bstats update code into function

Extract common code that increments cpu_bstats counter into standalone act
API function. Change hardware offloaded actions that use percpu counter
allocation to use the new function instead of incrementing cpu_bstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sched: extract common action counters update code into function
Vlad Buslov [Wed, 30 Oct 2019 14:09:00 +0000 (16:09 +0200)]
net: sched: extract common action counters update code into function

Currently, all implementations of tc_action_ops->stats_update() callback
have almost exactly the same implementation of counters update
code (besides gact which also updates drop counter). In order to simplify
support for using both percpu-allocated and regular action counters
depending on run-time flag in following patches, extract action counters
update code into standalone function in act API.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: qrtr: Simplify 'qrtr_tun_release()'
Christophe JAILLET [Wed, 30 Oct 2019 06:36:40 +0000 (07:36 +0100)]
net: qrtr: Simplify 'qrtr_tun_release()'

Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Thu, 31 Oct 2019 00:51:25 +0000 (17:51 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2019-10-29

This series contains updates to e1000e, igb, ixgbe and i40e drivers.

Sasha adds support for Intel client platforms Comet Lake and Tiger Lake
to the e1000e driver.  Also adds a fix for a compiler warning that was
recently introduced, when CONFIG_PM_SLEEP is not defined, so wrap the
code that requires this kernel configuration to be defined.

Alex fixes a potential race condition between network configuration and
power management for e1000e, which is similar to a past issue in the igb
driver.  Also provided a bit of code cleanup since the driver no longer
checks for __E1000_DOWN.

Josh Hunt adds UDP segmentation offload support for igb, ixgbe and i40e.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agowimax: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
zhong jiang [Wed, 30 Oct 2019 02:55:34 +0000 (10:55 +0800)]
wimax: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops

It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file
operation rather than DEFINE_SIMPLE_ATTRIBUTE.

It is detected with the help of coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: add ethtool pause configuration support
Heiner Kallweit [Tue, 29 Oct 2019 21:32:48 +0000 (22:32 +0100)]
net: dsa: add ethtool pause configuration support

This patch adds glue logic to make pause settings per port
configurable vie ethtool.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovxlan: drop "vxlan" parameter in vxlan_fdb_alloc()
Guillaume Nault [Tue, 29 Oct 2019 20:57:10 +0000 (21:57 +0100)]
vxlan: drop "vxlan" parameter in vxlan_fdb_alloc()

This parameter has never been used.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: add downshift support for 88E1145
Heiner Kallweit [Tue, 29 Oct 2019 19:25:26 +0000 (20:25 +0100)]
net: phy: marvell: add downshift support for 88E1145

Add downshift support for 88E1145, it uses the same downshift
configuration registers as 88E1111.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'ICMP-flow-improvements'
David S. Miller [Thu, 31 Oct 2019 00:21:35 +0000 (17:21 -0700)]
Merge branch 'ICMP-flow-improvements'

Matteo Croce says:

====================
ICMP flow improvements

This series improves the flow inspector handling of ICMP packets:
The first two patches just add some comments in the code which would have saved
me a few minutes of time, and refactor a piece of code.
The third one adds to the flow inspector the capability to extract the
Identifier field, if present, so echo requests and replies are classified
as part of the same flow.
The fourth patch uses the function introduced earlier to the bonding driver,
so echo replies can be balanced across bonding slaves.

v1 -> v2:
 - remove unused struct members
 - add an helper to check for the Id field
 - use a local flow_dissector_key in the bonding to avoid
   changing behaviour of the flow dissector
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobonding: balance ICMP echoes in layer3+4 mode
Matteo Croce [Tue, 29 Oct 2019 13:50:53 +0000 (14:50 +0100)]
bonding: balance ICMP echoes in layer3+4 mode

The bonding uses the L4 ports to balance flows between slaves. As the ICMP
protocol has no ports, those packets are sent all to the same device:

    # tcpdump -qltnni veth0 ip |sed 's/^/0: /' &
    # tcpdump -qltnni veth1 ip |sed 's/^/1: /' &
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 315, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 315, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 316, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 316, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 317, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 317, seq 1, length 64

But some ICMP packets have an Identifier field which is
used to match packets within sessions, let's use this value in the hash
function to balance these packets between bond slaves:

    # ping -qc1 192.168.0.2
    0: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 303, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 303, seq 1, length 64
    # ping -qc1 192.168.0.2
    1: IP 192.168.0.1 > 192.168.0.2: ICMP echo request, id 304, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: ICMP echo reply, id 304, seq 1, length 64

Aso, let's use a flow_dissector_key which defines FLOW_DISSECTOR_KEY_ICMP,
so we can balance pings encapsulated in a tunnel when using mode encap3+4:

    # ping -q 192.168.1.2 -c1
    0: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 585, seq 1, length 64
    0: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 585, seq 1, length 64
    # ping -q 192.168.1.2 -c1
    1: IP 192.168.0.1 > 192.168.0.2: GREv0, length 102: IP 192.168.1.1 > 192.168.1.2: ICMP echo request, id 586, seq 1, length 64
    1: IP 192.168.0.2 > 192.168.0.1: GREv0, length 102: IP 192.168.1.2 > 192.168.1.1: ICMP echo reply, id 586, seq 1, length 64

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_dissector: extract more ICMP information
Matteo Croce [Tue, 29 Oct 2019 13:50:52 +0000 (14:50 +0100)]
flow_dissector: extract more ICMP information

The ICMP flow dissector currently parses only the Type and Code fields.
Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
is used to correlate packets.
Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
with a more complex function which populate this field.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_dissector: skip the ICMP dissector for non ICMP packets
Matteo Croce [Tue, 29 Oct 2019 13:50:51 +0000 (14:50 +0100)]
flow_dissector: skip the ICMP dissector for non ICMP packets

FLOW_DISSECTOR_KEY_ICMP is checked for every packet, not only ICMP ones.
Even if the test overhead is probably negligible, move the
ICMP dissector code under the big 'switch(ip_proto)' so it gets called
only for ICMP packets.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_dissector: add meaningful comments
Matteo Croce [Tue, 29 Oct 2019 13:50:50 +0000 (14:50 +0100)]
flow_dissector: add meaningful comments

Documents two piece of code which can't be understood at a glance.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotc-testing: fixed two failing pedit tests
Roman Mashak [Wed, 30 Oct 2019 19:08:43 +0000 (15:08 -0400)]
tc-testing: fixed two failing pedit tests

Two pedit tests were failing due to incorrect operation
value in matchPattern, should be 'add' not 'val', so fix it.

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: add smart nagle feature
Jon Maloy [Wed, 30 Oct 2019 13:00:41 +0000 (14:00 +0100)]
tipc: add smart nagle feature

We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.

- The nagle property as such is determined by manipulating a new
  'maxnagle' field in struct tipc_sock. If certain conditions are met,
  'maxnagle' will define max size of the messages which can be bundled.
  If it is set to zero no messages are ever bundled, implying that the
  nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
  than 4 messages have been sent out without receiving any data message
  from the peer.
- A socket leaves nagle mode whenever it receives a data message from
  the peer.

In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.

The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.

- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
  no outstanding message where the 'ack_required' bit was set. As a
  consequence, the first message added after we enter nagle mode
  is always sent directly with this bit set.

This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.

Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'mlxsw-Update-firmware-version'
David S. Miller [Wed, 30 Oct 2019 19:07:05 +0000 (12:07 -0700)]
Merge branch 'mlxsw-Update-firmware-version'

Ido Schimmel says:

====================
mlxsw: Update firmware version

This patch set updates the firmware version for Spectrum-1 and enforces
a firmware version for Spectrum-2.

The version adds support for querying port module type. It will be used
by a followup patch set from Jiri to make port split code more generic.

Patch #1 increases the size of an existing register in order to be
compatible with the new firmware version. In the future the firmware
will assign default values to fields not specified by the driver.

Patch #2 temporarily increases the PCI reset timeout for SN3800 systems.
Note that in normal cases the driver will need to wait no longer than 5
seconds for the device to become ready following reset command.

Patch #3 bumps the firmware version for Spectrum-1.

Patch #4 enforces a minimum firmware version for Spectrum-2.

v2:
* Added patch #2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: Enforce firmware version for Spectrum-2
Ido Schimmel [Wed, 30 Oct 2019 09:34:51 +0000 (11:34 +0200)]
mlxsw: Enforce firmware version for Spectrum-2

In a similar fashion to Spectrum-1, enforce a specific firmware version
for Spectrum-2 so that the driver and firmware are always in sync with
regards to new features.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: Bump firmware version to 13.2000.2308
Ido Schimmel [Wed, 30 Oct 2019 09:34:50 +0000 (11:34 +0200)]
mlxsw: Bump firmware version to 13.2000.2308

The version adds support for querying port module type. It will be used
by a followup patch set from Jiri to make port split code more generic.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: pci: Increase PCI reset timeout for SN3800 systems
Ido Schimmel [Wed, 30 Oct 2019 09:34:49 +0000 (11:34 +0200)]
mlxsw: pci: Increase PCI reset timeout for SN3800 systems

SN3800 Spectrum-2 based systems have gearboxes that need to be
initialized by the firmware during its initialization flow. In certain
cases, the firmware might need to flash these gearboxes, which is
currently a time-consuming process.

In newer firmware versions, the firmware will not signal to the driver
that it is ready until the gearboxes are flashed. Increase the PCI reset
timeout for these situations. In normal cases, the driver will need to
wait no longer than 5 seconds.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: reg: Increase size of MPAR register
Ido Schimmel [Wed, 30 Oct 2019 09:34:48 +0000 (11:34 +0200)]
mlxsw: reg: Increase size of MPAR register

In new firmware versions this register is extended with a sampling rate
for Spectrum-2 and future ASICs.

Increase the size of the register to ensure the field is initialized to
0 which means every packet is mirrored.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoe1000e: Fix compiler warning when CONFIG_PM_SLEEP is not set
Sasha Neftin [Wed, 23 Oct 2019 15:09:17 +0000 (18:09 +0300)]
e1000e: Fix compiler warning when CONFIG_PM_SLEEP is not set

When CONFIG_PM_SLEEP is not defined compiler complain as follow:
CC [M]  drivers/net/ethernet/intel/e1000e/netdev.o
drivers/net/ethernet/intel/e1000e/netdev.c:6302:12: warning: â€˜e1000e_s0ix_entry_flow’ defined but not used [-Wunused-function]
static void e1000e_s0ix_entry_flow(struct e1000_adapter *adapter)
drivers/net/ethernet/intel/e1000e/netdev.c:6411:12: warning: â€˜e1000e_s0ix_exit_flow’ defined but not used [-Wunused-function]
static void e1000e_s0ix_exit_flow(struct e1000_adapter *adapter)
LD [M]  drivers/net/ethernet/intel/e1000e/e1000e.o

Add wrap to fix these warnings.

Reported-by: kbuild test robot <lpk@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoe1000e: Add support for Tiger Lake
Sasha Neftin [Wed, 16 Oct 2019 08:08:38 +0000 (11:08 +0300)]
e1000e: Add support for Tiger Lake

Add devices ID's for the next LOM generations that will be
available on the next Intel Client platform (Tiger Lake)
This patch provides the initial support for these devices

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoi40e: Add UDP segmentation offload support
Josh Hunt [Fri, 11 Oct 2019 16:53:40 +0000 (12:53 -0400)]
i40e: Add UDP segmentation offload support

Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoixgbe: Add UDP segmentation offload support
Josh Hunt [Fri, 11 Oct 2019 16:53:39 +0000 (12:53 -0400)]
ixgbe: Add UDP segmentation offload support

Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the igb driver:
https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Willem de Bruijn <willemb@google.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoigb: Add UDP segmentation offload support
Josh Hunt [Fri, 11 Oct 2019 16:53:38 +0000 (12:53 -0400)]
igb: Add UDP segmentation offload support

Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Willem de Bruijn <willemb@google.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoMerge branch 'nfc-pn533-add-uart-phy-driver'
David S. Miller [Wed, 30 Oct 2019 04:05:26 +0000 (21:05 -0700)]
Merge branch 'nfc-pn533-add-uart-phy-driver'

Lars Poeschel says:

====================
nfc: pn533: add uart phy driver

The purpose of this patch series is to add a uart phy driver to the
pn533 nfc driver.
It first changes the dt strings and docs. The dt compatible strings
need to change, because I would add "pn532-uart" to the already
existing "pn533-i2c" one. These two are now unified into just
"pn532". Then the neccessary changes to the pn533 core driver are
made. Then the uart phy is added.
As the pn532 chip supports a autopoll, I wanted to use this instead
of the software poll loop in the pn533 core driver. It is added and
activated by the last to patches.
The way to add the autopoll later in seperate patches is chosen, to
show, that the uart phy driver can also work with the software poll
loop, if someone needs that for some reason.
In v11 of this patchseries I address a byte ordering issue reported
by kbuild test robot in patch 5/7.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn532_uart: Make use of pn532 autopoll
Lars Poeschel [Tue, 29 Oct 2019 14:48:02 +0000 (15:48 +0100)]
nfc: pn532_uart: Make use of pn532 autopoll

This switches the pn532 UART phy driver from manually polling to the new
autopoll mechanism.

Cc: Johan Hovold <johan@kernel.org>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn533: Add autopoll capability
Lars Poeschel [Tue, 29 Oct 2019 14:47:43 +0000 (15:47 +0100)]
nfc: pn533: Add autopoll capability

pn532 devices support an autopoll command, that lets the chip
automatically poll for selected nfc technologies instead of manually
looping through every single nfc technology the user is interested in.
This is faster and less cpu and bus intensive than manually polling.
This adds this autopoll capability to the pn533 driver.

Cc: Johan Hovold <johan@kernel.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn533: add UART phy driver
Lars Poeschel [Tue, 29 Oct 2019 14:47:14 +0000 (15:47 +0100)]
nfc: pn533: add UART phy driver

This adds the UART phy interface for the pn533 driver.
The pn533 driver can be used through UART interface this way.
It is implemented as a serdev device.

Cc: Johan Hovold <johan@kernel.org>
Cc: Claudiu Beznea <Claudiu.Beznea@microchip.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn533: Split pn533 init & nfc_register
Lars Poeschel [Tue, 29 Oct 2019 14:46:46 +0000 (15:46 +0100)]
nfc: pn533: Split pn533 init & nfc_register

There is a problem in the initialisation and setup of the pn533: It
registers with nfc too early. It could happen, that it finished
registering with nfc and someone starts using it. But setup of the pn533
is not yet finished. Bad or at least unintended things could happen.
So I split out nfc registering (and unregistering) to seperate functions
that have to be called late in probe then.
i2c requires a bit more love: i2c requests an irq in it's probe
function. 'Commit 32ecc75ded72 ("NFC: pn533: change order operations in
dev registation")' shows, this can not happen too early. An irq can be
served before structs are fully initialized. The way chosen to prevent
this is to request the irq after nfc_alloc_device initialized the
structs, but before nfc_register_device. So there is now this
pn532_i2c_nfc_alloc function.

Cc: Johan Hovold <johan@kernel.org>
Cc: Claudiu Beznea <Claudiu.Beznea@microchip.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn533: Add dev_up/dev_down hooks to phy_ops
Lars Poeschel [Tue, 29 Oct 2019 14:46:29 +0000 (15:46 +0100)]
nfc: pn533: Add dev_up/dev_down hooks to phy_ops

This adds hooks for dev_up and dev_down to the phy_ops. They are
optional.
The idea is to inform the phy driver when the nfc chip is really going
to be used. When it is not used, the phy driver can suspend it's
interface to the nfc chip to save some power. The nfc chip is considered
not in use before dev_up and after dev_down.

Cc: Johan Hovold <johan@kernel.org>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn532: Add uart phy docs and rename it
Lars Poeschel [Tue, 29 Oct 2019 14:45:58 +0000 (15:45 +0100)]
nfc: pn532: Add uart phy docs and rename it

This adds documentation about the uart phy to the pn532 binding doc. As
the filename "pn533-i2c.txt" is not appropriate any more, rename it to
the more general "pn532.txt".
This also documents the deprecation of the compatible strings ending
with "...-i2c".

Cc: Johan Hovold <johan@kernel.org>
Cc: Simon Horman <horms@verge.net.au>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfc: pn533: i2c: "pn532" as dt compatible string
Lars Poeschel [Tue, 29 Oct 2019 14:43:14 +0000 (15:43 +0100)]
nfc: pn533: i2c: "pn532" as dt compatible string

It is favourable to have one unified compatible string for devices that
have multiple interfaces. So this adds simply "pn532" as the devicetree
binding compatible string and makes a note that the old ones are
deprecated.

Cc: Johan Hovold <johan@kernel.org>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Lars Poeschel <poeschel@lemonage.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoe1000e: Drop unnecessary __E1000_DOWN bit twiddling
Alexander Duyck [Fri, 11 Oct 2019 15:34:59 +0000 (08:34 -0700)]
e1000e: Drop unnecessary __E1000_DOWN bit twiddling

Since we no longer check for __E1000_DOWN in e1000e_close we can drop the
spot where we were restoring the bit. This saves us a bit of unnecessary
complexity.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoe1000e: Use rtnl_lock to prevent race conditions between net and pci/pm
Alexander Duyck [Fri, 11 Oct 2019 15:34:52 +0000 (08:34 -0700)]
e1000e: Use rtnl_lock to prevent race conditions between net and pci/pm

This patch is meant to address possible race conditions that can exist
between network configuration and power management. A similar issue was
fixed for igb in commit 9474933caf21 ("igb: close/suspend race in
netif_device_detach").

In addition it consolidates the code so that the PCI error handling code
will essentially perform the power management freeze on the device prior to
attempting a reset, and will thaw the device afterwards if that is what it
is planning to do. Otherwise when we call close on the interface it should
see it is detached and not attempt to call the logic to down the interface
and free the IRQs again.

From what I can tell the check that was adding the check for __E1000_DOWN
in e1000e_close was added when runtime power management was added. However
it should not be relevant for us as we perform a call to
pm_runtime_get_sync before we call e1000_down/free_irq so it should always
be back up before we call into this anyway.

Reported-by: Morumuri Srivalli <smorumu1@in.ibm.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Tested-by: David Dai <zdai@linux.vnet.ibm.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
5 years agoe1000e: Add support for Comet Lake
Sasha Neftin [Thu, 10 Oct 2019 10:15:39 +0000 (13:15 +0300)]
e1000e: Add support for Comet Lake

Add devices ID's for the next LOM generations that will be
available on the next Intel Client platform (Comet Lake)
This patch provides the initial support for these devices

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
5 years agoMerge branch 'bridge-fdbs-bitops'
David S. Miller [Wed, 30 Oct 2019 01:12:49 +0000 (18:12 -0700)]
Merge branch 'bridge-fdbs-bitops'

Nikolay Aleksandrov says:

====================
net: bridge: convert fdbs to use bitops

We'd like to have a well-defined behaviour when changing fdb flags. The
problem is that we've added new fields which are changed from all
contexts without any locking. We are aware of the bit test/change races
and these are fine (we can remove them later), but it is considered
undefined behaviour to change bitfields from multiple threads and also
on some architectures that can result in unexpected results,
specifically when all fields between the changed ones are also
bitfields. The conversion to bitops shows the intent clearly and
makes them use functions with well-defined behaviour in such cases.
There is no overhead for the fast-path, the bit changing functions are
used only in special cases when learning and in the slow path.
In addition this conversion allows us to simplify fdb flag handling and
avoid bugs for future bits (e.g. a forgetting to clear the new bit when
allocating a new fdb). All bridge selftests passed, also tried all of the
converted bits manually in a VM.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: set flags directly in fdb_create
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:59 +0000 (13:45 +0200)]
net: bridge: fdb: set flags directly in fdb_create

No need to have separate arguments for each flag, just set the flags to
whatever was passed to fdb_create() before the fdb is published.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert offloaded to use bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:58 +0000 (13:45 +0200)]
net: bridge: fdb: convert offloaded to use bitops

Convert the offloaded field to a flag and use bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert added_by_external_learn to use bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:57 +0000 (13:45 +0200)]
net: bridge: fdb: convert added_by_external_learn to use bitops

Convert the added_by_external_learn field to a flag and use bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert added_by_user to bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:56 +0000 (13:45 +0200)]
net: bridge: fdb: convert added_by_user to bitops

Straight-forward convert of the added_by_user field to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert is_sticky to bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:55 +0000 (13:45 +0200)]
net: bridge: fdb: convert is_sticky to bitops

Straight-forward convert of the is_sticky field to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert is_static to bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:54 +0000 (13:45 +0200)]
net: bridge: fdb: convert is_static to bitops

Convert the is_static to bitops, make use of the combined
test_and_set/clear_bit to simplify expressions in fdb_add_entry.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bridge: fdb: convert is_local to bitops
Nikolay Aleksandrov [Tue, 29 Oct 2019 11:45:53 +0000 (13:45 +0200)]
net: bridge: fdb: convert is_local to bitops

The patch adds a new fdb flags field in the hole between the two cache
lines and uses it to convert is_local to bitops.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/smc: remove unneeded include for smc.h
Ursula Braun [Tue, 29 Oct 2019 11:43:46 +0000 (12:43 +0100)]
net/smc: remove unneeded include for smc.h

The only smc-related reference in net/sock.h is struct smc_hashinfo.
But just its address is refered to. Thus there is no need for the
include of net/smc.h. Remove it.

Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: improve throughput between nodes in netns
Hoang Le [Tue, 29 Oct 2019 00:51:21 +0000 (07:51 +0700)]
tipc: improve throughput between nodes in netns

Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.

We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.

To meet any security concerns, the following should be noted.

- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.

- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.

- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.

- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.

- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.

Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.

v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback

Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoinet: do not call sublist_rcv on empty list
Florian Westphal [Tue, 29 Oct 2019 00:44:04 +0000 (01:44 +0100)]
inet: do not call sublist_rcv on empty list

syzbot triggered struct net NULL deref in NF_HOOK_LIST:
RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline]
RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292
 ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328
 __netif_receive_skb_list_ptype net/core/dev.c:5274 [inline]

Reason:
void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
                   struct net_device *orig_dev)
[..]
        list_for_each_entry_safe(skb, next, head, list) {
/* iterates list */
                skb = ip6_rcv_core(skb, dev, net);
/* ip6_rcv_core drops skb -> NULL is returned */
                if (skb == NULL)
                        continue;
[..]
}
/* sublist is empty -> curr_net is NULL */
        ip6_sublist_rcv(&sublist, curr_dev, curr_net);

Before the recent change NF_HOOK_LIST did a list iteration before
struct net deref, i.e. it was a no-op in the empty list case.

List iteration now happens after *net deref, causing crash.

Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty
test for the final sublist dispatch too.

Cc: Edward Cree <ecree@solarflare.com>
Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com
Fixes: ca58fbe06c54 ("netfilter: add and use nf_hook_slow_list()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobroadcom: bnxt: Fix use true/false for bool
Saurav Girepunje [Mon, 28 Oct 2019 20:16:35 +0000 (01:46 +0530)]
broadcom: bnxt: Fix use true/false for bool

Use true/false for bool type in bnxt_timer function.

Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agocavium: thunder: Fix use true/false for bool type
Saurav Girepunje [Mon, 28 Oct 2019 20:09:50 +0000 (01:39 +0530)]
cavium: thunder: Fix use true/false for bool type

use true/false on bool type variables for assignment.

Signed-off-by: Saurav Girepunje <saurav.girepunje@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'net-phy-marvell-fix-and-extend-downshift-support'
David S. Miller [Wed, 30 Oct 2019 00:50:10 +0000 (17:50 -0700)]
Merge branch 'net-phy-marvell-fix-and-extend-downshift-support'

Heiner Kallweit says:

====================
net: phy: marvell: fix and extend downshift support

This series includes two fixes and two extensions for downshift support.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: add PHY tunable support for more PHY versions
Heiner Kallweit [Mon, 28 Oct 2019 19:54:17 +0000 (20:54 +0100)]
net: phy: marvell: add PHY tunable support for more PHY versions

More PHY versions are compatible with the existing downshift
implementation, so let's add downshift support for them.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: add downshift support for M88E1111
Heiner Kallweit [Mon, 28 Oct 2019 19:53:25 +0000 (20:53 +0100)]
net: phy: marvell: add downshift support for M88E1111

This patch adds downshift support for M88E1111. This PHY version uses
another register for downshift configuration, reading downshift status
is possible via the same register as for other PHY versions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: fix downshift function naming
Heiner Kallweit [Mon, 28 Oct 2019 19:52:55 +0000 (20:52 +0100)]
net: phy: marvell: fix downshift function naming

I got access to the M88E1111 datasheet, and this PHY version uses
another register for downshift configuration. Therefore change prefix
to m88e1011, aligned with constants like MII_M1011_PHY_SCR.

Fixes: a3bdfce7bf9c ("net: phy: marvell: support downshift as PHY tunable")
Reported-by: Chris Healy <Chris.Healy@zii.aero>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: marvell: fix typo in constant MII_M1011_PHY_SRC_DOWNSHIFT_MASK
Heiner Kallweit [Mon, 28 Oct 2019 19:52:22 +0000 (20:52 +0100)]
net: phy: marvell: fix typo in constant MII_M1011_PHY_SRC_DOWNSHIFT_MASK

Fix typo and use PHY_SCR for PHY-specific Control Register.

Fixes: a3bdfce7bf9c ("net: phy: marvell: support downshift as PHY tunable")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoDocumentation: net-sysfs: describe missing statistics
Julian Wiedmann [Mon, 28 Oct 2019 16:08:00 +0000 (17:08 +0100)]
Documentation: net-sysfs: describe missing statistics

Sync the ABI description with the interface statistics that are currently
available through sysfs.

CC: Jarod Wilson <jarod@redhat.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: linux-doc@vger.kernel.org
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoionic: Remove set but not used variable 'sg_desc'
YueHaibing [Mon, 28 Oct 2019 12:01:21 +0000 (12:01 +0000)]
ionic: Remove set but not used variable 'sg_desc'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c: In function 'ionic_rx_empty':
drivers/net/ethernet/pensando/ionic/ionic_txrx.c:405:28: warning:
 variable 'sg_desc' set but not used [-Wunused-but-set-variable]

It is never used, so can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: dp83867: support Wake on LAN
Thomas Haemmerle [Mon, 28 Oct 2019 08:08:14 +0000 (08:08 +0000)]
net: phy: dp83867: support Wake on LAN

This adds WoL support on TI DP83867 for magic, magic secure, unicast and
broadcast.

Signed-off-by: Thomas Haemmerle <thomas.haemmerle@wolfvision.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: aquantia: fix error handling in aq_ptp_poll
Gustavo A. R. Silva [Mon, 28 Oct 2019 07:04:47 +0000 (02:04 -0500)]
net: aquantia: fix error handling in aq_ptp_poll

Fix currenty ignored returned error by properly checking *err* after
calling aq_nic->aq_hw_ops->hw_ring_hwts_rx_fill().

Addresses-Coverity-ID: 1487357 ("Unused value")
Fixes: 04a1839950d9 ("net: aquantia: implement data PTP datapath")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: aquantia: remove unused including <linux/version.h>
YueHaibing [Sat, 26 Oct 2019 02:51:09 +0000 (02:51 +0000)]
net: aquantia: remove unused including <linux/version.h>

Remove including <linux/version.h> that don't need it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: LAN9303: select REGMAP when LAN9303 enable
Mao Wenan [Sat, 26 Oct 2019 02:21:39 +0000 (10:21 +0800)]
net: dsa: LAN9303: select REGMAP when LAN9303 enable

When NET_DSA_SMSC_LAN9303=y and NET_DSA_SMSC_LAN9303_MDIO=y,
below errors can be seen:
drivers/net/dsa/lan9303_mdio.c:87:23: error: REGMAP_ENDIAN_LITTLE
undeclared here (not in a function)
  .reg_format_endian = REGMAP_ENDIAN_LITTLE,
drivers/net/dsa/lan9303_mdio.c:93:3: error: const struct regmap_config
has no member named reg_read
  .reg_read = lan9303_mdio_read,

It should select REGMAP in config NET_DSA_SMSC_LAN9303.

Fixes: dc7005831523 ("net: dsa: LAN9303: add MDIO managed mode support")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: aquantia: make two symbols be static
Mao Wenan [Sat, 26 Oct 2019 02:07:38 +0000 (10:07 +0800)]
net: aquantia: make two symbols be static

When using ARCH=mips CROSS_COMPILE=mips-linux-gnu-
to build drivers/net/ethernet/aquantia/atlantic/aq_ptp.o
and drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.o,
below errors can be seen:
drivers/net/ethernet/aquantia/atlantic/aq_ptp.c:1378:6:
warning: symbol 'aq_ptp_poll_sync_work_cb' was not declared.
Should it be static?

drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c:1155:5:
warning: symbol 'hw_atl_b0_ts_to_sys_clock' was not declared.
Should it be static?

This patch to make aq_ptp_poll_sync_work_cb and hw_atl_b0_ts_to_sys_clock
be static to fix these warnings.

Fixes: 9c477032f7d0 ("net: aquantia: add support for PIN funcs")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Tue, 29 Oct 2019 23:08:54 +0000 (16:08 -0700)]
Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2019-10-25

This series contains updates to i40e only.  Several are fixes that could
go to 'net', but were intended for 'net-next'.

Sylwia changes how the driver function to read the NVM module data, so
that it is able to read the LLDP agent configuration to allow for
persistent LLDP.

Jaroslaw resolves an issue where the incorrect FEC settings were being
displayed in ethtool, by setting the proper FEC bits.

Piotr moves the hardware flags detection into a separate function, so
that the specific flags can be set based on the MAC and NVM.  Also
extends the PHY access function to include a command flag to let the
firmware know it should not change the page while accessing a OSFP module.
Updates the driver to display the driver and firmware version when in
recovery mode.

Aleksandr refactored the VF MAC filters accounting since an untrusted
VF was able to delete but not add a MAC filter, so refactor the code to
have more consistency and improved logging.

Nicholas updates the driver to use a default interval of 50 usecs,
instead of the current 100 usecs which was causing some regression
performance issues.

Damian resolved LED blinking issues for X710T*L devices by adding
specific flows for these devices in the LED operations.

Navid Emamdoost found where allocated memory is not being properly freed
upon a failure in setting up MAC VLANs, so added the missing kfree().

v2: Dropped patches 2 & 6 from the original series while we wait for the
    author to respond to community feedback.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: fec: remove redundant assignment to pointer bdp
Colin Ian King [Fri, 25 Oct 2019 17:22:55 +0000 (18:22 +0100)]
net: fec: remove redundant assignment to pointer bdp

The pointer bdp is being assigned with a value that is never
read, so the assignment is redundant and hence can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Fugang Duan <fugang.duan@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>