git.openwrt.org Git - openwrt/staging/blogic.git/log

qed: Add support for hardware offloaded iSCSI.

This adds the backbone required for the various HW initalizations
which are necessary for the iSCSI driver (qedi) for QLogic FastLinQ
4xxxx line of adapters - FW notification, resource initializations, etc.

Signed-off-by: Arun Easi <arun.easi@cavium.com>
Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: atarilance: use %8ph for printing hex string

This is already using the %pM printf extension; might as well also use
%ph to make the code smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: skip loopback selftest with !CONFIG_INET

When CONFIG_INET is disabled, the new selftest results in a link
error:

drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: In function `mlx5e_test_loopback':
en_selftest.c:(.text.mlx5e_test_loopback+0x2ec): undefined reference to `ip_send_check'
en_selftest.c:(.text.mlx5e_test_loopback+0x34c): undefined reference to `udp4_hwcsum'

This hides the specific test in that configuration.

Fixes: 0952da791c97 ("net/mlx5e: Add support for loopback selftest")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, xdp: drop rcu_read_lock from bpf_prog_run_xdp and move to caller

After 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
the rcu_read_lock() in bpf_prog_run_xdp() is superfluous, since callers
need to hold rcu_read_lock() already to make sure BPF program doesn't
get released in the background.

Thus, drop it from bpf_prog_run_xdp(), as it can otherwise be misleading.
Still keeping the bpf_prog_run_xdp() is useful as it allows for grepping
in XDP supported drivers and to keep the typecheck on the context intact.
For mlx4, this means we don't have a double rcu_read_lock() anymore. nfp can
just make use of bpf_prog_run_xdp(), too. For qede, just move rcu_read_lock()
out of the helper. When the driver gets atomic replace support, this will
move to call-sites eventually.

mlx5 needs actual fixing as it has the same issue as described already in
326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"),
that is, we're under RCU bh at this time, BPF programs are released via
call_rcu(), and call_rcu() != call_rcu_bh(), so we need to properly mark
read side as programs can get xchg()'ed in mlx5e_xdp_set() without queue
reset.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: reset sk_err for ICMP packets read from error queue

Only when ICMP packets are enqueued onto the error queue,
sk_err is also set. Before f5f99309fa74 (sock: do not set sk_err
in sock_dequeue_err_skb), a subsequent error queue read
would set sk_err to the next error on the queue, or 0 if empty.
As no error types other than ICMP set this field, sk_err should
not be modified upon dequeuing them.

Only for ICMP errors, reset the (racy) sk_err. Some applications,
like traceroute, rely on it and go into a futile busy POLLERR
loop otherwise.

In principle, sk_err has to be set while an ICMP error is queued.
Testing is_icmp_err_skb(skb_next) approximates this without
requiring a full queue walk. Applications that receive both ICMP
and other errors cannot rely on this legacy behavior, as other
errors do not set sk_err in the first place.

Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb)
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'lwt-bpf'

Thomas Graf says:

====================
bpf: BPF for lightweight tunnel encapsulation

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
skb as context. Programs attached to input and output are read-only.
Programs attached to lwtunnel_xmit() can modify and redirect, push headers
and redirect packets.

The facility can be used to:
- Collect statistics and generate sampling data for a subset of traffic
   based on the dst utilized by the packet thus allowing to extend the
   existing realms.
- Apply additional per route/dst filters to prohibit certain outgoing
   or incoming packets based on BPF filters. In particular, this allows
   to maintain per dst custom state across multiple packets in BPF maps
   and apply filters based on statistics and behaviour observed over time.
- Attachment of L2 headers at transmit where resolving the L2 address
   is not required.
- Possibly many more.

v3 -> v4:
- Bumped LWT_BPF_MAX_HEADROOM from 128 to 256 (Alexei)
- Renamed bpf_skb_push() helper to bpf_skb_change_head() to relate to
   existing bpf_skb_change_tail() helper (Alexei/Daniel)
- Added check in __bpf_redirect_common() to verify that program added a
   link header before redirecting to a l2 device. Adding the check to
   lwt-bpf code was considered but dropped due to massive code required
   due to retrieval of net_device via per-cpu redirect buffer. A test
   case was added to cover the scenario when a program directs to an l2
   device without adding an appropriate l2 header.
   (Alexei)
- Prohibited access to tc_classid (Daniel)
- Collapsed bpf_verifier_ops instance for lwt in/out as they are
   identical (Daniel)
- Some cosmetic changes

v2 -> v3:
- Added real world sample lwt_len_hist_kern.c which demonstrates how to
   collect a histogram on packet sizes for all packets flowing through
   a number of routes.
- Restricted output to be read-only. Since the header can no longer
   be modified, the rerouting functionality has been removed again.
- Added test case which cover destructive modification of packet data.

v1 -> v2:
- Added new BPF_LWT_REROUTE return code for program to indicate
   that new route lookup should be performed. Suggested by Tom.
- New sample to illustrate rerouting
- New patch 05: Recursion limit for lwtunnel_output for the case
   when user creates circular dst redirection. Also resolves the
   issue for ILA.
- Fix to ensure headroom for potential future L2 header is still
   guaranteed
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: Add tests and samples for LWT-BPF

Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.

Also adds a sample which collects a histogram of packet sizes which
pass through an LWT hook.

$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

87380  16384  16384    10.00    39857.69
       1 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 22       |                                      |
      64 -> 127      : 98       |                                      |
     128 -> 255      : 213      |                                      |
     256 -> 511      : 1444251  |********                              |
     512 -> 1023     : 660610   |***                                   |
    1024 -> 2047     : 535241   |**                                    |
    2048 -> 4095     : 19       |                                      |
    4096 -> 8191     : 180      |                                      |
    8192 -> 16383    : 5578023  |************************************* |
   16384 -> 32767    : 632099   |***                                   |
   32768 -> 65535    : 6575     |                                      |

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: BPF for lightweight tunnel infrastructure

Registers new BPF program types which correspond to the LWT hooks:
  - BPF_PROG_TYPE_LWT_IN   => dst_input()
  - BPF_PROG_TYPE_LWT_OUT  => dst_output()
  - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()

The separate program types are required to differentiate between the
capabilities each LWT hook allows:

* Programs attached to dst_input() or dst_output() are restricted and
   may only read the data of an skb. This prevent modification and
   possible invalidation of already validated packet headers on receive
   and the construction of illegal headers while the IP headers are
   still being assembled.

* Programs attached to lwtunnel_xmit() are allowed to modify packet
   content as well as prepending an L2 header via a newly introduced
   helper bpf_skb_change_head(). This is safe as lwtunnel_xmit() is
   invoked after the IP header has been assembled completely.

All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:

BPF_OK - Continue routing as per nexthop
BPF_DROP - Drop skb and return EPERM
BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid in lwtunnel_xmit() context)

The return codes are binary compatible with their TC_ACT_
relatives to ease compatibility.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

route: Set lwtstate for local traffic and cached input dsts

A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Thus, lwt redirection for input must be done
for dsts allocated in the otuput path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

route: Set orig_output when redirecting to lwt on locally generated traffic

orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx5-updates'

Saeed Mahameed says:

====================
Mellanox 100G mlx5 updates 2016-11-29

The following series from Tariq and Roi, provides some critical fixes
and updates for the mlx5e driver.

From Tariq:
- Fix driver coherent memory huge allocation issues by fragmenting
   completion queues, in a way that is transparent to the netdev driver by
   providing a new buffer type "mlx5_frag_buf" with the same access API.
- Create UMR MKey per RQ to have better scalability.

From Roi:
- Some fixes for the encap-decap support and tc flower added lately to the
   mlx5e driver.

v1->v2:
- Fix start index in error flow of mlx5_frag_buf_alloc_node, pointed out by Eric.

This series was generated against commit:
31ac1c19455f ("geneve: fix ip_hdr_len reserved for geneve6 tunnel.")
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Remove flow encap entry in the correct place

Handling flow encap entry should be inside tc del flow
and is only relevant for offloaded eswitch TC rules.

Fixes: 11a457e9b6c1 ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Refactor tc del flow to accept mlx5e_tc_flow instance

Change the function that deletes offloaded TC rule to get
struct mlx5e_tc_flow instance which contains both the flow
handle and flow attributes. This is a cleanup needed for
downstream patches, it doesn't change any functionality.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Correct cleanup order when deleting offloaded TC rules

According to the reverse unwinding principle, on delete time we should
first handle deletion of the steering rule and later handle the vlan
deletion from the eswitch.

Fixes: 8b32580df1cb ("net/mlx5e: Add TC vlan action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Remove redundant hashtable lookup in configure flower

We will never find a flow with the same cookie as cls_flower always
allocates a new flow and the cookie is the allocated memory address.

Fixes: e3a2b7ed018e ("net/mlx5e: Support offload cls_flower with drop action")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Create UMR MKey per RQ

In Striding RQ implementation, we used a single UMR
(User-Mode Memory Registration) memory key for all RQs.
When the product of RQs number*size gets high, we hit a
limitation of u16 field size in FW.

Here we move to using a UMR memory key per RQ, so we can
scale to any number of rings, with the maximum buffer
size in each.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Move function mlx5e_create_umr_mkey

In next patch we are going to create a UMR MKey per RQ, we need
mlx5e_create_umr_mkey declared before mlx5e_create_rq.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Implement Fragmented Work Queue (WQ)

Add new type of struct mlx5_frag_buf which is used to allocate fragmented
buffers rather than contiguous, and make the Completion Queues (CQs) use
it as they are big (default of 2MB per CQ in Striding RQ).

This fixes the failures of type:
"mlx5e_open_locked: mlx5e_open_channels failed, -12"
due to dma_zalloc_coherent insufficient contiguous coherent memory to
satisfy the driver's request when the user tries to setup more or larger
rings.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'altera-tse-sgmii-pcs'

Neill Whillans says:

====================
net: Add support for SGMII PCS on Altera TSE MAC

These patches were created as part of work to add support for SGMII
PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6
git tree.

The first patch in the series adds support for the VSC8572 dual-port
Gigabit Ethernet transceiver, used in integration testing.

The second patch adds support for the SGMII PCS functionality to the
Altera TSE driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: altera_tse: add support for SGMII PCS

Add support for the (optional) SGMII PCS functionality of the Altera
TSE MAC. If the phy-mode is set to 'sgmii' then we attempt to discover
and initialise the PCS so that the MAC can communicate to the PHY.

The PCS IP block provides a scratch register for testing presence of
the PCS, which is mapped into one of the two MDIO spaces present in
the MAC's register space. Once we have determined that the scratch
register is functioning, we attempt to initialise the PCS to
auto-negotiate an SGMII link with the PHY. There is no need to monitor
or manage the SGMII link beyond this, since the normal PHY MDIO will
then be used to monitor the media layer.

The Altera TSE MAC has only one way in which it can be configured with an
SGMII PCS, and as such, this patch only looks to the phy-mode to select
whether or not to attempt to initialise the PCS registers. During
initialisation, we report the PCS's equivalent of a PHY ID register.
This can be parameterised during the IP instantiation and is often left
as '0x00000000' which is not an error.

Signed-off-by: Neill Whillans <neill.whillans@codethink.co.uk>
Reviewed-by: Daniel Silverstone <daniel.silverstone@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: vitesse: add support for VSC8572

Add support for the Vitesse VSC8572 which is functionally equivalent to
the already supported VSC8574. As such, all the same handling functions
are used since the VSC8572 merely has half the number of phy blocks
internally.

Signed-off-by: Stephen Agate <stephen.agate@uk.thalesgroup.com>
Signed-off-by: Neill Whillans <neill.whillans@codethink.co.uk>
Reviewed-by: Daniel Silverstone <daniel.silverstone@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sfc-defalconisation-fixups'

Edward Cree says:

====================
sfc: defalconisation fixups

A bug fix, the Kconfig change, and cleaning up a bit more unused code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: remove RESET_TYPE_RX_RECOVERY

It's no longer used now that Falcon is gone.

Also remove a reference in a comment to an ioctl that doesn't exist.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: don't select SFC_FALCON

Easy enough for Falcon users to enable it when making oldconfig.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: fix debug message format string in efx_farch_handle_rx_not_ok

Defalconisation removed one of the string arguments, but missed the
corresponding %s.

Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: return the correct error code

Before this patch, function ndo_dflt_fdb_dump() will always return code
from uc fdb dump. The reture code of mc fdb dump is lost.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

audit: remove useless synchronize_net()

netlink kernel socket is protected by refcount, not RCU.
Its rcv path is neither protected by RCU. So the synchronize_net()
is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mdix_ctrl'

Raju Lakkaraju says:

====================
Adding PHY MDI(X) support

I updated all review comments which were given by Andrew and Florian.

This series add support for PHY MDI(X), and implement it for MSCC phys.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v1:
- Initial patch submit the WoL and MDI-X in single set of patches

v2:
- Split the mdi(x) as signal set of patches.
- Remove the out_unlock as suggested by Andrew.
- Add mdix_ctrl parameter in "phy_device" to handle the user configure
mdi(x). Proposed implementation accepted by Florian.
- phydev->mdix_ctrl initialize with ETH_TP_MDI_AUTO. Ethernet controller
never initialize this parameter.
- Fix the mdix changes in marvell and microchip driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Fix the mdix_ctrl changes

PHY drivers to have an eth_tp_mdix_ctrl to indicate what is the configured
MDI setting, and read eth_tp_mdi to indicate what is the current status,

Add new parameter mdix_ctrl in phy_device structure and fix driver.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Add mdi(x) support in Microsemi PHYs driver

To connect two ports of the same configuration (MDI to MDI or
MDI-X to MDI-X) with a 10/100/1000 Mbit/s connection, an
Ethernet crossover cable is needed to cross over the transmit
and receive signals in the cable, so that they are matched at
the connector level.
When connecting an MDI port to an MDI-X port a straight through
cable is used while to connect two MDI ports or two MDI-X ports
a crossover cable must be used. Conventionally MDI is used on end
devices while MDI-X is used on hubs and switches

Auto MDI-X automatically detects the required cable connection
type and configures the connection appropriately, removing the
need for crossover cables to interconnect switches or connecting
PCs peer-to-peer.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: update the mdix_ctrl with correct value.

Update the mdix and mdix_ctrl with corresponding ethtool configuration
parameters.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add mdix_ctrl to hold the user configuration.

Add new parameter mdix_ctrl to hold the user configuration.
Existing mdix maintain the current status of MDI(X) crossover performed or
not.
mdix_ctrl can configure either ETH_TP_MDI or ETH_TP_MDI_X orETH_TP_MDI_AUTO.

Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev

The mtu_adj is initialized to zero when alloc mem, there is no any
assignment to mtu_adj. It is only used in ipvlan_adjust_mtu as one
right value.
So it is useless member of struct ipvl_dev, then remove it.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet :mellanox :mlx5: Replace pci_pool_alloc by pci_pool_zalloc

In alloc_cmd_box(), pci_pool_alloc() followed by memset will be
replaced by pci_pool_zalloc()

Signed-off-by: Souptick joarder <jrdr.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc

In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
replaced by pci_pool_zalloc()

Signed-off-by: Souptick joarder <jrdr.linux@gmail.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.

Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
protocols.") made __build_flow_key call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Fixes: e2d118a1cb5e ("net: inet: Support UID-based routing in IP protocols.")
Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: add test for the verifier equal logic bug

This is a test to verify that

bpf: fix states equal logic for varlen access

actually fixed the problem. The problem was if the register we added to our map
register was UNKNOWN in both the false and true branches and the only thing that
changed was the range then we'd incorrectly assume that the true branch was
valid, which it really wasnt. This tests this case and properly fails without
my fix in place and passes with it in place.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cpsw-per-channel-shaping'

Ivan Khoronzhuk says:

====================
cpsw: add per channel shaper configuration

This series is intended to allow user to set rate for per channel
shapers at cpdma level. This patchset doesn't have impact on performance.
The rate can be set with:

echo 100 > /sys/class/net/ethX/queues/tx-0/tx_maxrate

Tested on am572xx
Based on net-next/master
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: split tx budget according between channels

Split device budget between channels according to channel rate.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: optimize end of poll cycle

Check budget fullness only after it's updated and update
channel mask only once to keep budget balance between channels.
It's also needed for farther changes.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: add .ndo to set per-queue rate

This patch allows to rate limit queues tx queues for cpsw interface.
The rate is set in absolute Mb/s units and cannot be more a speed
an interface is connected with.

The rate for a tx queue can be tested with:

ethtool -L eth0 rx 4 tx 4

echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
echo 200 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
echo 50 > /sys/class/net/eth0/queues/tx-2/tx_maxrate
echo 30 > /sys/class/net/eth0/queues/tx-3/tx_maxrate

tc qdisc add dev eth0 root handle 1: multiq

tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip\
dport 5001 0xffff action skbedit queue_mapping 0

tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip\
dport 5002 0xffff action skbedit queue_mapping 1

tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip\
dport 5003 0xffff action skbedit queue_mapping 2

tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip\
dport 5004 0xffff action skbedit queue_mapping 3

iperf -c 192.168.2.1 -b 110M -p 5001 -f m -t 60
iperf -c 192.168.2.1 -b 215M -p 5002 -f m -t 60
iperf -c 192.168.2.1 -b 55M -p 5003 -f m -t 60
iperf -c 192.168.2.1 -b 32M -p 5004 -f m -t 60

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: davinci_cpdma: add set rate for a channel

The cpdma has 8 rate limited tx channels. This patch adds
ability for cpdma driver to use 8 tx h/w shapers. If at least one
channel is not rate limited then it must have higher number, this
is because the rate limited channels have to have higher priority
then not rate limited channels. The channel priority is set in low-hi
direction already, so that when a new channel is added with ethtool
and it doesn't have rate yet, it cannot affect on rate limited
channels. It can be useful for TSN streams and just in cases when
h/w rate limited channels are needed.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: davinci_cpdma: add weight function for channels

The weight of a channel is needed to split descriptors between
channels. The weight can depend on maximum rate of channels, maximum
rate of an interface or other reasons. The channel weight is in
percentage and is independent for rx and tx channels.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'qed-XDP-support'

Yuval Mintz says:

====================
qed*: Add XDP support

This patch series is intended to add XDP to the qede driver, although
it contains quite a bit of cleanups, refactorings and infrastructure
changes as well.

The content of this series can be roughly divided into:

- Datapath improvements - mostly focused on having the datapath utilize
parameters which can be more tightly contained in cachelines.
Patches #1, #2, #8, #9 belong to this group.

- Refactoring - done mostly in favour of XDP. Patches #3, #4, #5, #9.

- Infrastructure changes - done in favour of XDP. Paches #6 and #7 belong
to this category [#7 being by far the biggest patch in the series].

- Actual XDP support - last two patches [#10, #11].
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add support for XDP_TX

Add support for forwarding via XDP. Once the eBPF is attached,
driver would allocate & configure a designated transmission queue
meant solely for forwarding packets. Said queue would share the
receive-queue's interrupt line, and would have it's own Tx statistics.

Infrastructure changes required for this [spread-out through the code]:
- Determine the DMA direction of the receive buffers based on the presence
of the eBPF program.
- Turn the sw Tx ring into a union, as regular/XDP queues have different
needs for releasing resources after completion [regular requires the SKB,
XDP requires the transmitted page].

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Add basic XDP support

Add support for the ndo_xdp callback. This patch would support XDP_PASS,
XDP_DROP and XDP_ABORTED commands.

This also adds a per Rx queue statistic which counts number of packets
which didn't reach the stack [due to XDP].

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Better utilize the qede_[rt]x_queue

Improve the cacheline usage of both queues by reordering -
This reduces the cachelines required for egress datapath processing
from 3 to 2 and those required by ingress datapath processing by 2.

It also changes a couple of datapath related functions that currently
require either the fastpath or the qede_dev, changing them to be based
on the tx/rx queue instead.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Don't check netdevice for rx-hash

Receive-hashing is a fixed feature, so there's no need to check
during the ingress datapath whether it's set or not.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed*: Handle-based L2-queues.

The driver needs to maintain several FW/HW-indices for each one of
its queues. Currently, that mapping is done by the QED where it uses
an rx/tx array of so-called hw-cids, populating them whenever a new
queue is opened and clearing them upon destruction of said queues.

This maintenance is far from ideal - there's no real reason why
QED needs to maintain such a data-structure. It becomes even worse
when considering the fact that the PF's queues and its child VFs' queues
are all mapped into the same data-structure.
As a by-product, the set of parameters an interface needs to supply for
queue APIs is non-trivial, and some of the variables in the API
structures have different meaning depending on their exact place
in the configuration flow.

This patch re-organizes the way L2 queues are configured and maintained.
In short:
  - Required parameters for queue init are now well-defined.
  - Qed would allocate a queue-cid based on parameters.
    Upon initialization success, it would return a handle to caller.
  - Queue-handle would be maintained by entity requesting queue-init,
    not necessarily qed.
  - All further queue-APIs [update, destroy] would use the opaque
    handle as reference for the queue instead of various indices.

The possible owners of such handles:
  - PF queues [qede] - complete handles based on provided configuration.
  - VF queues [qede] - fw-context-less handles, containing only relative
    information; Only the PF-side would need the absolute indices
    for configuration, so they're omitted here.
  - VF queues [qed, PF-side] - complete handles based on VF initialization.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Revise state locking scheme

As qede utilizes an internal-reload sequence as result of various
configuration changes, the netif state wouldn't always accurately describe
the status of the configuration.
To compensate, we're storing an internal state of the device, which should
only be accessed under the qede_lock.

This patch fixes and improves several state/lock interactions:
  - The internal state should only be checked while locked.
  - While holding lock, it's preferable to check state rather than
    the netdevice's state.
  - The reload sequence is not 'atomic' - unload and subsequent load
    are not in the same critical section.

This also add the 'locked' variant for the reload, which would later be
used by XDP - useful in the case where the correct sequence is 'lock,
check state and re-configure if good', instead of allowing the reload
itself to make the decision regarding the configurability of the device.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Refactor data-path Rx flow

Driver's NAPI poll is using a long sequence for processing ingress
packets, and it's going to get even longer once we do XDP.
Break down the main loop into a series of sub-functions to allow
better readability of the function.

While we're at it, correct the accounting of the NAPI budget -
currently we're counting only packets passed to the stack against
the budget, even in case those are actually aggregations.
After refactoring every CQE processed would be counted against the budget.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Refactor statistics gathering

Refactor logic for gathering statistics into a per-queue function.
This improves readability of the driver statistics' flows.

In addition, this would be required by the XDP forwarding queues
[as we'll need the Txq statistics gathering methods for those as well].

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Remove 'num_tc'.

Driver currently doesn't support multi-CoS, but it contains logic
where multiple transmission queues could be theoretically manipulated.
No point in maintaining the infrastructure at the moment.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qed: Optimize qed_chain datapath usage

The chain structure and functions are widely used by the qed* modules,
both for configuration and datapath.
E.g., qede's Tx has one such chain and its Rx has two.

Currently, the strucutre's fields which are required for datapath
related functions [produce/consume] are intertwined with fields which
are required only for configuration purposes [init/destroy/etc.].

This patch re-arranges the chain structure so that all the fields which
are required for datapath usage could reside in a single cacheline instead
of the two which are required today.

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Optimize aggregation information size

Driver needs to maintain a structure per-each concurrent possible
open aggregation, but the structure storing that metadata is far from
being optimized - biggest waste in it is that there are 2 buffer metadata,
one for a replacement buffer when the aggregation begins and the other for
holding the first aggregation's buffer after it begins [as firmware might
still update it]. Those 2 can safely be united into a single metadata
structure.

struct qede_agg_info changes the following:

/* size: 120, cachelines: 2, members: 9 */
/* sum members: 114, holes: 1, sum holes: 4 */
/* padding: 2 */
/* paddings: 2, sum paddings: 8 */
/* last cacheline: 56 bytes */
-->
/* size: 48, cachelines: 1, members: 9 */
/* paddings: 1, sum paddings: 4 */
/* last cacheline: 48 bytes */

Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ehea: Remove unnecessary memset of stats in netdev private data

The memory for netdev private data is allocated using kzalloc/vzalloc in
alloc_netdev_mqs, thus there is no need to zero the stats portion of it
again in the driver's probe function.

In any case, the size for the memset is wrong as the stats member is of
type rtnl_link_stats64, not net_device_stats.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup, bpf: remove unnecessary #include

this #include is unnecessary and brings whole set of
other headers into cgroup-defs.h. Remove it.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

neigh: remove duplicate check for same neigh

Currently loop index 'idx' is used as the index in the neigh list of interest.
It's increased only when the neigh is dumped. It's not the absolute index in
the list. Because there is no info to record which neigh has already be scanned
by previous loop. This will cause the filtered out neighs to be scanned mulitple
times.

This patch make idx as the absolute index in the list, it will increase no matter
whether the neigh is filtered. This will prevent the above problem.

And this is in line with other dump functions.

v2:
- take David Ahern's advice to do simple change

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: fix include path

Fix the following build error:
HOSTCC samples/bpf/test_lru_dist.o
../samples/bpf/test_lru_dist.c:25:22: fatal error: bpf_util.h: No such file or directory

This is due to objtree != srctree.
Use srctree, since that's where bpf_util.h is located.

Fixes: e00c7b216f34 ("bpf: fix multiple issues in selftest suite and samples")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

macvtap: replace printk with netdev_err

This patch replaces printk() with netdev_err() for macvtap device.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'liquidio-VF'

Raghu Vatsavayi says:

====================
liquidio VF operations

This patchseries adds support for VF device specific operations
like mailbox, queues and register access. This V3 patchset also
has changes based on comments form earlier versions:

1) Removed extra 'void *' casting.
2) Fixed all cross compilations issues reported on S390 and Powerpc
architectures.

Please apply the patches in following order as these patches depend
on each other.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF init and destroy

Adds support for VF initialization and destroy resources.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF interrupt

Adds support for VF interrupt processing.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF mailbox

Adds support for VF mailbox setup.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: init VF softcommand queues

Adds support for initializing softcommand, dispatch and
instructions queues for VF.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF register access

This patch adds support for VF device register access.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF queue setup

Adds support for configuring VF input/output queues.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF config setup

Adds support for setting up VF configuration.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF registration

Adds support for cn23xx VF probe and registration.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio CN23XX: VF register definitions

Adds support for CN23xx VF registers.

Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples: bpf: Refactor test_cgrp2_attach -- use getopt, and add mode

This patch modifies test_cgrp2_attach to use getopt so we can use standard
command line parsing.

It also adds an option to run the program in detach only mode. This does
not attach a new filter at the cgroup, but only runs the detach command.

Lastly, it changes the attach code to not detach and then attach. It relies
on the 'hotswap' behaviour of CGroup BPF programs to be able to change
in-place. If detach-then-attach behaviour needs to be tested, the example
can be run in detach only mode prior to attachment.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: brocade: bna: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Acked-by: Rasesh Mody <Rasesh.Mody@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, xdp: allow to pass flags to dev_change_xdp_fd

Add an IFLA_XDP_FLAGS attribute that can be passed for setting up
XDP along with IFLA_XDP_FD, which eventually allows user space to
implement typical add/replace/delete logic for programs. Right now,
calling into dev_change_xdp_fd() will always replace previous programs.

When passed XDP_FLAGS_UPDATE_IF_NOEXIST, we can handle this more
graceful when requested by returning -EBUSY in case we try to
attach a new program, but we find that another one is already
attached. This will be used by upcoming front-end for iproute2 as
well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'broadcom-phy-internal-counters'

Florian Fainelli says:

====================
net: phy: broadcom: Support for PHY counters

This patch series adds support for reading the Broadcom PHYs internal counters.

Changes in v3:

- fixed the allocation of priv->stats in bcm7xxx

Changes in v2:

- fixed modular build reported by kbuild

- constify array of stats
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: bcm7xxx: Plug in support for reading PHY error counters

Broadcom BCM7xxx internal PHYs can leverage the Broadcom PHY library
module PHY error counters helper functions, just implement the
appropriate PHY driver function calls to do so. We need to allocate some
storage space for our PHY statistics, and provide it to the Broadcom PHY
library, so do this in a specific probe function, and slightly wrap the
get_stats function call.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: Add support code for reading PHY counters

Broadcom PHYs expose a number of PHY error counters: receive errors,
false carrier sense, SerDes BER count, local and remote receive errors.
Add support code to allow retrieving these error counters. Since the
Broadcom PHY library code is used by several drivers, make it possible
for them to specify the storage for the software copy of the statistics.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver

Rationale: The differences between Falcon and Siena are in many ways larger
than those between Siena and EF10 (despite Siena being nominally "Falcon-
architecture"); for instance, Falcon has no MCPU, so there is no MCDI.
Removing Falcon support from the sfc driver should simplify the latter,
and avoid the possibility of Falcon support being broken by changes to sfc
(which are rarely if ever tested on Falcon, it being end-of-lifed hardware).

The sfc-falcon driver created in this changeset is essentially a copy of the
sfc driver, but with Siena- and EF10-specific code, including MCDI, removed
and with the "efx_" identifier prefix changed to "ef4_" (for "EFX 4000-
series") to avoid collisions when both drivers are built-in.

This changeset removes Falcon from the sfc driver's PCI ID table; then in
sfc I've removed obvious Falcon-related code: I removed the Falcon NIC
functions, Falcon PHY code, and EFX_REV_FALCON_*, then fixed up everything
that referenced them.

Also, increment minor version of both drivers (to 4.1).

For now, CONFIG_SFC selects CONFIG_SFC_FALCON, so that updating old configs
doesn't cause Falcon support to disappear; but that should be undone at
some point in the future.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cpsw: ethtool: add support for nway reset

This patch adds support for ethtool's '-r' command. Restarting
N-WAY negotiation can be useful to activate newly changed EEE
settings etc.

Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-sender-chronographs'

Yuchung Cheng says:

====================
tcp: sender chronographs instrumentation

This patch set provides instrumentation on TCP sender limitations.
While developing the BBR congestion control, we noticed that TCP
sending process is often limited by factors unrelated to congestion
control: insufficient sender buffer and/or insufficient receive
window/buffer to saturate the network bandwidth. Unfortunately these
limits are not visible to the users and often the poor performance
is attributed to the congestion control of choice.

Thie patch aims to help users get the high level understanding of
where sending process is limited by, similar to the TCP_INFO design.
It is not to replace detailed kernel tracing and instrumentation
facilities.

In addition this patch set provide a new option to the timestamping
work to instrument these limits on application data unit. For exampe,
one can use SO_TIMESTAMPING and this patch set to measure the how
long a particular HTTP response is limited by small receive window.

Patch set was initially written by Francis Yan then polished
by Yuchung Cheng, with lots of help from Eric Dumazet and Soheil
Hassas Yeganeh.
====================

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING

This patch exports the sender chronograph stats via the socket
SO_TIMESTAMPING channel. Currently we can instrument how long a
particular application unit of data was queued in TCP by tracking
SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
these sender chronograph stats exported simultaneously along with
these timestamps allow further breaking down the various sender
limitation. For example, a video server can tell if a particular
chunk of video on a connection takes a long time to deliver because
TCP was experiencing small receive window. It is not possible to
tell before this patch without packet traces.

To prepare these stats, the user needs to set
SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
while requesting other SOF_TIMESTAMPING TX timestamps. When the
timestamps are available in the error queue, the stats are returned
in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: export sender limits chronographs to TCP_INFO

This patch exports all the sender chronograph measurements collected
in the previous patches to TCP_INFO interface. Note that busy time
exported includes all the other sending limits (rwnd-limited,
sndbuf-limited). Internally the time unit is jiffy but externally
the measurements are in microseconds for future extensions.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: instrument how long TCP is limited by insufficient send buffer

This patch measures the amount of time when TCP runs out of new data
to send to the network due to insufficient send buffer, while TCP
is still busy delivering (i.e. write queue is not empty). The goal
is to indicate either the send buffer autotuning or user SO_SNDBUF
setting has resulted network under-utilization.

The measurement starts conservatively by checking various conditions
to minimize false claims (i.e. under-estimation is more likely).
The measurement stops when the SOCK_NOSPACE flag is cleared. But it
does not account the time elapsed till the next application write.
Also the measurement only starts if the sender is still busy sending
data, s.t. the limit accounted is part of the total busy time.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: instrument how long TCP is limited by receive window

This patch measures the total time when the TCP stops sending because
the receiver's advertised window is not large enough. Note that
once the limit is lifted we are likely in the busy status if we
have data pending.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: instrument how long TCP is busy sending

This patch measures TCP busy time, which is defined as the period
of time when sender has data (or FIN) to send. The time starts when
data is buffered and stops when the write queue is flushed by ACKs
or error events.

Note the busy time does not include SYN time, unless data is
included in SYN (i.e. Fast Open). It does include FIN time even
if the FIN carries no payload. Excluding pure FIN is possible but
would incur one additional test in the fast path, which may not
be worth it.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: instrument tcp sender limits chronographs

This patch implements the skeleton of the TCP chronograph
instrumentation on sender side limits:

1) idle (unspec)
2) busy sending data other than 3-4 below
3) rwnd-limited
4) sndbuf-limited

The limits are enumerated 'tcp_chrono'. Since a connection in
theory can idle forever, we do not track the actual length of this
uninteresting idle period. For the rest we track how long the sender
spends in each limit. At any point during the life time of a
connection, the sender must be in one of the four states.

If there are multiple conditions worthy of tracking in a chronograph
then the highest priority enum takes precedence over
the other conditions. So that if something "more interesting"
starts happening, stop the previous chrono and start a new one.

The time unit is jiffy(u32) in order to save space in tcp_sock.
This implies application must sample the stats no longer than every
49 days of 1ms jiffy.

Signed-off-by: Francis Yan <francisyyan@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cpsw: ethtool: add support for getting/setting EEE registers

Add the ability to query and set Energy Efficient Ethernet parameters
via ethtool for applicable devices.

This patch doesn't activate full EEE support in cpsw driver, but it
enables reading and writing EEE advertising settings. This way one
can disable advertising EEE for certain speeds.

Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: remove excessive logging on MTU change

When we change MTU or the number of channels on a netvsc device we get the
following logged:

hv_netvsc bf5edba8...: net device safe to remove
hv_netvsc: hv_netvsc channel opened successfully
hv_netvsc bf5edba8...: Send section size: 6144, Section count:2560
hv_netvsc bf5edba8...: Device MAC 00:15:5d:1e:91:12 link state up

This information is useful as debug at most.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-enhancements'

Jiri Pirko says:

====================
mlxsw: couple of enhancements and fixes

Couple of enhancements and fixes from Ido.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Change order of operations in removal path

We call bus->init() before allocating 'lag.mapping'. Change the order of
operations in removal path to reflect that.

This makes the error path of mlxsw_core_bus_device_register() symmetric
with mlxsw_core_bus_device_unregister().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Add missing rollback in error path

Without this rollback, the thermal zone is still registered during the
error path, whereas its private data is freed upon the destruction of
the underlying bus device due to the use of devm_kzalloc(). This results
in use after free.

Fix this by calling mlxsw_thermal_fini() from the appropriate place in
the error path.

Fixes: a50c1e35650b ("mlxsw: core: Implement thermal zone")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_buffers: Limit size of pools

The shared buffer pools are containers whose size is used to calculate
the maximum usage for packets from / to a specific port / {port, PG/TC},
when dynamic threshold is employed.

While it's perfectly fine for the sum of the pools to exceed the maximum
size of the shared buffer, a single pool cannot.

Add a check when the pool size is set and forbid sizes larger than the
maximum size of the shared buffer.

Without the patch:
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 999999999 thtype
dynamic
// No error is returned

With the patch:
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 999999999 thtype
dynamic
devlink answers: Invalid argument

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: resources: Add maximum buffer size

We need to be able to limit the size of shared buffer pools, so query
the maximum size from the device during init.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: switchib: add MLXSW_PCI dependency

The newly added switchib driver fails to link if MLXSW_PCI=m:

drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function^Cmlxsw_sib_module_exit':
switchib.c:(.exit.text+0x8): undefined reference to `mlxsw_pci_driver_unregister'
switchib.c:(.exit.text+0x10): undefined reference to `mlxsw_pci_driver_unregister'
drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_init':
switchib.c:(.init.text+0x28): undefined reference to `mlxsw_pci_driver_register'
switchib.c:(.init.text+0x38): undefined reference to `mlxsw_pci_driver_register'
switchib.c:(.init.text+0x48): undefined reference to `mlxsw_pci_driver_unregister'

The other two such sub-drivers have a dependency, so add the same one
here. In theory we could allow this driver if MLXSW_PCI is disabled,
but it's probably not worth it.

Fixes: d1ba52638456 ("mlxsw: switchib: Introduce SwitchIB and SwitchIB silicon driver")
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: Fix use after free in phy_detach()

If device_release_driver(&phydev->mdio.dev) is called, it releases all
resources belonging to the PHY device. Hence the subsequent call to
phy_led_triggers_unregister() will access already freed memory when
unregistering the LEDs.

Move the call to phy_led_triggers_unregister() before the possible call
to device_release_driver() to fix this.

Fixes: 2e0bc452f4721520 ("net: phy: leds: add support for led triggers on phy link state change")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Zach Brown <zach.brown@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: fix comments, make debug output consistent

Fix comments, add some new, and make debugfs output consistent.

Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: cgroup: fix documentation of __cgroup_bpf_update()

There's a 'not' missing in one paragraph. Add it.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Daniel Mack <daniel@zonque.org>
Reported-by: Rami Rosen <roszenrami@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'eee-broken-modes'

Jerome Brunet says:

====================
Fix OdroidC2 Gigabit Tx link issue

This patchset fixes an issue with the OdroidC2 board (DWMAC + RTL8211F).
The platform seems to enter LPI on the Rx path too often while performing
relatively high TX transfer. This eventually break the link (both Tx and
Rx), and require to bring the interface down and up again to get the Rx
path working again.

The root cause of this issue is not fully understood yet but disabling EEE
advertisement on the PHY prevent this feature to be negotiated.
With this change, the link is stable and reliable, with the expected
throughput performance.

The patchset adds options in the generic phy driver to disable EEE
advertisement, through device tree. The way it is done is very similar
to the handling of the max-speed property.

Changes since V2: [2]
- Rename "eee-advert-disable" to "eee-broken-modes" to make the intended
   purpose of this option clear (flag broken configuration, not a
   configuration option)
- Add DT bindings constants so the DT configuration is more user friendly
- Submit to net-next instead of net.

Changes since V1: [1]
- Disable the advertisement of EEE in the generic code instead of the
   realtek driver.

[1] : http://lkml.kernel.org/r/1479220154-25851-1-git-send-email-jbrunet@baylibre.com
[2] : http://lkml.kernel.org/r/1479742524-30222-1-git-send-email-jbrunet@baylibre.com
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

dt: bindings: add ethernet phy eee-broken-modes option documentation

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: add EEE capability constants

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add an option to disable EEE advertisement

This patch adds an option to disable EEE advertisement in the generic PHY
by providing a mask of prohibited modes corresponding to the value found in
the MDIO_AN_EEE_ADV register.

On some platforms, PHY Low power idle seems to be causing issues, even
breaking the link some cases. The patch provides a convenient way for these
platforms to disable EEE advertisement and work around the issue.

Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Yegor Yefremov <yegorslists@googlemail.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>