git.openwrt.org Git - openwrt/staging/blogic.git/log

net: sched: explicit locking in gso_cpu fallback

This work is preparing the qdisc layer to support egress lockless
qdiscs. If we are running the egress qdisc lockless in the case we
overrun the netdev, for whatever reason, the netdev returns a busy
error code and the skb is parked on the gso_skb pointer. With many
cores all hitting this case at once its possible to have multiple
sk_buffs here so we turn gso_skb into a queue.

This should be the edge case and if we see this frequently then
the netdev/qdisc layer needs to back off.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: a dflt qdisc may be used with per cpu stats

Enable dflt qdisc support for per cpu stats before this patch a dflt
qdisc was required to use the global statistics qstats and bstats.

This adds a static flags field to qdisc_ops that is propagated
into qdisc->flags in qdisc allocate call. This allows the allocation
block to completely allocate the qdisc object so we don't have
dangling allocations after qdisc init.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: provide per cpu qstat helpers

The per cpu qstats support was added with per cpu bstat support which
is currently used by the ingress qdisc. This patch adds a set of
helpers needed to make other qdiscs that use qstats per cpu as well.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: remove remaining uses for qdisc_qlen in xmit path

sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
of the routine only check if it is zero or not. Simplify the logic so
that we don't need to return an actual queue length value.

This introduces a case now where sch_direct_xmit would have returned
a qlen of zero but now it returns true. However in this case all
call sites of sch_direct_xmit will implement a dequeue() and get
a null skb and abort. This trades tracking qlen in the hotpath for
an extra dequeue operation. Overall this seems to be good for
performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: allow qdiscs to handle locking

This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improving
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().

Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason
for this is synchronizing a qlen counter across threads proves to
cost more than doing the enqueue/dequeue operations when tested with
pktgen.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: cleanup qdisc_run and __qdisc_run semantics

Currently __qdisc_run calls qdisc_run_end() but does not call
qdisc_run_begin(). This makes it hard to track pairs of
qdisc_run_{begin,end} across function calls.

To simplify reading these code paths this patch moves begin/end calls
into qdisc_run().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

virtio_net: Disable interrupts if napi_complete_done rescheduled napi

Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.

According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.

Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.

* Before patch:

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32763206      0    6430.32
212992           60.00     23384299           4589.56

Interrupts on guest: 9872369
Packets/interrupt:   2.37

* After patch

$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     32794646      0    6436.49
212992           60.00     32793501           6436.27

Interrupts on guest: 4941299
Packets/interrupt:   6.64

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2017-12-07

The following pull-request contains BPF updates for your net-next tree.

The main changes are:

1) Detailed documentation of BPF development process from Daniel.

2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops
from Lawrence.

3) Minor follow up for bpf_skb_set_tunnel_key() from William.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tuntap: fix possible deadlock when fail to register netdev

Private destructor could be called when register_netdev() fail with
rtnl lock held. This will lead deadlock in tun_free_netdev() who tries
to hold rtnl_lock. Fixing this by switching to use spinlock to
synchronize.

Fixes: 96f84061620c ("tun: add eBPF based queue selection method")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'smc-fixes-next'

Ursula Braun says:

====================
smc: fixes 2017-12-07

here are some smc-patches. The initial 4 patches are cleanups.
Patch 5 gets rid of ib_post_sends in tasklet context to avoid peer drops due
to out-of-order receivals.
Patch 6 makes sure, the Linux SMC code understands variable sized CLC proposal
messages built according to RFC7609.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

smc: support variable CLC proposal messages

According to RFC7609 [1] the CLC proposal message contains an area of
unknown length for future growth. Additionally it may contain up to
8 IPv6 prefixes. The current version of the SMC-code does not
understand CLC proposal messages using these variable length fields and,
thus, is incompatible with SMC implementations in other operating
systems.

This patch makes sure, SMC understands incoming CLC proposals
* with arbitrary length values for future growth
* with up to 8 IPv6 prefixes

[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: no consumer update in tasklet context

The SMC protocol requires to send a separate consumer cursor update,
if it cannot be piggybacked to updates of the producer cursor.
When receiving a blocked signal from the sender, this update is sent
already in tasklet context. In addition consumer cursor updates are
sent after data receival.
Sending of cursor updates is controlled by sequence numbers.
Assuming receiving stray messages the receiver drops updates with older
sequence numbers than an already received cursor update with a higher
sequence number.
Sending consumer cursor updates in tasklet context may result in
wrong order sends and its corresponding drops at the receiver. Since
it is sufficient to send consumer cursor updates once the data is
received, this patch gets rid of the consumer cursor update in tasklet
context to guarantee in-sequence arrival of cursor updates.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: cleanup close checking during data receival

When waiting for data to be received it must be checked if the
peer signals shutdown. The SMC code uses two different checks
for this purpose, even though just one check is sufficient.
This patch removes the superfluous test for SOCK_DONE.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: no update for unused sk_write_pending

The smc code never checks the sk_write_pending sock field.
Thus there is no need to update it.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: improve smc_clc_send_decline() error handling

Let smc_clc_send_decline() return with an error, if the amount
sent is smaller than the length of an smc decline message.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: make smc_close_active_abort() static

smc_close_active_abort() is used in smc_close.c only.
Make it static.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Allow compiling out legacy support

Introduce a configuration option: CONFIG_NET_DSA_LEGACY allowing to compile out
support for the old platform device and Device Tree binding registration.
Support for these configurations is scheduled to be removed in 4.17.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Don't print "Link speed -1 no longer supported" messages.

On some dual port NICs, the 2 ports have to be configured with compatible
link speeds.  Under some conditions, a port's configured speed may no
longer be supported.  The firmware will send a message to the driver
when this happens.

Improve this logic that prints out the warning by only printing it if
we can determine the link speed that is no longer supported.  If the
speed is unknown or it is in autoneg mode, skip the warning message.

Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-devel-doc'

Daniel Borkmann says:

====================
Few BPF doc updates

Two changes, i) add BPF trees into maintainers file, and ii) add
a BPF doc around the development process, similarly as we have
with netdev FAQ, but just describing BPF specifics. Thanks!
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, doc: add faq about bpf development process

In the same spirit of netdev FAQ, start a BPF FAQ as a collection
of expectations and/or workflow details in the context of BPF patch
processing.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, doc: add bpf trees and tps to maintainers entry

i) Add the bpf and bpf-next trees to the maintainers entry
   so they can be found easily and picked up by test bots
   etc that would integrate all trees from maintainers file.
   Suggested by Stephen while integrating the trees into
   linux-next.

ii) Add the two headers defining BPF/XDP tracepoints to the
    list of files as well.

Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ipvlan: Eliminate duplicated codes with existing function

The recv flow of ipvlan l2 mode performs as same as l3 mode for
non-multicast packet, so use the existing func ipvlan_handle_mode_l3
instead of these duplicated statements in non-multicast case.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: handle NETIF_F_HW_TC changes correctly

Currently, whenever the NETIF_F_HW_TC feature changes, we silently
always allow it, but we actually do not disable the flows in HW
on disable. That breaks user's expectations. So just forbid
the feature disable in case there are any filters offloaded.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tun: avoid unnecessary READ_ONCE in tun_net_xmit

The statement no longer serves a purpose.

Commit fa35864e0bb7 ("tuntap: Fix for a race in accessing numqueues")
added the ACCESS_ONCE to avoid a race condition with skb_queue_len.

Commit 436accebb530 ("tuntap: remove unnecessary sk_receive_queue
length check during xmit") removed the affected skb_queue_len check.

Commit 96f84061620c ("tun: add eBPF based queue selection method")
split the function, reading the field a second time in the callee.
The temp variable is now only read once, so just remove it.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rds: debug: fix null check on static array

t_name cannot be NULL since it is an array field of a struct.
Replacing null check on static array with string length check using
strnlen()

Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

act_mirred: get rid of mirred_list_lock spinlock

TC actions are no longer freed in RCU callbacks and we should
always have RTNL lock, so this spinlock is no longer needed.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

act_mirred: get rid of tcfm_ifindex from struct tcf_mirred

tcfm_dev always points to the correct netdev and we already
hold a refcnt, so no need to use tcfm_ifindex to lookup again.

If we would support moving target netdev across netns, using
pointer would be better than ifindex.

This also fixes dumping obsolete ifindex, now after the
target device is gone we just dump 0 as ifindex.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-add-ip6erspan-collect_md-mode'

William Tu says:

====================
ipv6: add ip6erspan collect_md mode

Similar to erspan collect_md mode in ipv4, the first patch adds
support for ip6erspan collect metadata mode. The second patch
adds the test case using bpf_skb_[gs]et_tunnel_key helpers.

The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151251545410047&w=2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: add ip6erspan sample code

Extend the existing tests for ip6erspan.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_gre: add ip6 erspan collect_md mode

Similar to ip6 gretap and ip4 gretap, the patch allows
erspan tunnel to operate in collect metadata mode.
bpf_skb_[gs]et_tunnel_key() helpers can make use of
it right away.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Uninitialized variable in bnxt_tc_parse_actions()

Smatch warns that:

drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c:160 bnxt_tc_parse_actions()
error: uninitialized symbol 'rc'.

"rc" is either uninitialized or set to zero here so we can just remove
the check.

Fixes: 8c95f773b4a3 ("bnxt_en: add support for Flower based vxlan encap/decap offload")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'macb-rx-filter-cleanups'

Julia Cartwright says:

====================
macb rx filter cleanups

Here's a proper patchset based on net-next.

v1 -> v2:
  - Rebased on net-next
  - Add Nicolas's Acks
  - Reorder commits, putting the list_empty() cleanups prior to the
    others.
  - Added commit reverting the GFP_ATOMIC change.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: change GFP_ATOMIC to GFP_KERNEL

Now that the rx_fs_lock is no longer held across allocation, it's safe
to use GFP_KERNEL for allocating new entries.

This reverts commit 81da3bf6e3f88 ("net: macb: change GFP_KERNEL to
GFP_ATOMIC").

Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: reduce scope of rx_fs_lock-protected regions

Commit ae8223de3df5 ("net: macb: Added support for RX filtering")
introduces a lock, rx_fs_lock which is intended to protect the list of
rx_flow items and synchronize access to the hardware rx filtering
registers.

However, the region protected by this lock is overscoped, unnecessarily
including things like slab allocation. Reduce this lock scope to only
include operations which must be performed atomically: list traversal,
addition, and removal, and hitting the macb filtering registers.

This fixes the use of kmalloc w/ GFP_KERNEL in atomic context.

Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering")
Cc: Rafal Ozieblo <rafalo@cadence.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: kill useless use of list_empty()

The list_for_each_entry() macro already handles the case where the list
is empty (by not executing the loop body). It's not necessary to handle
this case specially, so stop doing so.

Cc: Rafal Ozieblo <rafalo@cadence.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: Julia Cartwright <julia@ni.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: remove unused parameter from act cleanup ops

No one actually uses it.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-use-per-port-upstream-port'

Vivien Didelot says:

====================
net: dsa: use per-port upstream port

An upstream port is a local switch port used to reach a CPU port.

DSA still considers a unique CPU port in the whole switch fabric and
thus return a unique upstream port for a given switch. This is wrong in
a multiple CPU ports environment.

We are now switching to using the dedicated CPU port assigned to each
port in order to get rid of the deprecated unique tree CPU port.

This patchset makes the dsa_upstream_port() helper take a port argument
and goes one step closer complete support for multiple CPU ports.

Changes in v2:
- reverse-christmas-tree-fy variables
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: return per-port upstream port

The current dsa_upstream_port() helper still assumes a unique CPU port
in the whole switch fabric. This is becoming wrong, as every port in the
fabric has its dedicated CPU port, thus every port has an upstream port.

Add a port argument to the dsa_upstream_port() helper and fetch its CPU
port instead of the deprecated unique fabric CPU port. A CPU or unused
port has no dedicated CPU port, so return itself in this case.

At the same time, change the return value from u8 to unsigned int since
there is no need to limit the size here.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: assign a CPU port to DSA port

DSA ports also need to have a dedicated CPU port assigned to them,
because they need to know where to egress frames targeting the CPU,
e.g. To_Cpu frames received on a Marvell Tag port.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: setup global upstream port

Move the setup of the global upstream port within the
mv88e6xxx_setup_upstream_port function.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: helper to setup upstream port

Add a helper function to setup the upstream port of a given port.

This is the port used to reach the dedicated CPU port. This function
will be extended later to setup the global upstream port as well.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: egress floods all DSA ports

The mv88e6xxx driver currently assumes a single CPU port in the fabric
and thus floods frames with unknown DA on a single DSA port, the one
that is one hop closer to the CPU port.

With multiple CPU ports in mind, this isn't true anymore because CPU
ports could be found behind both DSA ports of a device in-between
others.

For example in a A <-> B <-> C fabric, both A and C having CPU ports,
device B will have to flood such frame to its two DSA ports.

This patch considers both CPU and DSA ports of a device as upstream
ports, where to flood frames with unknown DA addresses.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sch_api-style'

Alexander Aring says:

====================
net: sched: sch_api: fix coding style issues for extack

this patch prepares to handle extack for qdiscs and fixes checkpatch
issues.

There are a bunch of warnings issued by checkpatch which bothered me.
This first patchset is to get rid of those warnings to make way for
the next patchsets.

I plan to followup with qdiscs, classifiers and actions after this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: sch_api: rearrange init handling

This patch fixes the following checkpatch error:

ERROR: do not use assignment in if condition

by rearranging the if condition to execute init callback only if init
callback exists. The whole setup afterwards is called in any case,
doesn't matter if init callback is set or not. This patch has the same
behaviour as before, just without assign err variable in if condition.
It also makes the code easier to read.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: sch_api: fix code style issues

This patch fix checkpatch issues for upcomming patches according to the
sched api file. It changes checking on null pointer, remove unnecessary
brackets, add variable names for parameters and adjust 80 char width.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-enhanced-debug-dump-via-ethtool'

Simon Horman says:

====================
nfp: enhanced debug dump via ethtool

Add debug dump implementation to the NFP driver. This makes use of
existing ethtool infrastructure. ethtool -W is used to select the dump
level and ethtool -w is used to dump NFP state.

The existing behaviour of dump level 0, dumping the arm.diag resource, is
preserved. Dump levels greater than 0 are implemented by this patchset and
optionally supported by firmware providing a _abi_dump_spec rtsym. This
rtsym provides a specification, in TLV format, of the information to be
dumped from the NFP at each supported dump level.

Dumps are also structured using a TLVs. They consist a prolog and the data
described int he corresponding dump.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump indirect ME CSRs

- The spec defines CSR address ranges for indirect ME CSRs. For Each TLV
  chunk in the spec, dump a chunk that includes the spec and the data
  over the defined address range.
- Each indirect CSR has 8 contexts. To read one context, first write the
  context to a specific derived address, read it back, and then read the
  register value.
- For each address, read and dump all 8 contexts in this manner.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump CPP, XPB and direct ME CSRs

- The spec defines CSR address ranges for these types.
- Dump each TLV chunk in the spec as a chunk that includes the spec and
the data over the defined address range.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump firmware name

Dump FW name as TLV, based on dump specification.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump single hwinfo field by key

- Add spec TLV for hwinfo field, containing key string as data.
- Add dump TLV for hwinfo field, with data being key and value as packed
zero-terminated strings.
- If specified hwinfo field is not found, dump the spec TLV as -ENOENT
error.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump all hwinfo

- Dump hwinfo as separate TLV chunk, in a packed format containing
zero-separated key and value strings.
- This provides additional debug context, if requested by the dumpspec.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump rtsyms

- Support rtsym TLVs.
- If specified rtsym is not found, dump the spec TLV as -ENOENT error.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dumpspec TLV traversal

- Perform dumpspec traversals for calculating size and populating the
dump.
- Initially, wrap all spec TLVs in dump error TLVs (changed by later
patches in the series).

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: dump prolog

- Use a TLV structure, with the typed chunks aligned to 8-byte sizes.
- Dump numeric fields as big-endian.
- Prolog contains the dump level.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: load debug dump spec

Load the TLV-based binary specification of what needs to be included in
a dump, from the "_abi_dump_spec" rtsymbol. If the symbol is not defined,
then dumps for levels >= 1 are not supported.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: debug dump ethtool ops

- Skeleton code to perform a binary debug dump via ethtoolops
  "set_dump", "get_dump_flags" and "get_dump_data", i.e. the ethtool
  -W/w mechanism.
- Skeleton functions for debugdump operations provided.
- An integer "dump level" can be specified, this is stored between
  ethtool invocations. Dump level 0 is still the "arm.diag" resource for
  backward compatibility. Other dump levels each define a set of state
  information to include in the dump, driven by a spec from FW.

Signed-off-by: Carl Heymann <carl.heymann@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

Both Eric and Paolo noticed the rcu_barrier() we use in
tcf_block_put_ext() could be a performance bottleneck when
we have a lot of tc classes.

Paolo provided the following to demonstrate the issue:

tc qdisc add dev lo root htb
for I in `seq 1 1000`; do
        tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
        tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
        for J in `seq 1 10`; do
                tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
        done
done
time tc qdisc del dev root

real    0m54.764s
user    0m0.023s
sys     0m0.000s

The rcu_barrier() there is to ensure we free the block after all chains
are gone, that is, to queue tcf_block_put_final() at the tail of workqueue.
We can achieve this ordering requirement by refcnt'ing tcf block instead,
that is, the tcf block is freed only when the last chain in this block is
gone. This also simplifies the code.

Paolo reported after this patch we get:

real    0m0.017s
user    0m0.000s
sys     0m0.017s

Tested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ieee802154-for-davem-2017-12-04' of git://git./linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================
pull-request: ieee802154-next 2017-12-04

Some update from ieee802154 to *net-next*

Jian-Hong Pan updated our docs to match the APIs in code.
Michael Hennerichs enhanced the adf7242 driver to work with adf7241
devices and reworked the IRQ and packet handling in the driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

netdevsim: make functions nsim_bpf_create_prog and nsim_bpf_destroy_prog static

Functions nsim_bpf_create_prog and nsim_bpf_destroy_prog are local to the
source and do not need to be in global scope, so make them static.

Cleans up sparse warnings:
symbol 'nsim_bpf_create_prog' was not declared. Should it be static?
symbol 'nsim_bpf_destroy_prog' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phylib-hard-resetting-devices'

Geert Uytterhoeven says:

====================
Teach phylib hard-resetting devices

This patch series adds optional PHY reset support to phylib.

The first two patches are destined for David's net-next tree. They add
core PHY reset code, and update a driver that currently uses its own
reset code.

The last two patches are destined for Simon's renesas tree.  They add
properties to describe the EthernetAVB PHY reset topology to the common
Salvator-X/XS and ULCB DTS files, which solves two issues:
  1. On Salvator-XS, the enable pin of the regulator providing PHY power
     is connected to PRESETn, and PSCI powers down the SoC during system
     suspend.  Hence a PHY reset is needed to restore network
     functionality after system resume.
  2. Linux should not rely on the boot loader having reset the PHY, but
     should reset the PHY during driver probe.

Changes compared to v3:
  - Remove Florian's Acked-by,
  - Add missing #include <linux/gpio/consumer.h>,
  - Re-add the gpiod check, as the dummy gpiod_set_value() for !GPIOLIB
    does not ignore NULL, and calls WARN_ON(1),
  - Do not reassert the reset signal if {mdio,phy}_probe() or
    phy_device_register() succeeded, as that may destroy initial setup,
  - Do not deassert the reset signal in {mdio,phy}_remove(), as it
    should already be deasserted,
  - Bring the PHY back into reset state in phy_device_remove(),
  - Move/consolidate GPIO descriptor acquiring code from
    of_mdiobus_register_phy() and of_mdiobus_register_device() to
    mdiobus_register_device().
    Note that this changes behavior slightly, in that the reset signal
    is now also asserted when called from of_mdiobus_register_device().
  - Add Reviewed-by,

Changes compared to v2, as sent by Sergei Shtylyov:
  - Fix fwnode_get_named_gpiod() call due to added parameters (which
    allowed to eliminate the gpiod_direction_output() call),
  - Rebased, refreshed, reworded,
  - Take over from Sergei,
  - Add Acked-by,
  - Remove unneeded gpiod check, as gpiod_set_value() handles NULL fine,
  - Handle fwnode_get_named_gpiod() errors correctly:
      - -ENOENT is ignored (the GPIO is optional), and turned into NULL,
which allowed to remove all later !IS_ERR() checks,
      - Other errors (incl. -EPROBE_DEFER) are propagated,
  - Extract DTS patches from series "[PATCH 0/4] ravb: Add PHY reset
    support" (https://www.spinics.net/lists/netdev/msg457308.html), and
    incorporate in this series, after moving reset-gpios from the
    ethernet to the ethernet-phy node.

Given (1) the new reset-gpios DT property in the PHY node follows
established practises, (2) the DT binding change in the first patch has
been acked by Rob, and (3) the DTS patch does not cause any regressions
if it is applied before the PHY driver patches, the DTS patches can be
applied independently.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

macb: Kill PHY reset code

With the phylib now being aware of the "reset-gpios" PHY node property,
there should be no need to frob the PHY reset in this driver anymore...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylib: Add device reset GPIO support

The PHY devices sometimes do have their reset signal (maybe even power
supply?) tied to some GPIO and sometimes it also does happen that a boot
loader does not leave it deasserted. So far this issue has been attacked
from (as I believe) a wrong angle: by teaching the MAC driver to manipulate
the GPIO in question; that solution, when applied to the device trees, led
to adding the PHY reset GPIO properties to the MAC device node, with one
exception: Cadence MACB driver which could handle the "reset-gpios" prop
in a PHY device subnode. I believe that the correct approach is to teach
the 'phylib' to get the MDIO device reset GPIO from the device tree node
corresponding to this device -- which this patch is doing...

Note that I had to modify the AT803x PHY driver as it would stop working
otherwise -- it made use of the reset GPIO for its own purposes...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Rob Herring <robh@kernel.org>
[geert: Propagate actual errors from fwnode_get_named_gpiod()]
[geert: Avoid destroying initial setup]
[geert: Consolidate GPIO descriptor acquiring code]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Richard Leitner <richard.leitner@skidata.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

flow_dissector: dissect tunnel info outside __skb_flow_dissect()

Move dissection of tunnel info to outside of the main flow dissection
function, __skb_flow_dissect(). The sole user of this feature, the flower
classifier, is updated to call tunnel info dissection directly, using
skb_flow_dissect_tunnel_info().

This results in a slightly less complex implementation of
__skb_flow_dissect(), in particular removing logic from that call path
which is not used by the majority of users. The expense of this is borne by
the flower classifier which now has to make an extra call for tunnel info
dissection.

This patch should not result in any behavioural change.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tun: add eBPF based queue selection method

This patch introduces an eBPF based queue selection method. With this,
the policy could be offloaded to userspace completely through a new
ioctl TUNSETSTEERINGEBPF.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hns3-reset-refactor'

Salil Mehta says:

====================
net: hns3: Refactors "reset" handling code in HCLGE layer of HNS3 driver

This patch refactors the code of the reset feature in HCLGE layer
of HNS3 PF driver. Prime motivation to do this change is:
1. To reduce the time for which common miscellaneous Vector 0
   interrupt is disabled because of the reset. Simplification
   of the common miscellaneous interrupt handler routine(for
   Vector 0) used to handle reset and other sources of Vector
   0 interrupt.
2. Separate the task for handling the reset
3. Simplification of reset request submission and pending reset
   logic.

To achieve above below few things have been done:
1. Interrupt is disabled while common miscellaneous interrupt
   handler is entered and re-enabled before it is exit. This
   reduces the interrupt handling latency as compared to older
   interrupt handling scheme where interrupt was being disabled
   in interrupt handler context and re-enabled in task context
   some time later. Made Miscellaneous interrupt handler more
   generic to handle all sources including reset interrupt source.
2. New reset service task has been introduced to service the
   reset handling.
3. Introduces new reset service task for honoring software reset
   requests like from network stack related to timeout and serving
   the pending reset request(to reset the driver and associated
   clients).

Change Log:
Patch V2: Addressed comment by Andrew Lunn
Link: https://lkml.org/lkml/2017/12/1/366
Patch V1: Initial Submit
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Refactors the requested reset & pending reset handling code

In exisiting code, the way to detect if driver/client reset should
be executed or if hardware should be be soft resetted was overly
complex.

Existing code use to read the interrupt status register from task
context to figure out if the interrupt source event was reset and
then use clear the interrupt source for reset while waiting for the
hardware to finish the reset. This behaviour again was confusing
and overly complex in terms of the flow.

This patch simplifies the handling of the requested reset and the
pending reset(i.e. reset which have already been asserted by the
software and hardware has acknowledged back to driver that it is
processing the hardware reset through interrupt)

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Add reset service task for handling reset requests

Existing common service task was being used to service the reset
requests. This patch tries to make the handling of reset cleaner
by separating task to handle the reset requests. This might in
turn help in adapting similar handling approach for other
interrupt events like mailbox, sharing vector 0 interrupt.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: Refactor of the reset interrupt handling logic

The reset interrupt event shares common miscellaneous interrupt
Vector 0. In the existing reset interrupt handling we disable
the Vector 0 interrupt in misc interrupt handler and re-enable
them later in context to common service task.

This also means other event sources like mailbox would also be
deferred or if the interrupt event was due to mailbox(which shall
be supported for VF soon), it could delay the reset handling.

This patch reorganizes the reset interrupt handling logic and
makes it more fair to other events.

Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: change GFP_KERNEL to GFP_ATOMIC

Function gem_add_flow_filter called on line 2958 inside lock on line 2949
but uses GFP_KERNEL

Generated by: scripts/coccinelle/locks/call_kern.cocci

Fixes: ae8223de3df5 ("net: macb: Added support for RX filtering")
CC: Rafal Ozieblo <rafalo@cadence.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'SFP-phylink-updates'

Russell King says:

====================
SFP/phylink updates

This series, which follows on from the fixes posted earlier, improves
the phylink/sfp support.  Changes included here are:

- Merge 802.3z and SGMII modes into one "in-band" mode, using the
  PHY_INTERFACE_MODE_xxx definition to determine which should be used.
  This allows more flexibility as more interface modes become
  available.

- Allow 2500base-X and 10GBASE-KR to be requested from SFP.

- Remove unused and unnecessary phylink_init_eee()

- Restart 802.3z autonegotiation when starting the network device to
  ensure that the negotiated parameters are always correct.  It has
  been observed on mvneta that this is not always the case without
  this change.

- Add kerneldoc documentation for phylink and sfp upstream facing APIs
  and link it in to the networking documentation.

- Resolve a sparse warning in sfp-bus.c

- Convert phylink/sfp to use fwnode rather than DT so that other firmware
  systems can take advantage of this - I have received a request for it
  to be usable with ACPI.  The exception to this is our interactions with
  phylib, as phylib itself does not yet support fwnode.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: convert to fwnode

Convert phylink to fwnode, switching phylink_create() from taking a
device_node to taking a fwnode_handle. This will allow other firmware
systems to take advantage of sfp/phylink support.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfp: convert to fwnode

Convert sfp-bus to use fwnode rather than device_node internally, so
we can support more than just device tree firmware.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfp: fix sparse warning

drivers/net/phy/sfp-bus.c:298:13: warning: context imbalance in 'sfp_bus_release' - wrong count at exit

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfp: add documentation for kernel APIs

Add kernel-doc documentation for sfp kernel APIs, and link it into the
networking kapi documentation under "Network device support".

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: add documentation for kernel APIs

Add kernel-doc documentation for phylink kernel APIs, and link it into
the networking kapi documentation under "Network device support".

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: restart 802.3z negotiation when starting net device

Restart 802.3z negotiation when the net device is brought up to ensure
that the link partner has our current link modes.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: remove phylink_init_eee()

phylink_init_eee() serves no purpose, remove it.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: add support for 2500baseX and 10GbaseKR

Add support for handling the faster 2.5G and 10G link modes when used
with SFP modules.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: get rid of separate Cisco SGMII and 802.3z modes

Since the handling of SGMII and 802.3z is now the same, combine the
MLO_AN_xxx constants into a single MLO_AN_INBAND, and use the PHY
interface mode to distinguish between Cisco SGMII and 802.3z.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: merge SGMII and 802.3z handling

The code handling SGMII and 802.3z is essentially the same, except that
we assume 802.3z has no PHY. Re-organise the code such that these cases
are merged, and exclude 802.3z mode from having a PHY attached. This
results in the same link handling behaviour as before.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy: add phy_interface_mode_is_8023z() helper

Add and use phy_interface_mode_is_8023z() helper to identify the
interface modes that use 802.3z negotiation. Use it in phylink's
phylink_mac_an_restart().

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: fix rtnl_link msghandler rcu annotations

Incorrect/missing annotations caused a few sparse warnings:

rtnetlink.c:155:15: incompatible types .. (different address spaces)
rtnetlink.c:157:23: incompatible types .. (different address spaces)
rtnetlink.c:185:15: incompatible types .. (different address spaces)
rtnetlink.c:285:15: incompatible types .. (different address spaces)
rtnetlink.c:317:9: incompatible types .. (different address spaces)
rtnetlink.c:3054:23: incompatible types .. (different address spaces)

no change in generated code.

Fixes: addf9b90de22f7 ("net: rtnetlink: use rcu to free rtnl message handlers")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: Add access to snd_cwnd and others in sock_ops

Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these
fields are only valid if the socket associated with the sock_ops program
call is a full socket, the field is_fullsock is also added to the
bpf_sock_ops struct. If the socket is not a full socket, reading these
fields returns 0.

Note that in most cases it will not be necessary to check is_fullsock to
know if there is a full socket. The context of the call, as specified by
the 'op' field, can sometimes determine whether there is a full socket.

The struct bpf_sock_ops has the following fields added:

  __u32 is_fullsock;      /* Some TCP fields are only valid if
                           * there is a full socket. If not, the
                           * fields read as zero.
   */
  __u32 snd_cwnd;
  __u32 srtt_us;          /* Averaged RTT << 3 in usecs */

There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add
read access to more 32 bit tcp_sock fields.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: move bpf csum flag check

trivial move the BPF_F_ZERO_CSUM_TX check right below the
'flags & BPF_F_DONT_FRAGMENT', so common tun_flags handling
is logically together.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
"virtio and qemu bugfixes

  A couple of bugfixes that just became ready"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: fix increment of vb->num_pfns in fill_balloon()
  virtio: release virtio index when fail to device_register
  fw_cfg: fix driver remove

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Various TCP control block fixes, including one that crashes with
    SELinux, from David Ahern and Eric Dumazet.

2) Fix ACK generation in rxrpc, from David Howells.

3) ipvlan doesn't set the mark properly in the ipv4 route lookup key,
    from Gao Feng.

4) SIT configuration doesn't take on the frag_off ipv4 field
    configuration properly, fix from Hangbin Liu.

5) TSO can fail after device down/up on stmmac, fix from Lars Persson.

6) Various bpftool fixes (mostly in JSON handling) from Quentin Monnet.

7) Various SKB leak fixes in vhost/tun/tap (mostly observed as
    performance problems). From Wei Xu.

8) mvpps's TX descriptors were not zero initialized, from Yan Markman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
  tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match()
  tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
  rxrpc: Fix the MAINTAINERS record
  rxrpc: Use correct netns source in rxrpc_release_sock()
  liquidio: fix incorrect indentation of assignment statement
  stmmac: reset last TSO segment size after device open
  ipvlan: Add the skb->mark as flow4's member to lookup route
  s390/qeth: build max size GSO skbs on L2 devices
  s390/qeth: fix GSO throughput regression
  s390/qeth: fix thinko in IPv4 multicast address tracking
  tap: free skb if flags error
  tun: free skb in early errors
  vhost: fix skb leak in handle_rx()
  bnxt_en: Fix a variable scoping in bnxt_hwrm_do_send_msg()
  bnxt_en: fix dst/src fid for vxlan encap/decap actions
  bnxt_en: wildcard smac while creating tunnel decap filter
  bnxt_en: Need to unconditionally shut down RoCE in bnxt_shutdown
  phylink: ensure we take the link down when phylink_stop() is called
  sfp: warn about modules requiring address change sequence
  sfp: improve RX_LOS handling
  ...

arch/tile: mark as orphaned

The chip family of TILEPro and TILE-Gx was developed by Tilera, which
was eventually acquired by Mellanox. The tile architecture was added to
the kernel in 2010 and first appeared in 2.6.36.

Now at Mellanox we are developing new chips based on the ARM64
architecture; our last TILE-Gx chip (the Gx72) was released in 2013, and
our customers using tile architecture products are not, as far as we
know, looking to upgrade to newer kernel releases. In the absence of
someone in the community stepping up to take over maintainership, this
commit marks the architecture as orphaned.

Cc: Chris Metcalf <metcalf@alum.mit.edu>
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

rtnetlink: ipv6: convert remaining users to rtnl_register_module

convert remaining users of rtnl_register to rtnl_register_module
and un-export rtnl_register.

Requested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-4.16-20171201' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-12-01

this is a pull request of 10 patches for net-next/master.

The first two patches are by Arnd Bergmann, they convert the peak_usb
from using "struct timeval" to "ktime_t". The error handling in the
vxcan driver is clean up by Markus Elfring's patch. Bhumika Goyal
contributes a patch for the c_can_pci driver to make the pci data const.
The six patches by Pankaj Bansal for the flexcan driver add LS1021A
support by making the endianness of the driver configurable by the
device tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2017-12-03

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Addition of a software model for BPF offloads in order to ease
   testing code changes in that area and make semantics more clear.
   This is implemented in a new driver called netdevsim, which can
   later also be extended for other offloads. SR-IOV support is added
   as well to netdevsim. BPF kernel selftests for offloading are
   added so we can track basic functionality as well as exercising
   all corner cases around BPF offloading, from Jakub.

2) Today drivers have to drop the reference on BPF progs they hold
   due to XDP on device teardown themselves. Change this in order
   to make XDP handling inside the drivers less error prone, and
   move disabling XDP to the core instead, also from Jakub.

3) Misc set of BPF verifier improvements and cleanups as preparatory
   work for upcoming BPF-to-BPF calls. Among others, this set also
   improves liveness marking such that pruning can be slightly more
   effective. Register and stack liveness information is now included
   in the verifier log as well, from Alexei.

4) nfp JIT improvements in order to identify load/store sequences in
   the BPF prog e.g. coming from memcpy lowering and optimizing them
   through the NPU's command push pull (CPP) instruction, from Jiong.

5) Cleanups to test_cgrp2_attach2.c BPF sample code in oder to remove
   bpf_prog_attach() magic values and replacing them with actual proper
   attach flag instead, from David.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rtnetlink-rework-handler-registration'

Florian Westphal says:

====================
rtnetlink: rework handler (un)registering

Peter Zijlstra reported (referring to commit 019a316992ee0d983,
"rtnetlink: add reference counting to prevent module unload while dump is in progress"):

1) it not in fact a refcount, so using refcount_t is silly
2) there is a distinct lack of memory barriers, so we can easily
    observe the decrement while the msg_handler is still in progress.
3) waiting with a schedule()/yield() loop is complete crap and subject
    life-locks, imagine doing that rtnl_unregister_all() from a RT task.

In ancient times rtnetlink exposed a statically-sized table with
preset doit/dumpit handlers to be called for a protocol/type pair.

Later the rtnl_register interface was added and the table was allocated
on demand.  Eventually these were also used by modules.

Problem is that nothing prevents module unload while a netlink dump
is in progress.  netlink dumps can be span multiple recv calls and
netlink core saves the to-be-repeated dumper address for later invocation.

To prevent rmmod the netlink core expects callers to pass in the owning
module so a reference can be taken.

So far rtnetlink wasn't doing this, add new interface to pass THIS_MODULE.
Moreover, when converting parts of the rtnetlink handling to rcu this code
gained way too many READ_ONCE spots, remove them and the extra refcounting.

Take a module reference when running dumpit and doit callbacks
and never alter content of rtnl_link structures after they have been
published via rcu_assign_pointer.

Based partially on earlier patch from Peter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: remove __rtnl_register

This removes __rtnl_register and switches callers to either
rtnl_register or rtnl_register_module.

Also, rtnl_register() will now print an error if memory allocation
failed rather than panic the kernel.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: use rtnl_register_module where needed

all of these can be compiled as a module, so use new
_module version to make sure module can no longer be removed
while callback/dump is in use.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: get reference on module before invoking handlers

Add yet another rtnl_register function. It will be used by modules
that can be removed.

The passed module struct is used to prevent module unload while
a netlink dump is in progress or when a DOIT_UNLOCKED doit callback
is called.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rtnetlink: use rcu to free rtnl message handlers

rtnetlink is littered with READ_ONCE() because we can have read accesses
while another cpu can write to the structure we're reading by
(un)registering doit or dumpit handlers.

This patch changes this so that (un)registering cpu allocates a new
structure and then publishes it via rcu_assign_pointer, i.e. once
another cpu can see such pointer no modifications will occur anymore.

based on initial patch from Peter Zijlstra.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: re-add mistakenly removed config settings

Previous patch mistakenly removed three chip-specific config settings.
Add them again.

Fixes: 80274abafc60 "net: phy: remove generic settings for callbacks config_aneg and read_status from drivers"
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv6-gre-collect_md'

William Tu says:

====================
add ip6 gre and gretap collect_md mode

Similar to gre, vxlan, geneve, ipip tunnels, allow ip6gretap tunnels to
operate in collect metadata mode. The first patch adds the support to
ip6_gre.c. The second patch enables unsetting the csum for ipv6 tunnel,
when using bpf_skb_[gs]et_tunnel_key() helpers. Finally, the last patch
adds the ip6 gre and gretap tunnel test cases to BPF sample code.

The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151216943128087&w=2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: extend test_tunnel_bpf.sh with ip6gre

Extend existing tests for vxlan, gre, geneve, ipip, erspan,
to include ip6 gre and gretap tunnel.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: allow disabling tunnel csum for ipv6

Before the patch, BPF_F_ZERO_CSUM_TX can be used only for ipv4 tunnel.
With introduction of ip6gretap collect_md mode, the flag should be also
supported for ipv6.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>