Yuchung Cheng [Wed, 16 May 2018 23:40:17 +0000 (16:40 -0700)]
tcp: don't mark recently sent packets lost on RTO
An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they were sent only recently (for example because
the flow was application-limited). This patch prohibits marking packets
sent within the last RTT as lost on an RTO event, using logic similar
to TCP RACK loss detection.
Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.
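For illustration, a minimal sketch of that marking rule (the helper
tcp_rack_skb_timeout() comes from the preparatory patch below; the
mark-lost call is a placeholder and the exact loop in the patch may
differ):

        /* On RTO: always mark the head lost, but skip packets that were
         * (re)sent within roughly one RACK RTT.
         */
        skb_rbtree_walk_from(skb) {
                if (tcp_is_rack(sk) && skb != head &&
                    tcp_rack_skb_timeout(tp, skb, 0) > 0)
                        continue;       /* sent too recently to judge lost */
                tcp_mark_skb_lost(sk, skb);     /* placeholder for the mark-lost helper */
        }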
Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).
This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:16 +0000 (16:40 -0700)]
tcp: new helper tcp_rack_skb_timeout
Create and export a new helper, tcp_rack_skb_timeout(), and move
tcp_is_rack() to prepare for the final RTO change.
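A plausible shape of the helper (sketch only; the real arithmetic lives
in tcp_recovery.c and the field names here are assumed): a positive
return value means the skb has been out longer than the RACK RTT plus
the reordering window and may be marked lost.

        s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb,
                                 u32 reo_wnd)
        {
                return tp->rack.rtt_us + reo_wnd -
                       tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
        }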
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:15 +0000 (16:40 -0700)]
tcp: separate loss marking and state update on RTO
Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.
But the inflight may not always be 0 on timeout once TCP marks
packets lost based on packet send time. Therefore we must
first mark the packets lost, then set the cwnd based on the
(updated) inflight.
This is not a pure refactor. Congestion control may potentially
break if it uses the (not yet updated) inflight to compute ssthresh.
Fortunately, no existing congestion control module does that.
The change also alters the inflight seen when CA_LOSS_EVENT is raised;
only Westwood processes that event, and it does not use the inflight.
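A minimal sketch of the new ordering in tcp_enter_loss() (assumed
structure, not the literal diff): losses are marked first, and cwnd is
then derived from the updated inflight.

        tcp_timeout_mark_lost(sk);              /* updates lost_out, hence inflight */
        tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
        tp->snd_cwnd = tcp_packets_in_flight(tp) + 1;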
This change has two other minor side benefits:
1) consistency with Fast Recovery, where the inflight is updated
before tcp_enter_recovery flips the state to CA_Recovery.
2) loss marking is no longer intertwined with the state update, making
the code more readable.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:14 +0000 (16:40 -0700)]
tcp: new helper tcp_timeout_mark_lost
Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
lost upon RTO.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:13 +0000 (16:40 -0700)]
tcp: account lost retransmit after timeout
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).
The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.
This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:12 +0000 (16:40 -0700)]
tcp: simpler NewReno implementation
This is a rewrite of the NewReno loss recovery implementation that is
simpler and standalone, for readability and better performance by
using fewer states.
Note that NewReno refers to RFC6582, a modification to the fast
recovery algorithm. It is used in Linux only if the connection does
not support SACK. It should not be confused with the Reno
(AIMD) congestion control.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:11 +0000 (16:40 -0700)]
tcp: disable RFC6675 loss detection
This patch disables RFC6675 loss detection and makes the sysctl
net.ipv4.tcp_recovery control a binary choice between RACK
(1) and RFC6675 (0).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Wed, 16 May 2018 23:40:10 +0000 (16:40 -0700)]
tcp: support DUPACK threshold in RACK
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.
When the number of packets SACKed is greater than or equal to the
threshold, RACK sets the reordering window to zero, which immediately
marks all the unsacked packets below the highest SACKed sequence as
lost. Since this approach is known to not work well with reordering,
RACK only uses it if no reordering has been observed.
The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by the RACK reordering timer: for example,
data-center transfers where the RTT is much smaller than a timer
tick, or high-RTT paths where the default RTT/4 may take too long.
Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.
With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.
There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).
Also, the minimum reordering window is reduced from 1 msec to 0
to recover more quickly on short-RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery, to reduce the
reordering window timeouts.
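A rough sketch of the shortcut described above, as it might look in the
reordering-window computation (field names such as tp->rack.reord,
tp->sacked_out and tp->reordering are assumed from the existing RACK
code, not quoted from the patch):

        /* No reordering seen so far: fall back to the classic DupThresh
         * rule and collapse the reordering window to zero.
         */
        if (!tp->rack.reord && tp->sacked_out >= tp->reordering)
                return 0;       /* reo_wnd = 0 */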
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Wed, 16 May 2018 22:21:45 +0000 (01:21 +0300)]
net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devices
Early am33xx devices (the ES1.0 SoC revision) have an erratum that
limits multi-queue (mq) support. It is the same erratum addressed by
commit 7da1160002f1 ("drivers: net: cpsw: add am335x errata workarround
for interrutps"):
AM33xx Errata [1] Advisory 1.0.9
http://www.ti.com/lit/er/sprz360f/sprz360f.pdf
Additional investigation found that the driver's workaround is applied
to all AM33xx SoCs and to DM814x, even though the erratum exists only
for ES1.0 of the AM33xx family, which needlessly limits mq support on
revisions after ES1.0. So, disable mq support only for the affected
SoCs and use separate polls for the revisions that allow mq.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 17 May 2018 16:46:55 +0000 (12:46 -0400)]
Merge branch 'sched-refactor-NOLOCK-qdiscs'
Paolo Abeni says:
====================
sched: refactor NOLOCK qdiscs
With the introduction of NOLOCK qdiscs, pfifo_fast performance in the
uncontended scenario degraded measurably, especially after commit
eb82a9944792 ("net: sched, fix OOO packets with pfifo_fast").
This series restores pfifo_fast performance in that scenario to its
previous level, mainly by reducing the number of atomic operations
required to perform the qdisc_run() call. Performance in the contended
scenario also increases measurably.
Note: This series is on top of:
sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 15 May 2018 14:24:37 +0000 (16:24 +0200)]
pfifo_fast: drop unneeded additional lock on dequeue
After the previous patch, for NOLOCK qdiscs, q->seqlock is always
held when dequeue() is invoked, so we can drop the additional locking
that protected that operation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 15 May 2018 14:24:36 +0000 (16:24 +0200)]
sched: replace __QDISC_STATE_RUNNING bit with a spin lock
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope as busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.
With this changeset we acquire the lock in the control path around the
flushing operation (qdisc reset), to allow further NOLOCK qdisc
performance improvements in the next patch.
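A minimal sketch of what the NOLOCK fast path looks like with the new
lock (assuming the seqlock field and TCQ_F_NOLOCK flag used for NOLOCK
qdiscs; the locked/seqcount path is unchanged and elided):

        static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        {
                if (qdisc->flags & TCQ_F_NOLOCK)
                        return spin_trylock(&qdisc->seqlock);
                /* ... existing qdisc->running seqcount path ... */
        }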
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 17 May 2018 02:47:11 +0000 (22:47 -0400)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocated through an IDR, similar to
BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Because the
up_read() of current->mm->mmap_sem cannot be done there, build_id
could not be parsed before. This work defers the up_read() via a
per-cpu irq_work so that at least limited support can be enabled,
from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion, as well as an optimized 32/64 bit immediate load in
the arm64 JIT that reduces the number of emitted instructions; tested
real-world programs shrank by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Wed, 16 May 2018 23:38:14 +0000 (16:38 -0700)]
bpf: sockmap, on update propagate errors back to userspace
When an error happens in the sockmap element update logic, also pass
the error up to the user.
Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Wed, 16 May 2018 21:06:26 +0000 (14:06 -0700)]
bpf: fix sock hashmap kmalloc warning
syzbot reported a kernel warning below:
WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
__do_kmalloc mm/slab.c:3713 [inline]
__kmalloc+0x25/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:517 [inline]
map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The test case is against a sock hashmap with a key size of 0xffffffe1.
Such a large key size causes the code below in sock_hash_alloc() to
overflow and produce a smaller elem_size, so map creation succeeds:

        htab->elem_size = sizeof(struct htab_elem) +
                          round_up(htab->map.key_size, 8);

Later, when map_get_next_key is called and the kernel tries and fails
to allocate the key, it issues the above warning.
Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.
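A minimal sketch of such a check in sock_hash_alloc() (the exact
placement and error code follow the applied patch, which is not quoted
here):

        /* Reject bogus key sizes before elem_size is computed. */
        if (attr->key_size > MAX_BPF_STACK)
                return ERR_PTR(-E2BIG);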
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Beckett [Wed, 16 May 2018 21:02:49 +0000 (14:02 -0700)]
libbpf: add ifindex to enable offload support
BPF programs currently can only be offloaded using iproute2. This
patch will allow programs to be offloaded using libbpf calls.
Signed-off-by: David Beckett <david.beckett@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Mathieu Malaterre [Wed, 16 May 2018 20:27:41 +0000 (22:27 +0200)]
bpf: add __printf verification to bpf_verifier_vlog
__printf is useful to verify format and arguments. The bpf_verifier_vlog()
function is used twice in verifier.c; in both cases the calling function
already uses the __printf gcc attribute.
Remove the following warning, triggered with W=1:
kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
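The annotation for a va_list-based printer is presumably of the
__printf(fmt-index, 0) form, which lets gcc check the format string
without checking variadic arguments; a sketch against the existing
declaration:

        __printf(2, 0)
        void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
                               va_list args);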
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
David Ahern [Tue, 15 May 2018 23:20:52 +0000 (16:20 -0700)]
samples/bpf: Decrement ttl in fib forwarding example
Only consider forwarding packets if the TTL in the received packet is > 1,
and decrement the TTL before handing off to bpf_redirect_map.
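A sketch of the IPv4 side of that change (the map and helper names
tx_port and ip_decrease_ttl() are taken as typical of the XDP samples,
not quoted from the patch):

        if (iph->ttl <= 1)
                return XDP_PASS;        /* let the stack handle expiring TTLs */
        ip_decrease_ttl(iph);           /* decrement TTL and fix up the checksum */
        return bpf_redirect_map(&tx_port, fib_params.ifindex, 0);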
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann [Wed, 16 May 2018 20:02:14 +0000 (22:02 +0200)]
Merge branch 'bpf-sock-hashmap'
John Fastabend says:
====================
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap, where an ifindex
has a nice 1:1 mapping into the map, we have found some use cases
with sockets that need to be referenced using longer keys.
This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.
To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.
v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
its own patch, and fixup the test patch stats to catch the
renamed file and provide only diffs ± on that.
v4: Add documentation to UAPI bpf.h
v5: Add documentation to tools UAPI bpf.h
v6: 'git add' test_sockhash_kern.c which was previously missing
but was not causing issues because of typo in test script,
noticed by Daniel. After this the git format-patch -M option
no longer tracks the rename of the test_sockmap_kern files for
some reason. I guess the diff has exceeded some threshold.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Mon, 14 May 2018 17:00:19 +0000 (10:00 -0700)]
bpf: bpftool, support for sockhash
This adds the SOCKHASH map type to bpftool so that we get correct
pretty printing.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Mon, 14 May 2018 17:00:18 +0000 (10:00 -0700)]
bpf: selftest additions for SOCKHASH
This runs the existing SOCKMAP tests with the SOCKHASH map type. To do
this we push the programs into an include file and build two BPF
programs: one for SOCKHASH and one for SOCKMAP.
We then run the entire test suite with each type.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Rahul Lakkireddy [Wed, 16 May 2018 14:21:15 +0000 (19:51 +0530)]
cxgb4: update LE-TCAM collection for T6
For T6, the clip table is separate from the main TCAM. So, update the
LE-TCAM collection logic to collect the clip table TCAM as well. IPv6
takes 4 entries in the clip table TCAM compared to 2 entries in the
main TCAM.
Also, in case of errors, keep the LE-TCAM collected so far and set the
status to partial dump.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:49:01 +0000 (14:49 -0400)]
Merge branch 'qed-LL2-fixes'
Michal Kalderon says:
====================
qed: LL2 fixes
This series fixes some issues in ll2 related to synchronization
and resource freeing
====================
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Michal Kalderon [Wed, 16 May 2018 11:44:40 +0000 (14:44 +0300)]
qed: Fix LL2 race during connection terminate
Stress on qedi/qedr load/unload led to list_del corruption.
This is due to ll2 connection terminate freeing resources without
verifying that no more ll2 processing will occur.
This patch unregisters the ll2 status block before terminating
the connection to ensure this race does not occur.
Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon [Wed, 16 May 2018 11:44:39 +0000 (14:44 +0300)]
qed: Fix possibility of list corruption during rmmod flows
The ll2 flows that flush the txq/rxq need to be synchronized with the
regular fp processing. This caused list corruption during load/unload
stress tests.
Fixes: 0a7fb11c23c0f ("qed: Add Light L2 support")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon [Wed, 16 May 2018 11:44:38 +0000 (14:44 +0300)]
qed: LL2 flush isles when connection is closed
The driver should free all pending isles once it gets a FLUSH CQE from
FW. This is part of the iSCSI out-of-order flow.
Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 16 May 2018 11:18:22 +0000 (19:18 +0800)]
net: ethoc: Remove useless test before clk_disable_unprepare
clk_disable_unprepare() already checks that the clock pointer is valid.
No need to test it before calling it.
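A sketch of the resulting simplification (the field name priv->clk is
illustrative):

        /* Before: */
        if (priv->clk)
                clk_disable_unprepare(priv->clk);

        /* After: clk_disable_unprepare() ignores a NULL clock. */
        clk_disable_unprepare(priv->clk);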
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 16 May 2018 10:59:19 +0000 (18:59 +0800)]
net: stmmac: Remove useless test before clk_disable_unprepare
clk_disable_unprepare() already checks that the clock pointer is valid.
No need to test it before calling it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hemanth Puranik [Wed, 16 May 2018 01:10:53 +0000 (06:40 +0530)]
net: qcom/emac: Encapsulate sgmii ops under one structure
This patch introduces an ops structure for sgmii. This ensures that
we do not need dummy functions on emulation platforms.
Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
Acked-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:23:05 +0000 (14:23 -0400)]
Merge branch 'rmnet-next'
Subash Abhinov Kasiviswanathan says:
====================
net: qualcomm: rmnet: Updates 2018-05-14
Patch 1 adds tx_drops counter to more places.
Patch 2 adds ethtool private stats support to make it easy to debug
the checksum offload path.
Patch 3 is a cleanup in command packet processing path.
v1->v2: Fix the incorrect if / else statement in
rmnet_map_checksum_downlink_packet() and define rmnet_ethtool_ops
as static as mentioned by kbuild test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:02 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Remove redundant command check
The command packet size is already checked once in
rmnet_map_deaggregate() for the header, packet and trailer size, so
this additional check is not needed.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:01 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Add support for ethtool private stats
Add an ethtool private stats handler to debug the handling of packets
with the checksum offload header / trailer. This makes it possible to
keep track of the number of packets for which hardware computes the
checksum, and of the counts and reasons where checksum computation was
skipped in hardware and done in the network stack instead.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan [Wed, 16 May 2018 00:52:00 +0000 (18:52 -0600)]
net: qualcomm: rmnet: Capture all drops in transmit path
Packets in transmit path could potentially be dropped if there were
errors while adding the MAP header or the checksum header.
Increment the tx_drops stats in these cases.
Additionally, refactor the code to free the packet and increment
the tx_drops stat under a single label.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 18:20:36 +0000 (14:20 -0400)]
Merge branch 'of-mdio-Fall-back-to-mdiobus_register-with-NULL-device_node'
Florian Fainelli says:
====================
of: mdio: Fall back to mdiobus_register() with NULL device_node
This patch series updates of_mdiobus_register() such that when the device_node
argument is NULL, it calls mdiobus_register() directly. This is consistent with
the behavior of of_mdiobus_register() when CONFIG_OF=n.
I only converted the most obvious drivers; there are others that have a much
less obvious behavior and specifically attempt to deal with CONFIG_ACPI.
Changes in v2:
- fixed build error in davinci_mdio.c (Grygorii)
- reworked first patch a bit: commit message, subject and removed useless
code comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 15 May 2018 23:56:19 +0000 (16:56 -0700)]
drivers: net: Remove device_node checks with of_mdiobus_register()
A number of drivers have the following pattern:

        if (np)
                of_mdiobus_register()
        else
                mdiobus_register()

which the implementation of of_mdiobus_register() now takes care of.
Remove that pattern in drivers that strictly adhere to it.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Fugang Duan <fugang.duan@nxp.com>
Reviewed-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 15 May 2018 23:56:18 +0000 (16:56 -0700)]
of: mdio: Fall back to mdiobus_register() with NULL device_node
When the device_node specified is NULL, fall back to mdiobus_register().
We have a number of drivers with a similar pattern, which is:

        if (np)
                of_mdiobus_register()
        else
                mdiobus_register()

so incorporate that behavior within the core of_mdiobus_register()
function. This is also consistent with the stub version that we defined
when CONFIG_OF=n.
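The fallback presumably amounts to an early return at the top of
of_mdiobus_register() (sketch):

        int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
        {
                if (!np)
                        return mdiobus_register(mdio);
                /* ... existing DT-based registration ... */
        }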
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Grygorii Strashko [Tue, 15 May 2018 23:37:25 +0000 (18:37 -0500)]
net: ethernet: ti: cpsw-phy-sel: check bus_find_device() ret value
This fixes Klocwork warnings: Pointer 'dev' returned from call to
function 'bus_find_device' at line 179 may be NULL and will be dereferenced
at line 181.
cpsw-phy-sel.c:179: 'dev' is assigned the return value from function 'bus_find_device'.
bus.c:342: 'bus_find_device' explicitly returns a NULL value.
cpsw-phy-sel.c:181: 'dev' is dereferenced by passing argument 1 to function 'dev_get_drvdata'.
device.h:1024: 'dev' is passed to function 'dev_get_drvdata'.
device.h:1026: 'dev' is explicitly dereferenced.
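A minimal sketch of the kind of check added (the error message and
return path in the applied fix may differ):

        dev = bus_find_device(&platform_bus_type, NULL, node, match);
        if (!dev) {
                pr_err("unable to find platform device for %pOF\n", node);
                goto out;
        }
        priv = dev_get_drvdata(dev);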
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
[nsekhar@ti.com: add an error message, fix return path]
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Wed, 16 May 2018 18:02:13 +0000 (14:02 -0400)]
Revert "bonding: allow carrier and link status to determine link state"
This reverts commit 1386c36b30388f46a95100924bfcae75160db715.
We don't want to encourage drivers to not report carrier status
correctly, therefore revert this commit.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roman Mashak [Tue, 15 May 2018 18:31:14 +0000 (14:31 -0400)]
tc-testing: updated mirred and vlan with more tests
Added extra test cases for different control actions (reclassify, pipe,
etc.), cookies, max values & exceeding the maximum, and for replacing
existing actions in the unit tests.
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roman Mashak [Tue, 15 May 2018 14:14:40 +0000 (10:14 -0400)]
tc-testing: fixed copy-pasting error in police tests
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 15 May 2018 08:50:31 +0000 (10:50 +0200)]
sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING bit. The bit is flipped twice per
packet in the uncontended scenario with a packet rate below the
line rate: on packet dequeue and on the next, failing dequeue attempt.
This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with a
measurable performance improvement in the uncontended scenario.
This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoids a
possible race between qdisc_run() and dev_deactivate_many(), as
some_qdisc_is_busy() can now properly detect NOLOCK qdiscs being busy
dequeuing packets.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 16:15:12 +0000 (12:15 -0400)]
Merge branch 'bonding-performance-and-reliability'
Debabrata Banerjee says:
====================
bonding: performance and reliability
Series of fixes to how rlb updates are handled, code cleanup, allowing
higher performance tx hashing in balance-alb mode, and reliability of
link up/down monitoring.
v2: refactor bond_is_nondyn_tlb with inline fn, update log comment to
point out that multicast addresses will not get rlb updates.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:10 +0000 (14:48 -0400)]
bonding: allow carrier and link status to determine link state
In a mixed environment it may be difficult to tell whether your hardware
supports carrier; if it does not, it may always report true. With the new
use_carrier option value of 2, we can check both carrier and link status
sequentially, instead of one or the other.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:09 +0000 (14:48 -0400)]
bonding: allow use of tx hashing in balance-alb
The rx load balancing provided by balance-alb is not mutually
exclusive with using hashing for tx selection, and should provide a decent
speed increase because this eliminates spinlocks and cache contention.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:08 +0000 (14:48 -0400)]
bonding: use common mac addr checks
Replace homegrown mac addr checks with faster defs from etherdevice.h
Note that this will also prevent any rlb arp updates for multicast
addresses, however this should have been forbidden anyway.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Debabrata Banerjee [Mon, 14 May 2018 18:48:07 +0000 (14:48 -0400)]
bonding: don't queue up extraneous rlb updates
ARPs for incomplete entries can't be sent anyway.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 15:49:20 +0000 (11:49 -0400)]
Merge branch 'net-smc-enhancements-2018-05-15'
Ursula Braun says:
====================
net/smc: enhancements 2018/05/15
Here are SMC patches for net-next. The first one is a fix for net-next
commit 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK").
Patch 7 improves Connection Layer Control error handling, patch 10
improves abnormal termination of link groups. The remaining patches
from Karsten improve Link Layer Control code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:03 +0000 (17:05 +0200)]
net/smc: check for pending termination
Avoid running the processing in smc_lgr_terminate() more than once by
remembering when link group termination has been triggered.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:02 +0000 (17:05 +0200)]
net/smc: drop messages when link state is inactive
Drop incoming messages when the link is flagged as inactive.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:01 +0000 (17:05 +0200)]
net/smc: set link inactive before calling smc_lgr_free()
Before smc_lgr_free() is called the link must be set inactive by calling
smc_llc_link_inactive().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:05:00 +0000 (17:05 +0200)]
net/smc: handle all error codes from smc_conn_create()
Always set a reason_code when smc_conn_create() returns an error code.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:59 +0000 (17:04 +0200)]
net/smc: use a workqueue to defer llc send
SMC handles deferred work in tasklets. As tasklets cannot sleep, this
can result in rare EBUSY conditions, so defer this work to a work queue.
The high-level API functions do not defer work because they can sleep
until the llc send is actually completed.
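The usual shape of such a deferral (the names here are hypothetical,
not taken from the SMC code):

        /* set up once */
        INIT_WORK(&lnk->llc_send_work, smc_llc_send_worker);

        /* from the tasklet / atomic context: queue instead of sending */
        schedule_work(&lnk->llc_send_work);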
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:58 +0000 (17:04 +0200)]
net/smc: move link llc initialization to llc layer
Move the llc layer specific initialization and cleanup out of smc_core.c
into smc_llc.c (smc_llc_link_init and smc_llc_link_clear). Move all
initialization of a link into the new init function.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:57 +0000 (17:04 +0200)]
net/smc: simplify test_link function usage
Make smc_llc_send_test_link() static and remove it from the header file.
To send a test_link response, set the response flag and send the
message back as-is, without using smc_llc_send_test_link(). Because
smc_llc_send_test_link() must no longer send responses, remove the
response flag handling from the function.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:56 +0000 (17:04 +0200)]
net/smc: remove unnecessary cast
Remove an unneeded (void *) cast from the calls to
smc_llc_send_message(). No functional changes.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Tue, 15 May 2018 15:04:55 +0000 (17:04 +0200)]
net/smc: register new rmbs with the peer
Register new rmb buffers with the remote peer by exchanging a
confirm_rkey llc message.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Tue, 15 May 2018 15:04:54 +0000 (17:04 +0200)]
net/smc: no tx work trigger for fallback sockets
If TCP_NODELAY is set or TCP_CORK is reset, setsockopt triggers the
tx worker. This does not make sense if the SMC socket fell back to
TCP when the connection was created. This patch adds the
additional check for the fallback case.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 15:36:39 +0000 (11:36 -0400)]
Merge tag 'mlx5e-updates-2018-05-14' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5e-updates-2018-05-14
Misc update for mlx5e netdevice driver
From Gal Pressman:
- Remove MLX5E_TEST_BIT macros and use test_bit instead
- Use __set_bit when possible
From Eran Ben Elisha:
- Improve debug print on initial RX posting timeout
From Or Gerlitz:
- Support offloaded TC flows with no matches on headers
- mlx5e TC cleanups
Trivial cleanups from Roi, Tariq and Saeed:
- Use bool as return type for mlx5e_xdp_handle
- Use u8 instead of int for LRO number of segments
- Skip redundant checks when providing NUD lastuse feedback
- Remove redundant vport context vlan update
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 May 2018 15:33:09 +0000 (11:33 -0400)]
Merge branch 'Misc-Bug-Fixes-and-clean-ups-for-HNS3-Driver'
Salil Mehta says:
====================
Misc. Bug Fixes and clean-ups for HNS3 Driver
This patch-set mainly introduces various bug fixes, cleanups and one
very small enhancement to the existing HNS3 driver code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:14 +0000 (19:20 +0100)]
net: hns3: Fixes the missing PCI iounmap for various legs
We call pcim_iomap in hclge_pci_init, so pcim_iounmap should be called
in the error handling of hclge_init_ae_dev.
We also do not call pcim_iounmap in hclge_pci_uninit. When we remove
hclge.ko and insert it again, the PCI regions can no longer be mapped.
pcim_iounmap needs to be called in hclge_pci_uninit as well.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Tue, 15 May 2018 18:20:13 +0000 (19:20 +0100)]
net: hns3: Add support of .sriov_configure in HNS3 driver
The HNS3 driver enables SRIOV by default and enables all the VFs the
HW supports. If the PF and VF drivers are compiled into the kernel,
the VF driver will by default run on the host, which is not right.
This patch adds support for hns3_driver.sriov_configure so that the
user can configure the number of VFs, and SRIOV is no longer enabled
by default.
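A sketch of the usual .sriov_configure wiring (the callback name is
assumed, not quoted from the patch):

        static int hns3_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
        {
                int ret;

                if (!num_vfs) {
                        pci_disable_sriov(pdev);
                        return 0;
                }
                ret = pci_enable_sriov(pdev, num_vfs);
                return ret ? ret : num_vfs;     /* convention: VFs enabled on success */
        }

        static struct pci_driver hns3_driver = {
                /* ... */
                .sriov_configure = hns3_pci_sriov_configure,
        };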
Signed-off-by: Peng Li <lipeng321@huawei.com>
Suggested-by: Zhou Wang <wangzhou1@hisilicon.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 15 May 2018 18:20:12 +0000 (19:20 +0100)]
net: hns3: Fix for fiber link up problem
When hclge_ae_start is called, hdev->hw.mac.link may be set
to one after multiple up/down cycles, which does not correspond to
the link state of the netdev when the netdev is up.
Fix this by setting hdev->hw.mac.link to zero when
hclge_ae_start is called.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Tue, 15 May 2018 18:20:11 +0000 (19:20 +0100)]
net: hns3: Fixes the back pressure setting when sriov is enabled
When sriov is enabled, the Qset and tc mapping is no longer a
one-to-one relation.
This patch fixes it by mapping all PF and VF Qsets to the tc.
Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:10 +0000 (19:20 +0100)]
net: hns3: Change return value in hnae3_register_client
A client includes many client instances. Just like ae_algo, a failure
to initialize a client instance does not mean that registering the
client failed. Registering a client just adds it to the client list,
and that always succeeds. This patch changes the return value of
hnae3_register_client from a variable value to a fixed one, making the
function always return ok.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:09 +0000 (19:20 +0100)]
net: hns3: Change return type of hnae3_register_ae_algo
The ae_algo is used by many ae_devs; it does not belong to just one
ae_dev. A failure to initialize an ae_dev does not mean that registering
the ae_algo failed. Because registering an ae_algo just adds it to the
ae_algo list and that always succeeds, it makes no sense to define the
return type as int.
This patch changes the return type of hnae3_register_ae_algo from int to
void.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:08 +0000 (19:20 +0100)]
net: hns3: Change return type of hnae3_register_ae_dev
If hclge.ko has not been inserted, the value of ret is always zero
in hnae3_register_ae_dev. If hclge.ko has been inserted, the value
of ret may be zero or non-zero. Different execution paths give
different results, which is confusing.
An ae_dev whose initialization failed can be reinitialized when we
remove hclge.ko and insert it again. The case of initializing a client
instance is just like the case of initializing an ae_dev. The main job
of hnae3_register_ae_dev is adding the ae_dev to the ae_dev list.
Because adding an ae_dev is always ok, we do not need to return
anything from this function.
This patch changes the return type of hnae3_register_ae_dev from int
to void.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:07 +0000 (19:20 +0100)]
net: hns3: Add a check for client instance init state
If the client instance is initializd failed, we do not need to uninit it.
This patch adds a state check to check init state of client instance.
Fixes: 38caee9d3ee8 ("net: hns3: Add support of the HNAE3 framework")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:06 +0000 (19:20 +0100)]
net: hns3: Fix for the null pointer problem occurring when initializing ae_dev failed
When initializing the ae_dev fails while loading hclge.ko, the drvdata
is set to null. When removing hns3.ko, we then get a null ae_dev, which
causes the null pointer problem.
This patch removes pci_set_drvdata from the error handling of
hclge_init_ae_dev to fix the bug, since pci_set_drvdata has been called
in hns3_remove. Also, we do not need to uninit an ae_dev which was not
initialized; it may be the one whose initialization failed.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 15 May 2018 18:20:05 +0000 (19:20 +0100)]
net: hns3: Fix for deadlock problem occurring when unregistering ae_algo
When hnae3_unregister_ae_algo is called by the PF, pci_disable_sriov is
called, and then hns3_remove is called for the VF. We get deadlocked in
this case.
Since the VF pci device depends on the PF pci device, when the PF pci
device is removed the VF pci device must be removed as well. To solve
the deadlock, the VF pci device should be removed before the PF pci
device is removed.
This patch moves pci_enable/disable_sriov from hclge to hns3 to solve
the deadlock problem.
Also, we no longer need to return EPROBE_DEFER in hnae3_register_ae_dev,
because SRIOV is no longer enabled in the context calling
hnae3_register_ae_dev, and mutex_trylock can be replaced with mutex_lock.
Fixes: 424eb834a9be ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 15 May 2018 20:41:16 +0000 (16:41 -0400)]
Merge branch 'Microsemi-Ocelot-Ethernet-switch-support'
Alexandre Belloni says:
====================
Microsemi Ocelot Ethernet switch support
This series adds initial support for the Microsemi Ethernet switch
present on Ocelot SoCs.
This only has bridging (and STP) support for now and it uses the
switchdev framework.
Coming features are VLAN filtering, link aggregation, IGMP snooping.
The switch can also be connected to an external CPU using PCIe.
Also, support for integration on other SoCs will be submitted.
The ocelot dts changes are here for reference and should probably go
through the MIPS tree once the bindings are accepted.
Changes in v3:
- Collected Reviewed-by
* Switchdev driver:
- Fixed two issues reported by kbuild
- Modified ethtool statistics to support different layout on different chips and take care of counter overflow
Changes in v2:
- Dropped Microsemi Ocelot PHY support
* MIIM driver:
- Documented interrupts bindings
- Moved the driver to drivers/net/phy/
- Removed unused mutex
- Removed MDIO bus scanning
* Switchdev driver:
- Changed compatible to mscc,vsc7514-switch
- Removed unused header inclusion
- Factorized MAC table selection in ocelot_mact_select()
- Disable the port in ocelot_port_stop()
- Fixed the smatch endianness warnings
- int to unsigned int where necessary
- Removed VID handling for the FDB; it has been reworked anyway and will be
submitted with VLAN support
- Fixed up unused cases in ocelot_port_attr_set()
- Added a loop to register all the IO register spaces
- the ports are now in an ethernet-ports node
I've tried switching to NAPI but this is not working well, mainly because the
only way to disable interrupts is to actually mask them in the interrupt
controller (it is not possible to tell the switch to stop generating
interrupts).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre Belloni [Mon, 14 May 2018 20:05:00 +0000 (22:05 +0200)]
MAINTAINERS: Add entry for Microsemi Ethernet switches
Add myself as a maintainer for the Microsemi Ethernet switches.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre Belloni [Mon, 14 May 2018 20:04:57 +0000 (22:04 +0200)]
net: mscc: Add initial Ocelot switch support
Add a driver for Microsemi Ocelot Ethernet switch support.
This makes two modules:
mscc_ocelot_common handles all the common features that don't depend on
how the switch is integrated in the SoC. Currently, it handles offloading
bridging to the hardware. ocelot_io.c handles register accesses. This is
unfortunately needed because the register layout is packed and depends
on the number of ports available on the switch. The register definition
files are automatically generated.
ocelot_board handles the switch integration on the SoC and on the board.
Frame injection and extraction to/from the CPU port is currently done using
register accesses which is quite slow. DMA is possible but the port is not
able to absorb the whole switch bandwidth.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre Belloni [Mon, 14 May 2018 20:04:56 +0000 (22:04 +0200)]
dt-bindings: net: add DT bindings for Microsemi Ocelot Switch
DT bindings for the Ethernet switch found on Microsemi Ocelot platforms.
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre Belloni [Mon, 14 May 2018 20:04:55 +0000 (22:04 +0200)]
net: phy: mscc-miim: Add MDIO driver
Add a driver for the Microsemi MII Management controller (MIIM) found on
Microsemi SoCs.
On Ocelot, there are two controllers, one is connected to the internal
PHYs, the other one can communicate with external PHYs.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandre Belloni [Mon, 14 May 2018 20:04:54 +0000 (22:04 +0200)]
dt-bindings: net: add DT bindings for Microsemi MIIM
DT bindings for the Microsemi MII Management Controller found on Microsemi
SoCs
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Mon, 14 May 2018 17:00:17 +0000 (10:00 -0700)]
bpf: sockmap, add hash map support
Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.
To support this add hash support.
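A hypothetical sketch of using the new map type from a BPF program (the
map layout, section name and key struct here are illustrative; the
sockhash redirect helper is bpf_sk_redirect_hash):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct sock_key {               /* e.g. a 5-tuple-style key */
                __u32 sip, dip;
                __u16 sport, dport;
        };

        struct bpf_map_def SEC("maps") sock_hash = {
                .type        = BPF_MAP_TYPE_SOCKHASH,
                .key_size    = sizeof(struct sock_key),
                .value_size  = sizeof(int),
                .max_entries = 1024,
        };

        SEC("sk_skb/stream_verdict")
        int prog_verdict(struct __sk_buff *skb)
        {
                struct sock_key key = {};

                /* ... fill key from skb fields ... */
                return bpf_sk_redirect_hash(skb, &sock_hash, &key, 0);
        }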
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend [Mon, 14 May 2018 17:00:16 +0000 (10:00 -0700)]
bpf: sockmap, refactor sockmap routines to work with hashmap
This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.
Most of the fallout comes from three changes:
- Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
- Generalizing helpers to use void *key instead of the hardcoded
u32.
- Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Tue, 15 May 2018 00:11:29 +0000 (17:11 -0700)]
selftests/bpf: make sure build-id is on
--build-id may not be a default linker config.
Make sure it's used when linking the urandom_read test program.
Otherwise the test_stacktrace_build_id[_nmi] tests will be failing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Alexei Starovoitov [Tue, 15 May 2018 06:02:58 +0000 (23:02 -0700)]
Merge branch 'convert-doc-to-rst'
Jesper Dangaard Brouer says:
====================
The kernel is moving files under Documentation to use the RST
(reStructuredText) format and Sphinx [1]. This patchset converts the
files under Documentation/bpf/ into RST format. The Sphinx
integration is left as followup work.
[1] https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html
This patchset has been uploaded as branch bpf_doc10 on github[2], so
reviewers can see how GitHub renders this.
[2] https://github.com/netoptimizer/linux/tree/bpf_doc10/Documentation/bpf
====================
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jesper Dangaard Brouer [Mon, 14 May 2018 13:42:32 +0000 (15:42 +0200)]
bpf, doc: howto use/run the BPF selftests
I always forget how to run the BPF selftests. Thus, let's add that info
to the QA document.
Documentation was based on Cilium's documentation:
http://cilium.readthedocs.io/en/latest/bpf/#verifying-the-setup
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jesper Dangaard Brouer [Mon, 14 May 2018 13:42:27 +0000 (15:42 +0200)]
bpf, doc: convert bpf_devel_QA.rst to use RST formatting
Same story as bpf_design_QA.rst RST format conversion.
Again thanks to Quentin Monnet <quentin.monnet@netronome.com> for
fixes and patches that have been squashed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jesper Dangaard Brouer [Mon, 14 May 2018 13:42:22 +0000 (15:42 +0200)]
bpf, doc: convert bpf_design_QA.rst to use RST formatting
The RST formatting is done such that, when rendered or converted
to different formats, an automatic index with links to the
subsections is created.
Thus, the questions are created as sections (or subsections) in order
to get the desired auto-generated FAQ/QA index.
Special thanks to Quentin Monnet <quentin.monnet@netronome.com>, who
has reviewed and corrected both RST formatting and GitHub rendering
issues in this file. Those commits have been squashed.
I've manually tested that this also renders nicely when included as part
of the kernel 'make htmldocs', as the end goal is for this to become
more integrated with the kernel-doc project/movement.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jesper Dangaard Brouer [Mon, 14 May 2018 13:42:17 +0000 (15:42 +0200)]
bpf, doc: rename txt files to rst files
This will cause them to get auto-rendered, e.g. when viewing them on GitHub.
Follow-up patches will correct the content to be RST compliant.
Also adjust README.rst to point to the renamed files.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jesper Dangaard Brouer [Mon, 14 May 2018 13:42:12 +0000 (15:42 +0200)]
bpf, doc: add basic README.rst file
A README.rst file in a directory has special meaning for sites like
GitHub, which auto-renders its contents. Plus, search engines like
Google also index these README.rst files.
Auto rendering allows us to use links for (re)directing eBPF users to
other places where docs live. The end goal would be to direct users
towards https://www.kernel.org/doc/html/latest but we haven't written
the full docs yet, so we start out small and take this incrementally.
This directory itself contains some useful docs, which can be linked
to from the README.rst file (verified this works for GitHub).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Alexei Starovoitov [Tue, 15 May 2018 05:54:40 +0000 (22:54 -0700)]
Merge branch 'fix-samples'
Jakub Kicinski says:
====================
The following patches address build issues after the recent move to libbpf.
For out-of-tree builds we would see the following error:
gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory
The libbpf build system is now always invoked explicitly rather than
relying on building single objects most of the time. We need to
resolve the friction between Kbuild and the tools/ build system.
The mini-library called libbpf.h in samples is renamed to bpf_insn.h;
using linux/filter.h seems not completely trivial, since some samples
get upset when the order of the include search path is changed. We do
have to rename libbpf.h, however, because otherwise it's hard to
reliably get to libbpf's header in out-of-tree builds.
v2:
- fix the build error harder (patch 3);
- add patch 5 (make clang less noisy).
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Tue, 15 May 2018 05:35:06 +0000 (22:35 -0700)]
samples: bpf: make the build less noisy
Building samples with clang ignores the $(Q) setting, always
printing the full command to the output. Make it less verbose.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Tue, 15 May 2018 05:35:05 +0000 (22:35 -0700)]
samples: bpf: move libbpf from object dependencies to libs
Make complains that it doesn't know how to make libbpf.a:
scripts/Makefile.host:106: target 'samples/bpf/../../tools/lib/bpf/libbpf.a' doesn't match the target pattern
Now that we have it as a dependency of the sources, simply add libbpf.a
to the libraries, not the objects.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Tue, 15 May 2018 05:35:04 +0000 (22:35 -0700)]
samples: bpf: fix build after move to compiling full libbpf.a
There are many ways users may compile samples; some of them got
broken by commit
5f9380572b4b ("samples: bpf: compile and link
against full libbpf"). Improve path resolution and make building
libbpf a dependency of the source files to force its build.
Samples should now again build with any of:
cd samples/bpf; make
make samples/bpf/
make -C samples/bpf
cd samples/bpf; make O=builddir
make samples/bpf/ O=builddir
make -C samples/bpf O=builddir
export KBUILD_OUTPUT=builddir
make samples/bpf/
make -C samples/bpf
Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf")
Reported-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Tue, 15 May 2018 05:35:03 +0000 (22:35 -0700)]
samples: bpf: rename libbpf.h to bpf_insn.h
The libbpf.h file in samples is clashing with libbpf's header.
Since it only includes a subset of the filter.h instruction helpers,
rename it to bpf_insn.h. Drop the unnecessary include of bpf/bpf.h.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jakub Kicinski [Tue, 15 May 2018 05:35:02 +0000 (22:35 -0700)]
samples: bpf: include bpf/bpf.h instead of local libbpf.h
There are two files in the tree called libbpf.h, which is becoming
problematic. Most samples don't actually need the local libbpf.h;
they simply include it to get to bpf/bpf.h. Include bpf/bpf.h
directly instead.
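In each affected sample the change amounts to an include swap along
these lines (illustrative, not any specific file's diff):

  /* before: local samples/bpf helper header, used only to reach libbpf */
  #include "libbpf.h"

  /* after: include libbpf's user-space API header directly */
  #include <bpf/bpf.h>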
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
David S. Miller [Tue, 15 May 2018 03:15:27 +0000 (23:15 -0400)]
Merge branch 'sctp-Introduce-sctp_flush_ctx'
Marcelo Ricardo Leitner says:
====================
sctp: Introduce sctp_flush_ctx
This struct will hold all the context used during the outq flush, so we
don't have to pass lots of pointers all around.
Checked on x86_64: the compiler inlines all these functions and no
extra dereference is added because of the struct.
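A minimal sketch of what such a context struct might look like, based
only on the descriptions in this series (the exact field set and the
comments are assumptions, not copied from the patches):

  struct sctp_flush_ctx {
          struct sctp_outq *q;              /* queue being flushed               */
          struct sctp_transport *transport; /* transport currently being packed  */
          struct list_head transport_list;  /* transports with chunks to send    */
          struct sctp_association *asoc;    /* pre-computed owning association   */
          struct sctp_packet *packet;       /* packet on the current transport   */
          gfp_t gfp;                        /* allocation context for this flush */
  };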
This patchset depends on 'sctp: refactor sctp_outq_flush'
Changes since v1:
- updated to build on top of v2 of 'sctp: refactor sctp_outq_flush'
Changes since v2:
- fixed a rebase issue which reverted a change in patch 2.
- rebased on v3 of 'sctp: refactor sctp_outq_flush'
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:35:20 +0000 (14:35 -0300)]
sctp: checkpatch fixups
A collection of fixups from previous patches, left for later so as not
to introduce unnecessary changes while moving code around.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:35:19 +0000 (14:35 -0300)]
sctp: add asoc and packet to sctp_flush_ctx
Pre-compute these so the compiler won't reload them on every use
(the kernel is built with -fno-strict-aliasing, so it otherwise has
to assume they may alias and reload them).
Changes since v2:
- Do not replace a return with a break in sctp_outq_flush_data
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:35:18 +0000 (14:35 -0300)]
sctp: add sctp_flush_ctx, a context struct on outq_flush routines
With this struct we avoid passing lots of variables around and taking care
of updating the current transport/packet.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 15 May 2018 02:57:15 +0000 (22:57 -0400)]
Merge branch 'sctp-refactor-sctp_outq_flush'
Marcelo Ricardo Leitner says:
====================
sctp: refactor sctp_outq_flush
Currently sctp_outq_flush does many different and arguably unrelated
things, such as transport selection and outq dequeueing.
This patchset refactors it into smaller and more dedicated functions.
The end behavior should be the same.
The next patchset will rework the function parameters.
Changes since v1:
- fix build issues on patches 3 and 4, and updated 5 and 8 because of
it.
Changes since v2:
- fixed panic if building with just up to patch 3 applied
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:34:43 +0000 (14:34 -0300)]
sctp: rework switch cases in sctp_outq_flush_data
Remove the inner one, which tended to be error prone due to the
cascading, and which can be replaced by a simple if ().
Rework the outer one so that the actual flush code is not inside it. Now
we first validate whether we can send data, return if not, and only then
run the flush code.
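Roughly, the reworked function takes this shape (a simplified sketch:
the parameters, the exact set of states checked, and the body are
approximations of the description above, not the actual code):

  static void sctp_outq_flush_data(struct sctp_outq *q, int rtx_timeout)
  {
          struct sctp_association *asoc = q->asoc;

          /* Validate up front whether data may be sent at all. */
          switch (asoc->state) {
          case SCTP_STATE_ESTABLISHED:
          case SCTP_STATE_SHUTDOWN_PENDING:
          case SCTP_STATE_SHUTDOWN_RECEIVED:
                  break;
          default:
                  /* Cannot send data in this state: bail out early. */
                  return;
          }

          /* The actual flush loop now lives here, outside the switch;
           * the former inner switch on the transmit status has been
           * reduced to a simple if ().
           */
  }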
Suggested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:34:42 +0000 (14:34 -0300)]
sctp: make use of gfp on retransmissions
Retransmissions may be triggered when in user context, so let's make use
of gfp.
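A sketch of the idea (the wrapper name is hypothetical; only the gfp
threading is the point): code that used to hardcode GFP_ATOMIC now
takes the caller's gfp, so a flush triggered from user context may use
GFP_KERNEL:

  /* Hypothetical helper, illustrating the gfp threading only. */
  static int sctp_flush_one_packet(struct sctp_packet *pkt, gfp_t gfp)
  {
          return sctp_packet_transmit(pkt, gfp);  /* was: GFP_ATOMIC */
  }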
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:34:41 +0000 (14:34 -0300)]
sctp: move transport flush code out of sctp_outq_flush
To the new sctp_outq_flush_transports.
The comment on Nagle is outdated and has been removed. Nagle is performed
earlier, while checking if the chunk fits the packet: if the outq length is
not enough to fill the packet, it returns SCTP_XMIT_DELAY.
So by the time it gets to sctp_outq_flush_transports, it has to go through
all enlisted transports.
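A rough sketch of the extracted helper, with error handling and
transport bookkeeping trimmed (treat the signature and body as an
approximation of the description above, not the actual code):

  static void sctp_outq_flush_transports(struct list_head *transport_list,
                                         gfp_t gfp)
  {
          struct list_head *ltransport;
          struct sctp_transport *t;
          struct sctp_packet *packet;

          /* No Nagle check here: chunks that should be delayed never
           * made it onto a packet, so every enlisted transport is
           * simply transmitted.
           */
          while ((ltransport = sctp_list_dequeue(transport_list)) != NULL) {
                  t = list_entry(ltransport, struct sctp_transport, send_ready);
                  packet = &t->packet;
                  if (!sctp_packet_empty(packet))
                          sctp_packet_transmit(packet, gfp);
          }
  }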
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:34:40 +0000 (14:34 -0300)]
sctp: move flushing of data chunks out of sctp_outq_flush
To the new sctp_outq_flush_data. Again, smaller functions with
well-defined objectives.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 14 May 2018 17:34:39 +0000 (14:34 -0300)]
sctp: move outq data rtx code out of sctp_outq_flush
This patch renames the current sctp_outq_flush_rtx to __sctp_outq_flush_rtx
and creates a new sctp_outq_flush_rtx with the code that was in
sctp_outq_flush. Again, the idea is to have functions with small,
well-defined objectives.
Yes, there is open-coded path selection in the new sctp_outq_flush_rtx.
That is kept as is for now because it may look very different once we
implement retransmission path selection algorithms for CMT-SCTP.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>