openwrt/staging/blogic.git
5 years agolibbpf: fix using uninitialized ioctl results
Ilya Maximets [Tue, 23 Jul 2019 12:08:10 +0000 (15:08 +0300)]
libbpf: fix using uninitialized ioctl results

'channels.max_combined' initialized only on ioctl success and
errno is only valid on ioctl failure.

The code doesn't produce any runtime issues, but makes memory
sanitizers angry:

 Conditional jump or move depends on uninitialised value(s)
    at 0x55C056F: xsk_get_max_queues (xsk.c:336)
    by 0x55C05B2: xsk_create_bpf_maps (xsk.c:354)
    by 0x55C089F: xsk_setup_xdp_prog (xsk.c:447)
    by 0x55C0E57: xsk_socket__create (xsk.c:601)
  Uninitialised value was created by a stack allocation
    at 0x55C04CD: xsk_get_max_queues (xsk.c:318)

Additionally fixed warning on uninitialized bytes in ioctl arguments:

 Syscall param ioctl(SIOCETHTOOL) points to uninitialised byte(s)
    at 0x648D45B: ioctl (in /usr/lib64/libc-2.28.so)
    by 0x55C0546: xsk_get_max_queues (xsk.c:330)
    by 0x55C05B2: xsk_create_bpf_maps (xsk.c:354)
    by 0x55C089F: xsk_setup_xdp_prog (xsk.c:447)
    by 0x55C0E57: xsk_socket__create (xsk.c:601)
  Address 0x1ffefff378 is on thread 1's stack
  in frame #1, created by xsk_get_max_queues (xsk.c:318)
  Uninitialised value was created by a stack allocation
    at 0x55C04CD: xsk_get_max_queues (xsk.c:318)

CC: Magnus Karlsson <magnus.karlsson@intel.com>
Fixes: 1cad07884239 ("libbpf: add support for using AF_XDP sockets")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge branch 'fix-gso_segs'
Alexei Starovoitov [Tue, 23 Jul 2019 21:12:38 +0000 (14:12 -0700)]
Merge branch 'fix-gso_segs'

Eric Dumazet says:

====================
First patch changes the kernel, second patch
adds a new test.

Note that other patches might be needed to take
care of similar issues in sock_ops_convert_ctx_access()
and SOCK_OPS_GET_FIELD()
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: add another gso_segs access
Eric Dumazet [Tue, 23 Jul 2019 10:15:38 +0000 (03:15 -0700)]
selftests/bpf: add another gso_segs access

Use BPF_REG_1 for source and destination of gso_segs read,
to exercise "bpf: fix access to skb_shared_info->gso_segs" fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: fix access to skb_shared_info->gso_segs
Eric Dumazet [Tue, 23 Jul 2019 10:15:37 +0000 (03:15 -0700)]
bpf: fix access to skb_shared_info->gso_segs

It is possible we reach bpf_convert_ctx_access() with
si->dst_reg == si->src_reg

Therefore, we need to load BPF_REG_AX before eventually
mangling si->src_reg.

syzbot generated this x86 code :
   3:   55                      push   %rbp
   4:   48 89 e5                mov    %rsp,%rbp
   7:   48 81 ec 00 00 00 00    sub    $0x0,%rsp // Might be avoided ?
   e:   53                      push   %rbx
   f:   41 55                   push   %r13
  11:   41 56                   push   %r14
  13:   41 57                   push   %r15
  15:   6a 00                   pushq  $0x0
  17:   31 c0                   xor    %eax,%eax
  19:   48 8b bf c0 00 00 00    mov    0xc0(%rdi),%rdi
  20:   44 8b 97 bc 00 00 00    mov    0xbc(%rdi),%r10d
  27:   4c 01 d7                add    %r10,%rdi
  2a:   48 0f b7 7f 06          movzwq 0x6(%rdi),%rdi // Crash
  2f:   5b                      pop    %rbx
  30:   41 5f                   pop    %r15
  32:   41 5e                   pop    %r14
  34:   41 5d                   pop    %r13
  36:   5b                      pop    %rbx
  37:   c9                      leaveq
  38:   c3                      retq

Fixes: d9ff286a0f59 ("bpf: allow BPF programs access skb_shared_info->gso_segs field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: fix narrower loads on s390
Ilya Leoshkevich [Fri, 19 Jul 2019 09:18:15 +0000 (11:18 +0200)]
bpf: fix narrower loads on s390

The very first check in test_pkt_md_access is failing on s390, which
happens because loading a part of a struct __sk_buff field produces
an incorrect result.

The preprocessed code of the check is:

{
__u8 tmp = *((volatile __u8 *)&skb->len +
((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8)));
if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2;
};

clang generates the following code for it:

      0: 71 21 00 03 00 00 00 00 r2 = *(u8 *)(r1 + 3)
      1: 61 31 00 00 00 00 00 00 r3 = *(u32 *)(r1 + 0)
      2: 57 30 00 00 00 00 00 ff r3 &= 255
      3: 5d 23 00 1d 00 00 00 00 if r2 != r3 goto +29 <LBB0_10>

Finally, verifier transforms it to:

  0: (61) r2 = *(u32 *)(r1 +104)
  1: (bc) w2 = w2
  2: (74) w2 >>= 24
  3: (bc) w2 = w2
  4: (54) w2 &= 255
  5: (bc) w2 = w2

The problem is that when verifier emits the code to replace a partial
load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full load of
struct sk_buff field (*(u32 *)(r1 + 104)), an optional shift and a
bitwise AND, it assumes that the machine is little endian and
incorrectly decides to use a shift.

Adjust shift count calculation to account for endianness.

Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: fix sendmsg6_prog on s390
Ilya Leoshkevich [Fri, 19 Jul 2019 09:06:11 +0000 (11:06 +0200)]
selftests/bpf: fix sendmsg6_prog on s390

"sendmsg6: rewrite IP & port (C)" fails on s390, because the code in
sendmsg_v6_prog() assumes that (ctx->user_ip6[0] & 0xFFFF) refers to
leading IPv6 address digits, which is not the case on big-endian
machines.

Since checking bitwise operations doesn't seem to be the point of the
test, replace two short comparisons with a single int comparison.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agolibbpf: Avoid designated initializers for unnamed union members
Arnaldo Carvalho de Melo [Fri, 19 Jul 2019 14:34:07 +0000 (11:34 -0300)]
libbpf: Avoid designated initializers for unnamed union members

As it fails to build in some systems with:

  libbpf.c: In function 'perf_buffer__new':
  libbpf.c:4515: error: unknown field 'sample_period' specified in initializer
  libbpf.c:4516: error: unknown field 'wakeup_events' specified in initializer

Doing as:

    attr.sample_period = 1;

I.e. not as a designated initializer makes it build everywhere.

Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: fb84b8224655 ("libbpf: add perf buffer API")
Link: https://lkml.kernel.org/n/tip-hnlmch8qit1ieksfppmr32si@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agolibbpf: Fix endianness macro usage for some compilers
Arnaldo Carvalho de Melo [Fri, 19 Jul 2019 14:34:06 +0000 (11:34 -0300)]
libbpf: Fix endianness macro usage for some compilers

Using endian.h and its endianness macros makes this code build in a
wider range of compilers, as some don't have those macros
(__BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, __ORDER_BIG_ENDIAN__),
so use instead endian.h's macros (__BYTE_ORDER, __LITTLE_ENDIAN,
__BIG_ENDIAN) which makes this code even shorter :-)

Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 12ef5634a855 ("libbpf: simplify endianness check")
Fixes: e6c64855fd7a ("libbpf: add btf__parse_elf API to load .BTF and .BTF.ext")
Link: https://lkml.kernel.org/n/tip-eep5n8vgwcdphw3uc058k03u@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoMerge branch 'bpf-sockmap-tls-fixes'
Daniel Borkmann [Mon, 22 Jul 2019 14:04:17 +0000 (16:04 +0200)]
Merge branch 'bpf-sockmap-tls-fixes'

Jakub Kicinski says:

====================
John says:

Resolve a series of splats discovered by syzbot and an unhash
TLS issue noted by Eric Dumazet.

The main issues revolved around interaction between TLS and
sockmap tear down. TLS and sockmap could both reset sk->prot
ops creating a condition where a close or unhash op could be
called forever. A rare race condition resulting from a missing
rcu sync operation was causing a use after free. Then on the
TLS side dropping the sock lock and re-acquiring it during the
close op could hang. Finally, sockmap must be deployed before
tls for current stack assumptions to be met. This is enforced
now. A feature series can enable it.

To fix this first refactor TLS code so the lock is held for the
entire teardown operation. Then add an unhash callback to ensure
TLS can not transition from ESTABLISHED to LISTEN state. This
transition is a similar bug to the one found and fixed previously
in sockmap. Then apply three fixes to sockmap to fix up races
on tear down around map free and close. Finally, if sockmap
is destroyed before TLS we add a new ULP op update to inform
the TLS stack it should not call sockmap ops. This last one
appears to be the most commonly found issue from syzbot.

v4:
 - fix some use after frees;
 - disable disconnect work for offload (ctx lifetime is much
   more complex);
 - remove some of the dead code which made it hard to understand
   (for me) that things work correctly (e.g. the checks TLS is
   the top ULP);
 - add selftets.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/tls: add shutdown tests
Jakub Kicinski [Fri, 19 Jul 2019 17:29:27 +0000 (10:29 -0700)]
selftests/tls: add shutdown tests

Add test for killing the connection via shutdown.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/tls: close the socket with open record
Jakub Kicinski [Fri, 19 Jul 2019 17:29:26 +0000 (10:29 -0700)]
selftests/tls: close the socket with open record

Add test which sends some data with MSG_MORE and then
closes the socket (never calling send without MSG_MORE).
This should make sure we clean up open records correctly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/tls: add a bidirectional test
Jakub Kicinski [Fri, 19 Jul 2019 17:29:25 +0000 (10:29 -0700)]
selftests/tls: add a bidirectional test

Add a simple test which installs the TLS state for both directions,
sends and receives data on both sockets.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/tls: test error codes around TLS ULP installation
Jakub Kicinski [Fri, 19 Jul 2019 17:29:24 +0000 (10:29 -0700)]
selftests/tls: test error codes around TLS ULP installation

Test the error codes returned when TCP connection is not
in ESTABLISHED state.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/tls: add a test for ULP but no keys
Jakub Kicinski [Fri, 19 Jul 2019 17:29:23 +0000 (10:29 -0700)]
selftests/tls: add a test for ULP but no keys

Make sure we test the TLS_BASE/TLS_BASE case both with data
and the tear down/clean up path.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: sockmap/tls, close can race with map free
John Fastabend [Fri, 19 Jul 2019 17:29:22 +0000 (10:29 -0700)]
bpf: sockmap/tls, close can race with map free

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

v4:
 - make sure we don't free things for device;
 - remove the checks which swap the callbacks back
   only if TLS is at the top.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: sockmap, only create entry if ulp is not already enabled
John Fastabend [Fri, 19 Jul 2019 17:29:21 +0000 (10:29 -0700)]
bpf: sockmap, only create entry if ulp is not already enabled

Sockmap does not currently support adding sockets after TLS has been
enabled. There never was a real use case for this so it was never
added. But, we lost the test for ULP at some point so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case but fixup the bug here.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: sockmap, synchronize_rcu before free'ing map
John Fastabend [Fri, 19 Jul 2019 17:29:20 +0000 (10:29 -0700)]
bpf: sockmap, synchronize_rcu before free'ing map

We need to have a synchronize_rcu before free'ing the sockmap because
any outstanding psock references will have a pointer to the map and
when they use this could trigger a use after free.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: sockmap, sock_map_delete needs to use xchg
John Fastabend [Fri, 19 Jul 2019 17:29:19 +0000 (10:29 -0700)]
bpf: sockmap, sock_map_delete needs to use xchg

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call however uses a read, test, null over the
map entry which can result in both paths trying to free the map
entry.

To fix use xchg in TCP paths as well so we avoid having two references
to the same map entry.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/tls: fix transition through disconnect with close
John Fastabend [Fri, 19 Jul 2019 17:29:18 +0000 (10:29 -0700)]
net/tls: fix transition through disconnect with close

It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

v4:
 - add note about tls offload still needing the fix;
 - move sk_proto to the cold cache line;
 - split TX context free into "release" and "free",
   otherwise the GC work itself is in already freed
   memory;
 - more TX before RX for consistency;
 - reuse tls_ctx_free();
 - schedule the GC work after we're done with context
   to avoid UAF;
 - don't set the unhash in all modes, all modes "inherit"
   TLS_BASE's callbacks anyway;
 - disable the unhash hook for TLS_HW.

Fixes: 3c4d7559159bf ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/tls: remove sock unlock/lock around strp_done()
John Fastabend [Fri, 19 Jul 2019 17:29:17 +0000 (10:29 -0700)]
net/tls: remove sock unlock/lock around strp_done()

The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload make sure we don't arm the strparser until
we are sure init will be successful.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/tls: remove close callback sock unlock/lock around TX work flush
John Fastabend [Fri, 19 Jul 2019 17:29:16 +0000 (10:29 -0700)]
net/tls: remove close callback sock unlock/lock around TX work flush

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delay_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/tls: don't call tls_sk_proto_close for hw record offload
Jakub Kicinski [Fri, 19 Jul 2019 17:29:15 +0000 (10:29 -0700)]
net/tls: don't call tls_sk_proto_close for hw record offload

The deprecated TOE offload doesn't actually do anything in
tls_sk_proto_close() - all TLS code is skipped and context
not freed. Remove the callback to make it easier to refactor
tls_sk_proto_close().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/tls: don't arm strparser immediately in tls_set_sw_offload()
Jakub Kicinski [Fri, 19 Jul 2019 17:29:14 +0000 (10:29 -0700)]
net/tls: don't arm strparser immediately in tls_set_sw_offload()

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agolibbpf: sanitize VAR to conservative 1-byte INT
Andrii Nakryiko [Fri, 19 Jul 2019 19:46:03 +0000 (12:46 -0700)]
libbpf: sanitize VAR to conservative 1-byte INT

If VAR in non-sanitized BTF was size less than 4, converting such VAR
into an INT with size=4 will cause BTF validation failure due to
violationg of STRUCT (into which DATASEC was converted) member size.
Fix by conservatively using size=1.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agolibbpf: fix SIGSEGV when BTF loading fails, but .BTF.ext exists
Andrii Nakryiko [Fri, 19 Jul 2019 19:32:42 +0000 (12:32 -0700)]
libbpf: fix SIGSEGV when BTF loading fails, but .BTF.ext exists

In case when BTF loading fails despite sanitization, but BPF object has
.BTF.ext loaded as well, we free and null obj->btf, but not
obj->btf_ext. This leads to an attempt to relocate .BTF.ext later on
during bpf_object__load(), which assumes obj->btf is present. This leads
to SIGSEGV on null pointer access. Fix bug by freeing and nulling
obj->btf_ext as well.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agotcp: fix tcp_set_congestion_control() use from bpf hook
Eric Dumazet [Fri, 19 Jul 2019 02:28:14 +0000 (19:28 -0700)]
tcp: fix tcp_set_congestion_control() use from bpf hook

Neal reported incorrect use of ns_capable() from bpf hook.

bpf_setsockopt(...TCP_CONGESTION...)
  -> tcp_set_congestion_control()
   -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
     -> current_cred()
      -> rcu_dereference_protected(current->cred, 1)

Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.

As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.

The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.

Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoag71xx: fix return value check in ag71xx_probe()
Wei Yongjun [Fri, 19 Jul 2019 01:22:06 +0000 (01:22 +0000)]
ag71xx: fix return value check in ag71xx_probe()

In case of error, the function of_get_mac_address() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoag71xx: fix error return code in ag71xx_probe()
Wei Yongjun [Fri, 19 Jul 2019 01:21:57 +0000 (01:21 +0000)]
ag71xx: fix error return code in ag71xx_probe()

Fix to return error code -ENOMEM from the dmam_alloc_coherent() error
handling case instead of 0, as done elsewhere in this function.

Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agousb: qmi_wwan: add D-Link DWM-222 A2 device ID
Rogan Dawes [Wed, 17 Jul 2019 09:14:33 +0000 (11:14 +0200)]
usb: qmi_wwan: add D-Link DWM-222 A2 device ID

Signed-off-by: Rogan Dawes <rogan@dawes.za.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
Michael Chan [Wed, 17 Jul 2019 07:07:23 +0000 (03:07 -0400)]
bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.

Unlike legacy chips, 57500 chips don't need additional VNIC resources
for aRFS/ntuple.  Fix the code accordingly so that we don't reserve
and allocate additional VNICs on 57500 chips.  Without this patch,
the driver is failing to initialize when it tries to allocate extra
VNICs.

Fixes: ac33906c67e2 ("bnxt_en: Add support for aRFS on 57500 chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: dsa: sja1105: Fix missing unlock on error in sk_buff()
Wei Yongjun [Wed, 17 Jul 2019 06:29:56 +0000 (06:29 +0000)]
net: dsa: sja1105: Fix missing unlock on error in sk_buff()

Add the missing unlock before return from function sk_buff()
in the error handling case.

Fixes: f3097be21bf1 ("net: dsa: sja1105: Add a state machine for RX timestamping")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agogve: replace kfree with kvfree
Chuhong Yuan [Wed, 17 Jul 2019 02:05:11 +0000 (10:05 +0800)]
gve: replace kfree with kvfree

Variables allocated by kvzalloc should not be freed by kfree.
Because they may be allocated by vmalloc.
So we replace kfree with kvfree here.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Thu, 18 Jul 2019 21:04:45 +0000 (14:04 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) verifier precision propagation fix, from Andrii.

2) BTF size fix for typedefs, from Andrii.

3) a bunch of big endian fixes, from Ilya.

4) wide load from bpf_sock_addr fixes, from Stanislav.

5) a bunch of misc fixes from a number of developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests/bpf: fix test_xdp_noinline on s390
Ilya Leoshkevich [Wed, 17 Jul 2019 12:26:20 +0000 (14:26 +0200)]
selftests/bpf: fix test_xdp_noinline on s390

test_xdp_noinline fails on s390 due to a handful of endianness issues.
Use ntohs for parsing eth_proto.
Replace bswaps with ntohs/htons.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: fix "valid read map access into a read-only array 1" on s390
Ilya Leoshkevich [Thu, 18 Jul 2019 09:13:35 +0000 (11:13 +0200)]
selftests/bpf: fix "valid read map access into a read-only array 1" on s390

This test looks up a 32-bit map element and then loads it using a 64-bit
load. This does not work on s390, which is a big-endian machine.

Since the point of this test doesn't seem to be loading a smaller value
using a larger load, simply use a 32-bit load.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agonet/mlx5: Replace kfree with kvfree
Chuhong Yuan [Wed, 17 Jul 2019 10:14:57 +0000 (18:14 +0800)]
net/mlx5: Replace kfree with kvfree

Variable allocated by kvmalloc should not be freed by kfree.
Because it may be allocated by vmalloc.
So replace kfree with kvfree here.

Fixes: 9b1f298236057 ("net/mlx5: Add support for FW fatal reporter dump")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMAINTAINERS: update netsec driver
Ilias Apalodimas [Thu, 18 Jul 2019 14:38:30 +0000 (17:38 +0300)]
MAINTAINERS: update netsec driver

Add myself to maintainers since i provided the XDP and page_pool
implementation

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jassi Brar <jaswinder.singh@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: Unlink sibling route in case of failure
Ido Schimmel [Wed, 17 Jul 2019 20:39:33 +0000 (23:39 +0300)]
ipv6: Unlink sibling route in case of failure

When a route needs to be appended to an existing multipath route,
fib6_add_rt2node() first appends it to the siblings list and increments
the number of sibling routes on each sibling.

Later, the function notifies the route via call_fib6_entry_notifiers().
In case the notification is vetoed, the route is not unlinked from the
siblings list, which can result in a use-after-free.

Fix this by unlinking the route from the siblings list before returning
an error.

Audited the rest of the call sites from which the FIB notification chain
is called and could not find more problems.

Fixes: 2233000cba40 ("net/ipv6: Move call_fib6_entry_notifiers up for route adds")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'wireless-drivers-for-davem-2019-07-18' of git://git.kernel.org/pub/scm...
David S. Miller [Thu, 18 Jul 2019 19:00:16 +0000 (12:00 -0700)]
Merge tag 'wireless-drivers-for-davem-2019-07-18' of git://git./linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for 5.3

First set of fixes for 5.3.

iwlwifi

* add new cards for 9000 and 20000 series and qu c-step devices

ath10k

* workaround an uninitialised variable warning

rt2x00

* fix rx queue hand on USB
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoliquidio: Replace vmalloc + memset with vzalloc
Chuhong Yuan [Thu, 18 Jul 2019 07:45:42 +0000 (15:45 +0800)]
liquidio: Replace vmalloc + memset with vzalloc

Use vzalloc and vzalloc_node instead of using vmalloc and
vmalloc_node and then zeroing the allocated memory by
memset 0.
This simplifies the code.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoudp: Fix typo in net/ipv4/udp.c
Su Yanjun [Thu, 18 Jul 2019 02:19:23 +0000 (10:19 +0800)]
udp: Fix typo in net/ipv4/udp.c

Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bcmgenet: use promisc for unsupported filters
Justin Chen [Wed, 17 Jul 2019 21:58:53 +0000 (14:58 -0700)]
net: bcmgenet: use promisc for unsupported filters

Currently we silently ignore filters if we cannot meet the filter
requirements. This will lead to the MAC dropping packets that are
expected to pass. A better solution would be to set the NIC to promisc
mode when the required filters cannot be met.

Also correct the number of MDF filters supported. It should be 17,
not 16.

Signed-off-by: Justin Chen <justinpopo6@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: rt6_check should return NULL if 'from' is NULL
David Ahern [Wed, 17 Jul 2019 22:08:43 +0000 (15:08 -0700)]
ipv6: rt6_check should return NULL if 'from' is NULL

Paul reported that l2tp sessions were broken after the commit referenced
in the Fixes tag. Prior to this commit rt6_check returned NULL if the
rt6_info 'from' was NULL - ie., the dst_entry was disconnected from a FIB
entry. Restore that behavior.

Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Paul Donohue <linux-kernel@PaulSD.com>
Tested-by: Paul Donohue <linux-kernel@PaulSD.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: initialize 'validated' field of received packets
Jon Maloy [Wed, 17 Jul 2019 21:43:44 +0000 (23:43 +0200)]
tipc: initialize 'validated' field of received packets

The tipc_msg_validate() function leaves a boolean flag 'validated' in
the validated buffer's control block, to avoid performing this action
more than once. However, at reception of new packets, the position of
this field may already have been set by lower layer protocols, so
that the packet is erroneously perceived as already validated by TIPC.

We fix this by initializing the said field to 'false' before performing
the initial validation.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'ipv4-relax-source-validation-check-for-loopback-packets'
David S. Miller [Wed, 17 Jul 2019 22:23:39 +0000 (15:23 -0700)]
Merge branch 'ipv4-relax-source-validation-check-for-loopback-packets'

Cong Wang says:

====================
ipv4: relax source validation check for loopback packets

This patchset fixes a corner case when loopback packets get dropped
by rp_filter when we route them from veth to lo. Patch 1 is the fix
and patch 2 provides a simplified test case for this scenario.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: add a test case for rp_filter
Cong Wang [Wed, 17 Jul 2019 21:41:59 +0000 (14:41 -0700)]
selftests: add a test case for rp_filter

Add a test case to simulate the loopback packet case fixed
in the previous patch.

This test gets passed after the fix:

IPv4 rp_filter tests
    TEST: rp_filter passes local packets                                [ OK ]
    TEST: rp_filter passes loopback packets                             [ OK ]

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agofib: relax source validation check for loopback packets
Cong Wang [Wed, 17 Jul 2019 21:41:58 +0000 (14:41 -0700)]
fib: relax source validation check for loopback packets

In a rare case where we redirect local packets from veth to lo,
these packets fail to pass the source validation when rp_filter
is turned on, as the tracing shows:

  <...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0
  <...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130

So, the fib table lookup returns eth0 as the nexthop even though
the packets are local and should be routed to loopback nonetheless,
but they can't pass the dev match check in fib_info_nh_uses_dev()
without this patch.

It should be safe to relax this check for this special case, as
normally packets coming out of loopback device still have skb_dst
so they won't even hit this slow path.

Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'mlxsw-Two-fixes'
David S. Miller [Wed, 17 Jul 2019 22:19:46 +0000 (15:19 -0700)]
Merge branch 'mlxsw-Two-fixes'

Ido Schimmel says:

====================
mlxsw: Two fixes

This patchset contains two fixes for mlxsw.

Patch #1 from Petr fixes an issue in which DSCP rewrite can occur even
if the egress port was switched to Trust L2 mode where priority mapping
is based on PCP.

Patch #2 fixes a problem where packets can be learned on a non-existing
FID if a tc filter with a redirect action is configured on a bridged
port. The problem and fix are explained in detail in the commit message.

Please consider both patches for 5.2.y
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum: Do not process learned records with a dummy FID
Ido Schimmel [Wed, 17 Jul 2019 20:29:08 +0000 (23:29 +0300)]
mlxsw: spectrum: Do not process learned records with a dummy FID

The switch periodically sends notifications about learned FDB entries.
Among other things, the notification includes the FID (Filtering
Identifier) and the port on which the MAC was learned.

In case the driver does not have the FID defined on the relevant port,
the following error will be periodically generated:

mlxsw_spectrum2 0000:06:00.0 swp32: Failed to find a matching {Port, VID} following FDB notification

This is not supposed to happen under normal conditions, but can happen
if an ingress tc filter with a redirect action is installed on a bridged
port. The redirect action will cause the packet's FID to be changed to
the dummy FID and a learning notification will be emitted with this FID
- which is not defined on the bridged port.

Fix this by having the driver ignore learning notifications generated
with the dummy FID and delete them from the device.

Another option is to chain an ignore action after the redirect action
which will cause the device to disable learning, but this means that we
need to consume another action whenever a redirect action is used. In
addition, the scenario described above is merely a corner case.

Fixes: cedbb8b25948 ("mlxsw: spectrum_flower: Set dummy FID before forward action")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomlxsw: spectrum_dcb: Configure DSCP map as the last rule is removed
Petr Machata [Wed, 17 Jul 2019 20:29:07 +0000 (23:29 +0300)]
mlxsw: spectrum_dcb: Configure DSCP map as the last rule is removed

Spectrum systems use DSCP rewrite map to update DSCP field in egressing
packets to correspond to priority that the packet has. Whether rewriting
will take place is determined at the point when the packet ingresses the
switch: if the port is in Trust L3 mode, packet priority is determined from
the DSCP map at the port, and DSCP rewrite will happen. If the port is in
Trust L2 mode, 802.1p is used for packet prioritization, and no DSCP
rewrite will happen.

The driver determines the port trust mode based on whether any DSCP
prioritization rules are in effect at given port. If there are any, trust
level is L3, otherwise it's L2. When the last DSCP rule is removed, the
port is switched to trust L2. Under that scenario, if DSCP of a packet
should be rewritten, it should be rewritten to 0.

However, when switching to Trust L2, the driver neglects to also update the
DSCP rewrite map. The last DSCP rule thus remains in effect, and packets
egressing through this port, if they have the right priority, will have
their DSCP set according to this rule.

Fix by first configuring the rewrite map, and only then switching to trust
L2 and bailing out.

Fixes: b2b1dab6884e ("mlxsw: spectrum: Support ieee_setapp, ieee_delapp")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reported-by: Alex Veber <alexve@mellanox.com>
Tested-by: Alex Veber <alexve@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ag71xx: Add missing header
Rosen Penev [Wed, 17 Jul 2019 19:46:45 +0000 (12:46 -0700)]
net: ag71xx: Add missing header

ag71xx uses devm_ioremap_nocache. This fixes usage of an implicit function

Fixes: d51b6ce441d3 ("net: ethernet: add ag71xx driver")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet_sched: unset TCQ_F_CAN_BYPASS when adding filters
Cong Wang [Tue, 16 Jul 2019 20:57:30 +0000 (13:57 -0700)]
net_sched: unset TCQ_F_CAN_BYPASS when adding filters

For qdisc's that support TC filters and set TCQ_F_CAN_BYPASS,
notably fq_codel, it makes no sense to let packets bypass the TC
filters we setup in any scenario, otherwise our packets steering
policy could not be enforced.

This can be reproduced easily with the following script:

 ip li add dev dummy0 type dummy
 ifconfig dummy0 up
 tc qd add dev dummy0 root fq_codel
 tc filter add dev dummy0 parent 8001: protocol arp basic action mirred egress redirect dev lo
 tc filter add dev dummy0 parent 8001: protocol ip basic action mirred egress redirect dev lo
 ping -I dummy0 192.168.112.1

Without this patch, packets are sent directly to dummy0 without
hitting any of the filters. With this patch, packets are redirected
to loopback as expected.

This fix is not perfect, it only unsets the flag but does not set it back
because we have to save the information somewhere in the qdisc if we
really want that. Note, both fq_codel and sfq clear this flag in their
->bind_tcf() but this is clearly not sufficient when we don't use any
class ID.

Fixes: 23624935e0c4 ("net_sched: TCQ_F_CAN_BYPASS generalization")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'net-rds-RDMA-fixes'
David S. Miller [Wed, 17 Jul 2019 19:06:52 +0000 (12:06 -0700)]
Merge branch 'net-rds-RDMA-fixes'

Gerd Rausch says:

====================
net/rds: RDMA fixes

A number of net/rds fixes necessary to make "rds_rdma.ko"
pass some basic Oracle internal tests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Initialize ic->i_fastreg_wrs upon allocation
Gerd Rausch [Tue, 16 Jul 2019 22:29:23 +0000 (15:29 -0700)]
net/rds: Initialize ic->i_fastreg_wrs upon allocation

Otherwise, if an IB connection is torn down before "rds_ib_setup_qp"
is called, the value of "ic->i_fastreg_wrs" is still at zero
(as it wasn't initialized by "rds_ib_setup_qp").
Consequently "rds_ib_conn_path_shutdown" will spin forever,
waiting for it to go back to "RDS_IB_DEFAULT_FR_WR",
which of course will never happen as there are no
outstanding work requests.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Keep track of and wait for FRWR segments in use upon shutdown
Gerd Rausch [Tue, 16 Jul 2019 22:29:17 +0000 (15:29 -0700)]
net/rds: Keep track of and wait for FRWR segments in use upon shutdown

Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put
the FRMR memory segments on the "drop_list" or "free_list",
and it is the job of "rds_ib_flush_mr_pool" to reap those entries
by ultimately issuing a "IB_WR_LOCAL_INV" work-request,
we need to trigger and then wait for all those memory segments
attached to a particular connection to be fully released before
we can move on to release the QP, CQ, etc.

So we make "rds_ib_conn_path_shutdown" wait for one more
atomic_t called "i_fastreg_inuse_count" that keeps track of how
many FRWR memory segments are out there marked "FRMR_IS_INUSE"
(and also wake_up rds_ib_ring_empty_wait, as they go away).

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been successful
Gerd Rausch [Tue, 16 Jul 2019 22:29:12 +0000 (15:29 -0700)]
net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been successful

Fix a bug where fr_state first goes to FRMR_IS_STALE, because of a failure
of operation IB_WR_LOCAL_INV, but then gets set back to "FRMR_IS_FREE"
uncoditionally, even though the operation failed.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Fix NULL/ERR_PTR inconsistency
Gerd Rausch [Tue, 16 Jul 2019 22:29:07 +0000 (15:29 -0700)]
net/rds: Fix NULL/ERR_PTR inconsistency

Make function "rds_ib_try_reuse_ibmr" return NULL in case
memory region could not be allocated, since callers
simply check if the return value is not NULL.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after posting IB_WR_...
Gerd Rausch [Tue, 16 Jul 2019 22:29:02 +0000 (15:29 -0700)]
net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after posting IB_WR_LOCAL_INV

In order to:
1) avoid a silly bouncing between "clean_list" and "drop_list"
   triggered by function "rds_ib_reg_frmr" as it is releases frmr
   regions whose state is not "FRMR_IS_FREE" right away.

2) prevent an invalid access error in a race from a pending
   "IB_WR_LOCAL_INV" operation with a teardown ("dma_unmap_sg", "put_page")
   and de-registration ("ib_dereg_mr") of the corresponding
   memory region.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Get rid of "wait_clean_list_grace" and add locking
Gerd Rausch [Tue, 16 Jul 2019 22:28:57 +0000 (15:28 -0700)]
net/rds: Get rid of "wait_clean_list_grace" and add locking

Waiting for activity on the "clean_list" to quiesce is no substitute
for proper locking.

We can have multiple threads competing for "llist_del_first"
via "rds_ib_reuse_mr", and a single thread competing
for "llist_del_all" and "llist_del_first" via "rds_ib_flush_mr_pool".

Since "llist_del_first" depends on "list->first->next" not to change
in the midst of the operation, simply waiting for all current calls
to "rds_ib_reuse_mr" to quiesce across all CPUs is woefully inadequate:

By the time "wait_clean_list_grace" is done iterating over all CPUs to see
that there is no concurrent caller to "rds_ib_reuse_mr", a new caller may
have just shown up on the first CPU.

Furthermore, <linux/llist.h> explicitly calls out the need for locking:
 * Cases where locking is needed:
 * If we have multiple consumers with llist_del_first used in one consumer,
 * and llist_del_first or llist_del_all used in other consumers,
 * then a lock is needed.

Also, while at it, drop the unused "pool" parameter
from "list_to_llist_nodes".

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/rds: Give fr_state a chance to transition to FRMR_IS_FREE
Gerd Rausch [Tue, 16 Jul 2019 22:28:51 +0000 (15:28 -0700)]
net/rds: Give fr_state a chance to transition to FRMR_IS_FREE

In the context of FRMR (ib_frmr.c):

Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool",
after the memory region has been posted for invalidation via
"rds_ib_post_inv".

At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE",
since the only place where "fr_state" transitions to "FRMR_IS_FREE"
is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet.

So in case we notice that "fr_state != FRMR_IS_FREE" (see below),
we wait for "fr_inv_done" to trigger with a maximum of 10msec.
Then we check again, and only put the memory region onto the drop_list
(via "rds_ib_free_frmr") in case the situation remains unchanged.

This avoids the problem of memory-regions bouncing between "clean_list"
and "drop_list" before they even have a chance to be properly invalidated.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/sched: Make NET_ACT_CT depends on NF_NAT
YueHaibing [Tue, 16 Jul 2019 07:16:02 +0000 (15:16 +0800)]
net/sched: Make NET_ACT_CT depends on NF_NAT

If NF_NAT is m and NET_ACT_CT is y, build fails:

net/sched/act_ct.o: In function `tcf_ct_act':
act_ct.c:(.text+0x21ac): undefined reference to `nf_ct_nat_ext_add'
act_ct.c:(.text+0x229a): undefined reference to `nf_nat_icmp_reply_translation'
act_ct.c:(.text+0x233a): undefined reference to `nf_nat_setup_info'
act_ct.c:(.text+0x234a): undefined reference to `nf_nat_alloc_null_binding'
act_ct.c:(.text+0x237c): undefined reference to `nf_nat_packet'

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: sctp: fix warning "NULL check before some freeing functions is not needed"
Hariprasad Kelam [Tue, 16 Jul 2019 02:20:02 +0000 (07:50 +0530)]
net: sctp: fix warning "NULL check before some freeing functions is not needed"

This patch removes NULL checks before calling kfree.

fixes below issues reported by coccicheck
net/sctp/sm_make_chunk.c:2586:3-8: WARNING: NULL check before some
freeing functions is not needed.
net/sctp/sm_make_chunk.c:2652:3-8: WARNING: NULL check before some
freeing functions is not needed.
net/sctp/sm_make_chunk.c:2667:3-8: WARNING: NULL check before some
freeing functions is not needed.
net/sctp/sm_make_chunk.c:2684:3-8: WARNING: NULL check before some
freeing functions is not needed.

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agocaif-hsi: fix possible deadlock in cfhsi_exit_module()
Taehee Yoo [Mon, 15 Jul 2019 05:10:17 +0000 (14:10 +0900)]
caif-hsi: fix possible deadlock in cfhsi_exit_module()

cfhsi_exit_module() calls unregister_netdev() under rtnl_lock().
but unregister_netdev() internally calls rtnl_lock().
So deadlock would occur.

Fixes: c41254006377 ("caif-hsi: Add rtnl support")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests/bpf: fix perf_buffer on s390
Ilya Leoshkevich [Tue, 16 Jul 2019 12:58:27 +0000 (14:58 +0200)]
selftests/bpf: fix perf_buffer on s390

perf_buffer test fails for exactly the same reason test_attach_probe
used to fail: different nanosleep syscall kprobe name.

Reuse the test_attach_probe fix.

Fixes: ee5cf82ce04a ("selftests/bpf: test perf buffer API")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: structure test_{progs, maps, verifier} test runners uniformly
Andrii Nakryiko [Tue, 16 Jul 2019 19:38:37 +0000 (12:38 -0700)]
selftests/bpf: structure test_{progs, maps, verifier} test runners uniformly

It's easier to follow the logic if it's structured the same.
There is just slight difference between test_progs/test_maps and
test_verifier. test_verifier's verifier/*.c files are not really compilable
C files (they are more of include headers), so they can't be specified as
explicit dependencies of test_verifier.

Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: fix test_verifier/test_maps make dependencies
Andrii Nakryiko [Tue, 16 Jul 2019 19:38:36 +0000 (12:38 -0700)]
selftests/bpf: fix test_verifier/test_maps make dependencies

e46fc22e60a4 ("selftests/bpf: make directory prerequisites order-only")
exposed existing problem in Makefile for test_verifier and test_maps tests:
their dependency on auto-generated header file with a list of all tests wasn't
recorded explicitly. This patch fixes these issues.

Fixes: 51a0e301a563 ("bpf: Add BPF_MAP_TYPE_SK_STORAGE test to test_maps")
Fixes: 6b7b6995c43e ("selftests: bpf: tests.h should depend on .c files, not the output")
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agofix: taprio: Change type of txtime-delay parameter to u32
Vedang Patel [Tue, 16 Jul 2019 19:52:18 +0000 (12:52 -0700)]
fix: taprio: Change type of txtime-delay parameter to u32

During the review of the iproute2 patches for txtime-assist mode, it was
pointed out that it does not make sense for the txtime-delay parameter to
be negative. So, change the type of the parameter from s32 to u32.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoskbuff: fix compilation warnings in skb_dump()
Qian Cai [Tue, 16 Jul 2019 15:43:05 +0000 (11:43 -0400)]
skbuff: fix compilation warnings in skb_dump()

The commit 6413139dfc64 ("skbuff: increase verbosity when dumping skb
data") introduced a few compilation warnings.

net/core/skbuff.c:766:32: warning: format specifies type 'unsigned
short' but the argument has type 'unsigned int' [-Wformat]
                       level, sk->sk_family, sk->sk_type,
sk->sk_protocol);
                                             ^~~~~~~~~~~
net/core/skbuff.c:766:45: warning: format specifies type 'unsigned
short' but the argument has type 'unsigned int' [-Wformat]
                       level, sk->sk_family, sk->sk_type,
sk->sk_protocol);
^~~~~~~~~~~~~~~

Fix them by using the proper types.

Fixes: 6413139dfc64 ("skbuff: increase verbosity when dumping skb data")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobe2net: Signal that the device cannot transmit during reconfiguration
Benjamin Poirier [Tue, 16 Jul 2019 08:16:55 +0000 (17:16 +0900)]
be2net: Signal that the device cannot transmit during reconfiguration

While changing the number of interrupt channels, be2net stops adapter
operation (including netif_tx_disable()) but it doesn't signal that it
cannot transmit. This may lead dev_watchdog() to falsely trigger during
that time.

Add the missing call to netif_carrier_off(), following the pattern used in
many other drivers. netif_carrier_on() is already taken care of in
be_open().

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mediatek: mtk_eth_soc: Add of_node_put() before goto
Nishka Dasgupta [Tue, 16 Jul 2019 05:55:04 +0000 (11:25 +0530)]
net: ethernet: mediatek: mtk_eth_soc: Add of_node_put() before goto

Each iteration of for_each_child_of_node puts the previous node, but in
the case of a goto from the middle of the loop, there is no put, thus
causing a memory leak. Hence add an of_node_put before the goto.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: mscc: ocelot_board: Add of_node_put() before return
Nishka Dasgupta [Tue, 16 Jul 2019 05:52:19 +0000 (11:22 +0530)]
net: ethernet: mscc: ocelot_board: Add of_node_put() before return

Each iteration of for_each_available_child_of_node puts the previous
node, but in the case of a return from the middle of the loop, there is
no put, thus causing a memory leak. Hence add an of_node_put before the
return in two places.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: ethernet: ti: cpsw: Add of_node_put() before return and break
Nishka Dasgupta [Tue, 16 Jul 2019 05:48:43 +0000 (11:18 +0530)]
net: ethernet: ti: cpsw: Add of_node_put() before return and break

Each iteration of for_each_available_child_of_node puts the previous
node, but in the case of a return or break from the middle of the loop,
there is no put, thus causing a memory leak.
Hence, for function cpsw_probe_dt, create an extra label err_node_put
that puts the last used node and returns ret; modify the return
statements in the loop to save the return value in ret and goto this new
label.
For function cpsw_remove_dt, add an of_node_put before the break.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agolibbpf: fix another GCC8 warning for strncpy
Andrii Nakryiko [Tue, 16 Jul 2019 03:57:03 +0000 (20:57 -0700)]
libbpf: fix another GCC8 warning for strncpy

Similar issue was fixed in cdfc7f888c2a ("libbpf: fix GCC8 warning for
strncpy") already. This one was missed. Fixing now.

Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: skip nmi test when perf hw events are disabled
Ilya Leoshkevich [Tue, 16 Jul 2019 10:56:34 +0000 (12:56 +0200)]
selftests/bpf: skip nmi test when perf hw events are disabled

Some setups (e.g. virtual machines) might run with hardware perf events
disabled. If this is the case, skip the test_send_signal_nmi test.

Add a separate test involving a software perf event. This allows testing
the perf event path regardless of hardware perf event support.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoselftests/bpf: fix "alu with different scalars 1" on s390
Ilya Leoshkevich [Tue, 16 Jul 2019 10:53:53 +0000 (12:53 +0200)]
selftests/bpf: fix "alu with different scalars 1" on s390

BPF_LDX_MEM is used to load the least significant byte of the retrieved
test_val.index, however, on big-endian machines it ends up retrieving
the most significant byte.

Change the test to load the whole int in order to make it
endianness-independent.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge tag 'mlx5-fixes-2019-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git...
David S. Miller [Tue, 16 Jul 2019 00:18:15 +0000 (17:18 -0700)]
Merge tag 'mlx5-fixes-2019-07-15' of git://git./linux/kernel/git/saeed/linux

Saeed Mameed says:

====================
Mellanox, mlx5 fixes 2019-07-15

This pull request provides mlx5 TC flower and tunnel fixes for kernel
5.2 from Eli and Vlad.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests/bpf: remove logic duplication in test_verifier
Andrii Nakryiko [Fri, 12 Jul 2019 17:44:41 +0000 (10:44 -0700)]
selftests/bpf: remove logic duplication in test_verifier

test_verifier tests can specify single- and multi-runs tests. Internally
logic of handling them is duplicated. Get rid of it by making single run
retval/data specification to be a first run spec.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoMerge branch 'bpf-fix-wide-loads-sockaddr'
Daniel Borkmann [Mon, 15 Jul 2019 21:15:53 +0000 (23:15 +0200)]
Merge branch 'bpf-fix-wide-loads-sockaddr'

Stanislav Fomichev says:

====================
When fixing selftests by adding support for wide stores, Yonghong
reported that he had seen some examples where clang generates
single u64 loads for two adjacent u32s as well:

http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com

Fix this to support aligned u64 reads for some bpf_sock_addr fields
as well.
====================

Acked-by: Andrii Narkyiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: sync bpf.h to tools/
Stanislav Fomichev [Mon, 15 Jul 2019 16:39:56 +0000 (09:39 -0700)]
bpf: sync bpf.h to tools/

Update bpf_sock_addr comments to indicate support for 8-byte reads
from user_ip6 and msg_src_ip6.

Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: add selftests for wide loads
Stanislav Fomichev [Mon, 15 Jul 2019 16:39:55 +0000 (09:39 -0700)]
selftests/bpf: add selftests for wide loads

Mirror existing wide store tests with wide loads. The only significant
difference is expected error string.

Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
Stanislav Fomichev [Mon, 15 Jul 2019 16:39:54 +0000 (09:39 -0700)]
selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c

Move the file and rename internal BPF_SOCK_ADDR define to
BPF_SOCK_ADDR_STORE. This selftest will be extended in the next commit
with the wide loads.

Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: allow wide aligned loads for bpf_sock_addr user_ip6 and msg_src_ip6
Stanislav Fomichev [Mon, 15 Jul 2019 16:39:53 +0000 (09:39 -0700)]
bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and msg_src_ip6

Add explicit check for u64 loads of user_ip6 and msg_src_ip6 and
update the comment.

Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
Stanislav Fomichev [Mon, 15 Jul 2019 16:39:52 +0000 (09:39 -0700)]
bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok

Rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok to indicate
that it can be used for both loads and stores.

Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agosamples/bpf: build with -D__TARGET_ARCH_$(SRCARCH)
Ilya Leoshkevich [Mon, 15 Jul 2019 09:11:03 +0000 (11:11 +0200)]
samples/bpf: build with -D__TARGET_ARCH_$(SRCARCH)

While $ARCH can be relatively flexible (see Makefile and
tools/scripts/Makefile.arch), $SRCARCH always corresponds to a directory
name under arch/.

Therefore, build samples with -D__TARGET_ARCH_$(SRCARCH), since that
matches the expectations of bpf_helpers.h.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: put test_stub.o into $(OUTPUT)
Ilya Leoshkevich [Fri, 12 Jul 2019 13:59:50 +0000 (15:59 +0200)]
selftests/bpf: put test_stub.o into $(OUTPUT)

Add a rule to put test_stub.o in $(OUTPUT) and change the references to
it accordingly. This prevents test_stub.o from being created in the
source directory.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: make directory prerequisites order-only
Ilya Leoshkevich [Fri, 12 Jul 2019 13:56:31 +0000 (15:56 +0200)]
selftests/bpf: make directory prerequisites order-only

When directories are used as prerequisites in Makefiles, they can cause
a lot of unnecessary rebuilds, because a directory is considered changed
whenever a file in this directory is added, removed or modified.

If the only thing a target is interested in is the existence of the
directory it depends on, which is the case for selftests/bpf, this
directory should be specified as an order-only prerequisite: it would
still be created in case it does not exist, but it would not trigger a
rebuild of a target in case it's considered changed.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: fix attach_probe on s390
Ilya Leoshkevich [Fri, 12 Jul 2019 13:41:42 +0000 (15:41 +0200)]
selftests/bpf: fix attach_probe on s390

attach_probe test fails, because it cannot install a kprobe on a
non-existent sys_nanosleep symbol.

Use the correct symbol name for the nanosleep syscall on 64-bit s390.
Don't bother adding one for 31-bit mode, since tests are compiled only
in 64-bit mode.

Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoMerge branch 'bpf-btf-size-verification-fix'
Daniel Borkmann [Mon, 15 Jul 2019 21:02:17 +0000 (23:02 +0200)]
Merge branch 'bpf-btf-size-verification-fix'

Andrii Nakryiko says:

====================
BTF size resolution logic isn't always resolving type size correctly, leading
to erroneous map creation failures due to value size mismatch.

This patch set:
1. fixes the issue (patch #1);
2. adds tests for trickier cases (patch #2);
3. and converts few test cases utilizing BTF-defined maps, that previously
   couldn't use typedef'ed arrays due to kernel bug (patch #3).
====================

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: use typedef'ed arrays as map values
Andrii Nakryiko [Fri, 12 Jul 2019 17:25:57 +0000 (10:25 -0700)]
selftests/bpf: use typedef'ed arrays as map values

Convert few tests that couldn't use typedef'ed arrays due to kernel bug.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: add trickier size resolution tests
Andrii Nakryiko [Fri, 12 Jul 2019 17:25:56 +0000 (10:25 -0700)]
selftests/bpf: add trickier size resolution tests

Add more BTF tests, validating that size resolution logic is correct in
few trickier cases.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agobpf: fix BTF verifier size resolution logic
Andrii Nakryiko [Fri, 12 Jul 2019 17:25:55 +0000 (10:25 -0700)]
bpf: fix BTF verifier size resolution logic

BTF verifier has a size resolution bug which in some circumstances leads to
invalid size resolution for, e.g., TYPEDEF modifier.  This happens if we have
[1] PTR -> [2] TYPEDEF -> [3] ARRAY, in which case due to being in pointer
context ARRAY size won't be resolved (because for pointer it doesn't matter, so
it's a sink in pointer context), but it will be permanently remembered as zero
for TYPEDEF and TYPEDEF will be marked as RESOLVED. Eventually ARRAY size will
be resolved correctly, but TYPEDEF resolved_size won't be updated anymore.
This, subsequently, will lead to erroneous map creation failure, if that
TYPEDEF is specified as either key or value, as key_size/value_size won't
correspond to resolved size of TYPEDEF (kernel will believe it's zero).

Note, that if BTF was ordered as [1] ARRAY <- [2] TYPEDEF <- [3] PTR, this
won't be a problem, as by the time we get to TYPEDEF, ARRAY's size is already
calculated and stored.

This bug manifests itself in rejecting BTF-defined maps that use array
typedef as a value type:

typedef int array_t[16];

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(value, array_t); /* i.e., array_t *value; */
} test_map SEC(".maps");

The fix consists on not relying on modifier's resolved_size and instead using
modifier's resolved_id (type ID for "concrete" type to which modifier
eventually resolves) and doing size determination for that resolved type. This
allow to preserve existing "early DFS termination" logic for PTR or
STRUCT_OR_ARRAY contexts, but still do correct size determination for modifier
types.

Fixes: eb3f595dab40 ("bpf: btf: Validate type reference")
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agonet/mlx5e: Allow dissector meta key in tc flower
Vlad Buslov [Thu, 11 Jul 2019 10:03:48 +0000 (13:03 +0300)]
net/mlx5e: Allow dissector meta key in tc flower

Recently, fl_flow_key->indev_ifindex int field was refactored into
flow_dissector_key_meta field. With this, flower classifier also sets
FLOW_DISSECTOR_KEY_META flow dissector key. However, mlx5 flower dissector
validation code rejects filters that use flow dissector keys that are not
supported. Add FLOW_DISSECTOR_KEY_META to the list of allowed dissector
keys in __parse_cls_flower() to prevent following error when offloading
flower classifier to mlx5:

Error: mlx5_core: Unsupported key.

Fixes: 8212ed777f40 ("net: sched: cls_flower: use flow_dissector for ingress ifindex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Rely on filter_dev instead of dissector keys for tunnels
Vlad Buslov [Mon, 8 Jul 2019 14:02:36 +0000 (17:02 +0300)]
net/mlx5e: Rely on filter_dev instead of dissector keys for tunnels

Currently, tunnel attributes are parsed and inner header matching is used
only when flow dissector specifies match on some of the supported
encapsulation fields. When user tries to offload tc filter that doesn't
match any encapsulation fields on tunnel device, mlx5 tc layer incorrectly
sets to match packet header keys on encap header (outer header) and
firmware rejects the rule with syndrome 0x7e1579 when creating new flow
group.

Change __parse_cls_flower() to determine whether tunnel is used based on
fitler_dev tunnel info, instead of determining it indirectly by checking
flow dissector enc keys.

Fixes: bbd00f7e2349 ("net/mlx5e: Add TC tunnel release action for SRIOV offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Verify encapsulation is supported
Eli Cohen [Thu, 27 Jun 2019 05:34:41 +0000 (08:34 +0300)]
net/mlx5e: Verify encapsulation is supported

When mlx5e_attach_encap() calls mlx5e_get_tc_tun() to get the tunnel
info data struct, check that returned value is not NULL, as would be in
the case of unsupported encapsulation.

Fixes: d386939a327d2 ("net/mlx5e: Rearrange tc tunnel code in a modular way")
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoISDN: hfcsusb: checking idx of ep configuration
Phong Tran [Mon, 15 Jul 2019 15:08:14 +0000 (22:08 +0700)]
ISDN: hfcsusb: checking idx of ep configuration

The syzbot test with random endpoint address which made the idx is
overflow in the table of endpoint configuations.

this adds the checking for fixing the error report from
syzbot

KASAN: stack-out-of-bounds Read in hfcsusb_probe [1]
The patch tested by syzbot [2]

Reported-by: syzbot+8750abbc3a46ef47d509@syzkaller.appspotmail.com
[1]:
https://syzkaller.appspot.com/bug?id=30a04378dac680c5d521304a00a86156bb913522
[2]:
https://groups.google.com/d/msg/syzkaller-bugs/_6HBdge8F3E/OJn7wVNpBAAJ

Signed-off-by: Phong Tran <tranmanphong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovmxnet3: Remove call to memset after dma_alloc_coherent
Fuqian Huang [Mon, 15 Jul 2019 03:21:18 +0000 (11:21 +0800)]
vmxnet3: Remove call to memset after dma_alloc_coherent

In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agohippi: Remove call to memset after pci_alloc_consistent
Fuqian Huang [Mon, 15 Jul 2019 03:19:21 +0000 (11:19 +0800)]
hippi: Remove call to memset after pci_alloc_consistent

pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoethernet: remove redundant memset
Fuqian Huang [Mon, 15 Jul 2019 03:19:11 +0000 (11:19 +0800)]
ethernet: remove redundant memset

kvzalloc already zeroes the memory during the allocation.
pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So the memset after these function is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoatm: idt77252: Remove call to memset after dma_alloc_coherent
Fuqian Huang [Mon, 15 Jul 2019 03:17:09 +0000 (11:17 +0800)]
atm: idt77252: Remove call to memset after dma_alloc_coherent

In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: neigh: fix multiple neigh timer scheduling
Lorenzo Bianconi [Sun, 14 Jul 2019 21:36:11 +0000 (23:36 +0200)]
net: neigh: fix multiple neigh timer scheduling

Neigh timer can be scheduled multiple times from userspace adding
multiple neigh entries and forcing the neigh timer scheduling passing
NTF_USE in the netlink requests.
This will result in a refcount leak and in the following dump stack:

[   32.465295] NEIGH: BUG, double timer add, state is 8
[   32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65
[   32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[   32.465313] Call Trace:
[   32.465318]  dump_stack+0x7c/0xc0
[   32.465323]  __neigh_event_send+0x20c/0x880
[   32.465326]  ? ___neigh_create+0x846/0xfb0
[   32.465329]  ? neigh_lookup+0x2a9/0x410
[   32.465332]  ? neightbl_fill_info.constprop.0+0x800/0x800
[   32.465334]  neigh_add+0x4f8/0x5e0
[   32.465337]  ? neigh_xmit+0x620/0x620
[   32.465341]  ? find_held_lock+0x85/0xa0
[   32.465345]  rtnetlink_rcv_msg+0x204/0x570
[   32.465348]  ? rtnl_dellink+0x450/0x450
[   32.465351]  ? mark_held_locks+0x90/0x90
[   32.465354]  ? match_held_lock+0x1b/0x230
[   32.465357]  netlink_rcv_skb+0xc4/0x1d0
[   32.465360]  ? rtnl_dellink+0x450/0x450
[   32.465363]  ? netlink_ack+0x420/0x420
[   32.465366]  ? netlink_deliver_tap+0x115/0x560
[   32.465369]  ? __alloc_skb+0xc9/0x2f0
[   32.465372]  netlink_unicast+0x270/0x330
[   32.465375]  ? netlink_attachskb+0x2f0/0x2f0
[   32.465378]  netlink_sendmsg+0x34f/0x5a0
[   32.465381]  ? netlink_unicast+0x330/0x330
[   32.465385]  ? move_addr_to_kernel.part.0+0x20/0x20
[   32.465388]  ? netlink_unicast+0x330/0x330
[   32.465391]  sock_sendmsg+0x91/0xa0
[   32.465394]  ___sys_sendmsg+0x407/0x480
[   32.465397]  ? copy_msghdr_from_user+0x200/0x200
[   32.465401]  ? _raw_spin_unlock_irqrestore+0x37/0x40
[   32.465404]  ? lockdep_hardirqs_on+0x17d/0x250
[   32.465407]  ? __wake_up_common_lock+0xcb/0x110
[   32.465410]  ? __wake_up_common+0x230/0x230
[   32.465413]  ? netlink_bind+0x3e1/0x490
[   32.465416]  ? netlink_setsockopt+0x540/0x540
[   32.465420]  ? __fget_light+0x9c/0xf0
[   32.465423]  ? sockfd_lookup_light+0x8c/0xb0
[   32.465426]  __sys_sendmsg+0xa5/0x110
[   32.465429]  ? __ia32_sys_shutdown+0x30/0x30
[   32.465432]  ? __fd_install+0xe1/0x2c0
[   32.465435]  ? lockdep_hardirqs_off+0xb5/0x100
[   32.465438]  ? mark_held_locks+0x24/0x90
[   32.465441]  ? do_syscall_64+0xf/0x270
[   32.465444]  do_syscall_64+0x63/0x270
[   32.465448]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER'
receiving a netlink request with NTF_USE flag set

Reported-by: Marek Majkowski <marek@cloudflare.com>
Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>