git.openwrt.org Git - openwrt/staging/blogic.git/log

projects / openwrt / staging / blogic.git / log

stephen hemminger [Thu, 10 Aug 2017 00:46:12 +0000 (17:46 -0700)]

netvsc: keep track of some non-fatal overload conditions

Add ethtool statistics for case where send chimmeny buffer is
exhausted and driver has to fall back to doing scatter/gather
send. Also, add statistic for case where ring buffer is full and
receive completions are delayed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:11 +0000 (17:46 -0700)]

netvsc: allow controlling send/recv buffer size

Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but host API breaks
receive and send buffer into chunks. The final size of the chunks are
controlled by the host.

The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:10 +0000 (17:46 -0700)]

netvsc: remove unnecessary check for NULL hdr

The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:09 +0000 (17:46 -0700)]

netvsc: remove unnecessary cast of void pointer

Assignment to a typed pointer is sufficient in C.
No cast is needed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:08 +0000 (17:46 -0700)]

netvsc: whitespace cleanup

Fix some minor indentation issues.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:07 +0000 (17:46 -0700)]

netvsc: no need to allocate send/receive on numa node

The send and receive buffers are both per-device (not per-channel).
The associated NUMA node is a property of the CPU which is per-channel
therefore it makes no sense to force the receive/send buffer to be
allocated on a particular node (since it is a shared resource).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:06 +0000 (17:46 -0700)]

netvsc: check error return when restoring channels and mtu

If setting new values fails, and the attempt to restore original
settings fails. Then log an error and leave device down.
This should never happen, but if it does don't go down in flames.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:05 +0000 (17:46 -0700)]

netvsc: propagate MAC address change to VF slave

If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.

If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.

It also fixes the error unwind in the original code.
If give a bad address, the old code would change the device
MAC address anyway.

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:04 +0000 (17:46 -0700)]

netvsc: don't signal host twice if empty

When hv_pkt_iter_next() returns NULL, it has already called
hv_pkt_iter_close(). Calling it twice can lead to extra host signal.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

stephen hemminger [Thu, 10 Aug 2017 00:46:03 +0000 (17:46 -0700)]

netvsc: delay setup of VF device

When VF device is discovered, delay bring it automatically up in
order to allow userspace to some simple changes (like renaming).

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Dan Carpenter [Wed, 9 Aug 2017 21:35:50 +0000 (00:35 +0300)]

phylink: Fix an uninitialized variable bug

"ret" isn't necessarily initialized here.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Intiyaz Basha [Wed, 9 Aug 2017 20:28:04 +0000 (13:28 -0700)]

liquidio: removed check for queue size alignment

There is no restriction on queue size alignment. Hence removing check for
valid queue size.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Intiyaz Basha [Wed, 9 Aug 2017 19:07:08 +0000 (12:07 -0700)]

liquidio: rx/tx queue cleanup

When deleting a queue, clear its corresponding bit in the qmask, vfree its
memory, clear out the pointer that's pointing to it, and decrement the
queue count.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <fmanlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Fri, 11 Aug 2017 20:47:01 +0000 (13:47 -0700)]

Merge branch 'net-sched-let-the-offloader-decide-what-to-offload'

Jiri Pirko says:

====================
net: sched: let the offloader decide what to offload

Currently there is a Qdisc_class_ops->tcf_cl_offload callback
that is called to find out if cls would offload rule or not.
This is only supported by sch_ingress and sch_clsact.
So the Qdisc are to decide. However, the driver knows what is he
able to offload, so move the decision making to drivers completely.
Just pass classid there and provide set of helpers to allow
identification of qdisc.

As a side effect, this actually allows clsact egress rules
offload in mlxsw.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 9 Aug 2017 12:30:35 +0000 (14:30 +0200)]

net: sched: remove cops->tcf_cl_offload

cops->tcf_cl_offload is no longer needed, as the drivers check what they
can and cannot offload using the classid identify helpers. So remove this.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 9 Aug 2017 12:30:34 +0000 (14:30 +0200)]

net: sched: remove handle propagation down to the drivers

There is no longer need to use handle in drivers, so remove it from
tc_cls_common_offload struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 9 Aug 2017 12:30:33 +0000 (14:30 +0200)]

net: sched: use newly added classid identity helpers

Instead of checking handle, which does not have the inner class
information and drivers wrongly assume clsact->egress as ingress, use
the newly introduced classid identification helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 9 Aug 2017 12:30:32 +0000 (14:30 +0200)]

net: sched: propagate classid down to offload drivers

Drivers need classid to decide they support this specific qdisc+class
or not. So propagate it down via the tc_cls_common_offload struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jiri Pirko [Wed, 9 Aug 2017 12:30:31 +0000 (14:30 +0200)]

net: sched: Add helpers to identify classids

Offloading drivers need to understand what qdisc class a filter is added
to. Currently they only need to identify ingress, clsact->ingress and
clsact->egress. So provide these helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Girish Moodalbail [Wed, 9 Aug 2017 08:09:28 +0000 (01:09 -0700)]

geneve: use netlink_ext_ack for error reporting in rtnl operations

Add extack error messages for failure paths while creating/modifying
geneve devices. Once extack support is added to iproute2, more
meaningful and helpful error messages will be displayed making it easy
for users to discern what went wrong.

Before:

=======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
remote 192.168.13.2
RTNETLINK answers: Invalid argument

After:
======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
remote 192.168.13.2
Error: Provided link layer address is not Ethernet

Also, netdev_dbg() calls used to log errors associated with Netlink
request have been removed.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Fri, 11 Aug 2017 17:02:45 +0000 (10:02 -0700)]

Merge branch 'sctp-remove-typedefs-from-structures-part-6'

Xin Long says:

====================
sctp: remove typedefs from structures part 6

As we know, typedef is suggested not to use in kernel, even checkpatch.pl
also gives warnings about it. Now sctp is using it for many structures.

All this kind of typedef's using should be removed. This patchset is the
part 6 to remove all typedefs in include/net/sctp/structs.h, command.h
and sm.h.

Just as the part 1-5, No any code's logic would be changed in these patches,
only cleaning up.

Note that this is the last part for this typedef cleaning up. after this
patchset, no more inappropriate typedefs in sctp. It's also to tidy some
codes when removing them, like fixing many indents, reodering some local
params, especially in the last 2 patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:58 +0000 (10:23 +0800)]

sctp: fix some indents in sm_make_chunk.c

There are some bad indents of functions' defination in sm_make_chunk.c.
They have been there since beginning, it was probably caused by that
the typedef sctp_chunk_t was replaced with struct sctp_chunk.

So it's the best time to fix them in this patchset, it's also to fix
some bad indents in other functions' defination in sm_make_chunk.c.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:57 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_disposition_t

This patch is to remove the typedef sctp_disposition_t, and
replace with enum sctp_disposition in the places where it's
using this typedef.

It's also to fix the indent for many functions' defination.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:56 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_sm_table_entry_t

This patch is to remove the typedef sctp_sm_table_entry_t, and
replace with struct sctp_sm_table_entry in the places where it's
using this typedef.

It is also to fix some indents.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:55 +0000 (10:23 +0800)]

sctp: remove the unused typedef sctp_sm_command_t

Remove this typedef including the struct, there is even no places
using it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:54 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_verb_t

This patch is to remove the typedef sctp_verb_t, and
replace with enum sctp_verb in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:53 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_arg_t

This patch is to remove the typedef sctp_arg_t, and
replace with union sctp_arg in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:52 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_cmd_seq_t

This patch is to remove the typedef sctp_cmd_seq_t, and
replace with struct sctp_cmd_seq in the places where it's
using this typedef.

Note that it doesn't fix many indents although it should,
as sctp_disposition_t's removal would mess them up again.
So better to fix them when removing sctp_disposition_t in
the later patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:51 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_cmd_t

This patch is to remove the typedef sctp_cmd_t, and
replace with enum sctp_cmd in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:50 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_socket_type_t

This patch is to remove the typedef sctp_socket_type_t, and
replace with enum sctp_socket_type in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:49 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_dbg_objcnt_entry_t

This patch is to remove the typedef sctp_dbg_objcnt_entry_t, and
replace with struct sctp_dbg_objcnt_entry in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:48 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_cmsgs_t

This patch is to remove the typedef sctp_cmsgs_t, and
replace with struct sctp_cmsgs in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:47 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_endpoint_type_t

This patch is to remove the typedef sctp_endpoint_type_t, and
replace with enum sctp_endpoint_type in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:46 +0000 (10:23 +0800)]

sctp: remove the typedef sctp_sender_hb_info_t

This patch is to remove the typedef sctp_sender_hb_info_t, and
replace with struct sctp_sender_hb_info in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Fri, 11 Aug 2017 02:23:45 +0000 (10:23 +0800)]

sctp: remove the unused typedef sctp_packet_phandler_t

Remove this function typedef, there is even no places
using it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 10 Aug 2017 19:11:16 +0000 (12:11 -0700)]

Merge git://git./linux/kernel/git/davem/net

Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.

Minor context conflict in bcmsysport statistics bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Thu, 10 Aug 2017 17:30:29 +0000 (10:30 -0700)]

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix handling of initial STATE message in TIPC, from Jon Paul Maloy.

2) Fix stats handling in bcm_sysport_get_stats(), from Florian
    Fainelli.

3) Reject 16777215 VNI value in geneve_validate(), from Girish
    Moodalbail.

4) Fix initial IGMP sysctl setting regression, from Nikolay Borisov.

5) Once a UFO fragmented frame is treated as UFO, we should continue
    doing so. Likewise once a frame has been segmented, we should
    continue doing that and not try to convert it to a UFO frame. From
    Willem de Bruijn.

6) Test the AF_PACKET RX/TX ring pg_vec state under the socket lock to
    prevent races. From Willem de Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  packet: fix tp_reserve race in packet_set_ring
  udp: consistently apply ufo or fragmentation
  net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
  igmp: Fix regression caused by igmp sysctl namespace code.
  geneve: maximum value of VNI cannot be used
  net: systemport: Fix software statistics for SYSTEMPORT Lite
  tipc: remove premature ESTABLISH FSM event at link synchronization

commit | commitdiff | tree

Willem de Bruijn [Thu, 10 Aug 2017 16:41:58 +0000 (12:41 -0400)]

packet: fix tp_reserve race in packet_set_ring

Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.

This bug was discovered by syzkaller.

Fixes: 8913336a7e8d ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Willem de Bruijn [Thu, 10 Aug 2017 16:29:19 +0000 (12:29 -0400)]

udp: consistently apply ufo or fragmentation

When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.

Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.

Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.

A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.

Found by syzkaller.

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 10 Aug 2017 16:50:22 +0000 (09:50 -0700)]

Merge branch 'rtnetlink-fix-initial-rtnl-pushdown-fallout'

Florian Westphal says:

====================
rtnetlink: fix initial rtnl pushdown fallout

This series fixes various bugs and splats reported since the
allow-handler-to-run-with-no-rtnl series went in.

Last patch adds a script that can be used to add further
tests in case more bugs are reported.
In case you prefer reverting the original series instead of
fixing fallout I can resend this patch on its own.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:53:02 +0000 (16:53 +0200)]

selftests: add rtnetlink test script

add a simple script to exercise some rtnetlink call paths, so KASAN,
lockdep etc. can yell at developer before patches are sent upstream.

This can be extended to also cover bond, team, vrf and the like.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:53:01 +0000 (16:53 +0200)]

rtnetlink: fallback to UNSPEC if current family has no doit callback

We need to use PF_UNSPEC in case the requested family has no doit
callback, otherwise this now fails with EOPNOTSUPP instead of running the
unspec doit callback, as before.

Fixes: 6853dd488119 ("rtnetlink: protect handler table with rcu")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:53:00 +0000 (16:53 +0200)]

rtnetlink: init handler refcounts to 1

If using CONFIG_REFCOUNT_FULL=y we get following splat:
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 0 PID: 304 at lib/refcount.c:152 refcount_inc+0x47/0x50
Call Trace:
rtnetlink_rcv_msg+0x191/0x260
...

This warning is harmless (0 is "no callback running", not "memory
was freed").

Use '1' as the new 'no handler is running' base instead of 0 to avoid
this.

Fixes: 019a316992ee ("rtnetlink: add reference counting to prevent module unload while dump is in progress")
Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:52:59 +0000 (16:52 +0200)]

rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu

David Ahern reports following splat:
RTNL: assertion failed at net/core/dev.c (5717)
netdev_master_upper_dev_get+0x5f/0x70
if_nlmsg_size+0x158/0x240
rtnl_calcit.isra.26+0xa3/0xf0

rtnl_link_get_slave_info_data_size currently assumes RTNL protection, but
there appears to be no hard requirement for this, so use rcu instead.

At the time of this writing, there are three 'get_slave_size' callbacks
(now invoked under rcu): bond_get_slave_size, vrf_get_slave_size and
br_port_get_slave_size, all return constant only (i.e. they don't sleep).

Fixes: 6853dd488119 ("rtnetlink: protect handler table with rcu")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:52:58 +0000 (16:52 +0200)]

rtnetlink: do not use RTM_GETLINK directly

Userspace sends RTM_GETLINK type, but the kernel substracts
RTM_BASE from this, i.e. 'type' doesn't contain RTM_GETLINK
anymore but instead RTM_GETLINK - RTM_BASE.

This caused the calcit callback to not be invoked when it
should have been (and vice versa).

While at it, also fix a off-by one when checking family index. vs
handler array size.

Fixes: e1fa6d216dd ("rtnetlink: call rtnl_calcit directly")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Thu, 10 Aug 2017 14:52:57 +0000 (16:52 +0200)]

rtnetlink: use rcu_dereference_raw to silence rcu splat

Ido reports a rcu splat in __rtnl_register.
The splat is correct; as rtnl_register doesn't grab any logs
and doesn't use rcu locks either. It has always been like this.
handler families are not registered in parallel so there are no
races wrt. the kmalloc ordering.

The only reason to use rcu_dereference in the first place was to
avoid sparse from complaining about this.

Thus this switches to _raw() to not have rcu checks here.

The alternative is to add rtnl locking to register/unregister,
however, I don't see a compelling reason to do so as this has been
lockless for the past twenty years or so.

Fixes: 6853dd4881 ("rtnetlink: protect handler table with rcu")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Thu, 10 Aug 2017 16:36:06 +0000 (09:36 -0700)]

Merge git://git./linux/kernel/git/davem/sparc

Pull sparc updates from David Miller:

1) Recognize M8 cpus, just basic chip ID matching, from Allen Pais.

2) Prevent crashes when bringing up sunvdc virtual block devices in
    some environments. From Jim Quigley.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain
  sparc64: Increase max_phys_bits to 51 and VA bits to 53 for M8.
  sparc64: recognize and support sparc M8 cpu type
  sparc64: properly name the cpu constants

commit | commitdiff | tree

John Crispin [Thu, 10 Aug 2017 08:09:03 +0000 (10:09 +0200)]

net: core: fix compile error inside flow_dissector due to new dsa callback

The following error was introduced by
commit 43e665287f93 ("net-next: dsa: fix flow dissection")
due to a missing #if guard

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:448:18: error: 'struct net_device' has no member named 'dsa_ptr'
ops = skb->dev->dsa_ptr->tag_ops;
^
make[3]: *** [net/core/flow_dissector.o] Error 1

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 10 Aug 2017 05:51:47 +0000 (22:51 -0700)]

Merge branch 'dsa-flow-dissection'

John Crispin says:

====================
net-next: dsa: fix flow dissection

RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.

Changes since RFC:
* use a callback instead of static values
* add cover letter
====================

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 12:41:19 +0000 (14:41 +0200)]

net-next: dsa: fix flow dissection

RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 12:41:18 +0000 (14:41 +0200)]

net-next: tag_mtk: add flow_dissect callback to the ops struct

The MT7530 inserts the 4 magic header in between the 802.3 address and
protocol field. The patch implements the callback that can be called by
the flow dissector to figure out the real protocol and offset of the
network header. With this patch applied we can properly parse the packet
and thus make hashing function properly.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 12:41:17 +0000 (14:41 +0200)]

net-next: dsa: add flow_dissect callback to struct dsa_device_ops

When the flow dissector first sees packets coming in on a DSA devices the
802.3 header wont be located where the code expects it to be as the tag
is still present. Adding this new callback allows a DSA device to provide a
new function that the flow_dissector can use to get the correct protocol
and offset of the network header.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 12:41:16 +0000 (14:41 +0200)]

net-next: dsa: move struct dsa_device_ops to the global header file

We need to access this struct from within the flow_dissector to fix
dissection for packets coming in on DSA devices.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Xin Long [Wed, 9 Aug 2017 10:15:19 +0000 (18:15 +0800)]

net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target

Commit 55917a21d0cc ("netfilter: x_tables: add context to know if
extension runs from nft_compat") introduced a member nft_compat to
xt_tgchk_param structure.

But it didn't set it's value for ipt_init_target. With unexpected
value in par.nft_compat, it may return unexpected result in some
target's checkentry.

This patch is to set all it's fields as 0 and only initialize the
non-zero fields in ipt_init_target.

v1->v2:
As Wang Cong's suggestion, fix it by setting all it's fields as
0 and only initializing the non-zero fields.

Fixes: 55917a21d0cc ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Nikolay Borisov [Wed, 9 Aug 2017 11:38:04 +0000 (14:38 +0300)]

igmp: Fix regression caused by igmp sysctl namespace code.

Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is only called as part of per-namespace initialization, only if
CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
compiled out, casuing the igmp pernet ops to not be registerd and those sysctl
being left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization into inet_init_net, that way they will always have
sane values.

Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz386@gmail.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Thu, 10 Aug 2017 05:45:36 +0000 (22:45 -0700)]

Merge branch 'mediatek-bring-up-QDMA-RX-ring-0'

John Crispin says:

====================
net-next: mediatek: bring up QDMA RX ring 0

The MT7623 has several DMA rings. Inside the SW path, the core will use
the PDMA when receiving traffic. While bringing up the HW path we noticed
that the PPE requires the QDMA RX to also be brought up as it uses this
ring internally for its flow scheduling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 10:09:32 +0000 (12:09 +0200)]

net-next: mediatek: bring up QDMA RX ring 0

This patch is in preparation for adding HW flow and QoS offloading. For
those features to work, the driver needs to bring up the first QDMA RX
ring. This ring is used by the PPE offloading HW.

Signed-off-by: John Crisp in <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

John Crispin [Wed, 9 Aug 2017 10:09:31 +0000 (12:09 +0200)]

net-next: mediatek: fix typos inside the header file

Trivial patch fixing 2 typos.

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Bhumika Goyal [Wed, 9 Aug 2017 09:32:08 +0000 (15:02 +0530)]

net: atm: make atmdev_ops const

Make these const as they are only stored in the ops field of a atm_dev
structure, which is const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Bhumika Goyal [Wed, 9 Aug 2017 09:19:15 +0000 (14:49 +0530)]

atm: make atmdev_ops const

Make these structures const as they are either passed to the function
atm_dev_register having the corresponding argument as const or stored in
the ops field of a atm_dev structure, which is also const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Bhumika Goyal [Wed, 9 Aug 2017 05:04:15 +0000 (10:34 +0530)]

net: dsa: make dsa_switch_ops const

Make these structures const as they are only stored in the ops field of
a dsa_switch structure, which is const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Intiyaz Basha [Wed, 9 Aug 2017 02:34:28 +0000 (19:34 -0700)]

liquidio: napi cleanup

Disable napi when interface is going down.
Delete napi when destroying the interface.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Girish Moodalbail [Wed, 9 Aug 2017 00:26:24 +0000 (17:26 -0700)]

geneve: maximum value of VNI cannot be used

Geneve's Virtual Network Identifier (VNI) is 24 bit long, so the range
of values for it would be from 0 to 16777215 (2^24 -1). However, one
cannot create a geneve device with VNI set to 16777215. This patch fixes
this issue.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David Ahern [Tue, 8 Aug 2017 21:51:02 +0000 (15:51 -0600)]

net: ipv6: lower ndisc notifier priority below addrconf

ndisc_notify is used to send unsolicited neighbor advertisements
(e.g., on a link up). Currently, the ndisc notifier is run before the
addrconf notifer which means NA's are not sent for link-local addresses
which are added by the addrconf notifier.

Fix by lowering the priority of the ndisc notifier. Setting the priority
to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Fainelli [Tue, 8 Aug 2017 21:45:09 +0000 (14:45 -0700)]

net: systemport: Fix software statistics for SYSTEMPORT Lite

With SYSTEMPORT Lite we have holes in our statistics layout that make us
skip over the hardware MIB counters, bcm_sysport_get_stats() was not
taking that into account resulting in reporting 0 for all SW-maintained
statistics, fix this by skipping accordingly.

Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jon Paul Maloy [Tue, 8 Aug 2017 20:23:56 +0000 (22:23 +0200)]

tipc: remove premature ESTABLISH FSM event at link synchronization

When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Nathan Fontenot [Tue, 8 Aug 2017 20:26:18 +0000 (15:26 -0500)]

ibmvnic: Correct 'unused variable' warning in build.

Commit a248878d7a1d ("ibmvnic: Check for transport event on driver resume")
removed the loop to kick irqs on driver resume but didn't remove the now
unused loop variable 'i'.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Nathan Fontenot [Tue, 8 Aug 2017 20:24:05 +0000 (15:24 -0500)]

ibmvnic: Add netdev_dbg output for debugging

To ease debugging of the ibmvnic driver add a series of netdev_dbg()
statements to track driver status, especially during initialization,
removal, and resetting of the driver.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Nathan Fontenot [Tue, 8 Aug 2017 19:28:45 +0000 (14:28 -0500)]

ibmvnic: Clean up resources on probe failure

Ensure that any resources allocated during probe are released if the
probe of the driver fails.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Jim Quigley [Fri, 21 Jul 2017 13:20:15 +0000 (09:20 -0400)]

sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain

Using mpgroup to define multiple paths for a virtual disk causes multiple
virtual-device-port ports to be created for that virtual device.
Each virtual-device-port port then gets a vdisk created for it by the Linux
sunvdc driver. As mpgroup is not supported by the Linux sunvdc driver it
cannot handle multiple ports for a single vdisk, leading to a kernel panic
at startup.

This fix prevents more than one vdisk per virtual-device-port being created
until full virtual disk multipathing (mpgroup) support is implemented.

Signed-off-by: Jim Quigley <Jim.Quigley@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 9 Aug 2017 23:57:38 +0000 (16:57 -0700)]

Merge branch 'rtnetlink-allow-selected-handlers-to-run-without-rtnl'

Florian Westphal says:

====================
rtnetlink: allow selected handlers to run without rtnl

Changes since v1:
In patch 6, don't make ipv6 route handlers lockless, they all have
assumptions on rtnl being held. Other patches are unchanged.

The RTNL mutex is used to serialize both rtnetlink calls and
dump requests.
Its also used to protect other things such as the list of current
net namespaces.

Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding an
ip address prevents other cpus from seemingly unrelated tasks such as
dumping tc classifiers or doing rtnetlink route lookups.

This patch set adds basic infrastructure to start pushing the rtnl lock
down to those places that need it, or even elide it entirely in some cases.

Subsystems can now indicate that their doit() callback can run without
RTNL mutex, such callbacks can then run in parallel.

This will obviously need a lot of followup work; all current
users need to be audited/changed to benefit from this.
Initial no-rtnl spot is netns new/getid.

We have various 'get' handlers that are also a tempting target,
however, several of these depend on rtnl mutex to prevent information
from changing while objects are being read by rtnl handlers; however,
it doesn't appear impossible to change this.

Dumps are another problem entirely, see
commit 2907c35ff64708065 ("net: hold rtnl again in dump callbacks"),
this patchset doesn't touch dump requests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:53 +0000 (20:41 +0200)]

net: call newid/getid without rtnl mutex held

Both functions take nsid_lock and don't rely on rtnl lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:52 +0000 (20:41 +0200)]

rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED

Allow callers to tell rtnetlink core that its doit callback
should be invoked without holding rtnl mutex.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:51 +0000 (20:41 +0200)]

rtnetlink: protect handler table with rcu

Note that netlink dumps still acquire rtnl mutex via the netlink
dump infrastructure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:50 +0000 (20:41 +0200)]

rtnetlink: small rtnl lock pushdown

instead of rtnl lock/unload at the top level, push it down
to the called function.

This is just an intermediate step, next commit switches protection
of the rtnl_link ops table to rcu, in which case (for dumps) the
rtnl lock is acquired only by the netlink dumper infrastructure
(current lock/unlock/dump/lock/unlock rtnl sequence becomes
rcu lock/rcu unlock/dump).

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:49 +0000 (20:41 +0200)]

rtnetlink: add reference counting to prevent module unload while dump is in progress

I don't see what prevents rmmod (unregister_all is called) while a dump
is active.

Even if we'd add rtnl lock/unlock pair to unregister_all (as done here),
thats not enough either as rtnl_lock is released right before the dump
process starts.

So this adds a refcount:
* acquire rtnl mutex
* bump refcount
* release mutex
* start the dump

... and make unregister_all remove the callbacks (no new dumps possible)
and then wait until refcount is 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:48 +0000 (20:41 +0200)]

rtnetlink: make rtnl_register accept a flags parameter

This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Florian Westphal [Wed, 9 Aug 2017 18:41:47 +0000 (20:41 +0200)]

rtnetlink: call rtnl_calcit directly

There is only a single place in the kernel that regisers the "calcit"
callback (to determine min allocation for dumps).

This is in rtnetlink.c for PF_UNSPEC RTM_GETLINK.
The function that checks for calcit presence at run time will first check
the requested family (which will always fail for !PF_UNSPEC as no callsite
registers this), then falls back to checking PF_UNSPEC.

Therefore we can just check if type is RTM_GETLINK and then do a direct
call. Because of fallback to PF_UNSPEC all RTM_GETLINK types used this
regardless of family.

This has the advantage that we don't need to allocate space for
the function pointer for all the other families.

A followup patch will drop the calcit function pointer from the
rtnl_link callback structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 9 Aug 2017 23:53:57 +0000 (16:53 -0700)]

Merge branch 'bpf-new-branches'

Daniel Borkmann says:

====================
bpf: Add BPF_J{LT,LE,SLT,SLE} instructions

This set adds BPF_J{LT,LE,SLT,SLE} instructions to the BPF
insn set, interpreter, JIT hardening code and all JITs are
also updated to support the new instructions. Basic idea is
to reduce register pressure by avoiding BPF_J{GT,GE,SGT,SGE}
rewrites. Removing the workaround for the rewrites in LLVM,
this can result in shorter BPF programs, less stack usage
and less verification complexity. First patch provides some
more details on rationale and integration.

Thanks a lot!

v1 -> v2:
- Reworded commit msg in patch 1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:40:03 +0000 (01:40 +0200)]

bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions

Add test cases to the verifier selftest suite in order to verify that
i) direct packet access, and ii) dynamic map value access is working
with the changes related to the new instructions.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:40:02 +0000 (01:40 +0200)]

bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier

Enable the newly added jump opcodes, main parts are in two
different areas, namely direct packet access and dynamic map
value access. For the direct packet access, we now allow for
the following two new patterns to match in order to trigger
markings with find_good_pkt_pointers():

Variant 1 (access ok when taking the branch):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (ad) if r0 < r3 goto pc+2
  R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
  R3=pkt_end R10=fp
  5: (b7) r0 = 0
  6: (95) exit

  from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
               R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
  7: (71) r0 = *(u8 *)(r2 +0)
  8: (05) goto pc-4
  5: (b7) r0 = 0
  6: (95) exit
  processed 11 insns, stack depth 0

Variant 2 (access ok on fall-through):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (bd) if r3 <= r0 goto pc+1
  R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
  R3=pkt_end R10=fp
  5: (71) r0 = *(u8 *)(r2 +0)
  6: (b7) r0 = 1
  7: (95) exit

  from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
               R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
  6: (b7) r0 = 1
  7: (95) exit
  processed 10 insns, stack depth 0

The above two basically just swap the branches where we need
to handle an exception and allow packet access compared to the
two already existing variants for find_good_pkt_pointers().

For the dynamic map value access, we add the new instructions
to reg_set_min_max() and reg_set_min_max_inv() in order to
learn bounds. Verifier test cases for both are added in a
follow-up patch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:40:01 +0000 (01:40 +0200)]

bpf, nfp: implement jiting of BPF_J{LT,LE}

This work implements jiting of BPF_J{LT,LE} instructions with
BPF_X/BPF_K variants for the nfp eBPF JIT. The two BPF_J{SLT,SLE}
instructions have not been added yet given BPF_J{SGT,SGE} are
not supported yet either.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:40:00 +0000 (01:40 +0200)]

bpf, ppc64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the ppc64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:39:59 +0000 (01:39 +0200)]

bpf, s390x: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the s390x eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:39:58 +0000 (01:39 +0200)]

bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the sparc64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:39:57 +0000 (01:39 +0200)]

bpf, arm64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the arm64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:39:56 +0000 (01:39 +0200)]

bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the x86_64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Daniel Borkmann [Wed, 9 Aug 2017 23:39:55 +0000 (01:39 +0200)]

bpf: add BPF_J{LT,LE,SLT,SLE} instructions

Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.

This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.

Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).

The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)

[1] https://github.com/borkmann/llvm/tree/bpf-insns

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 9 Aug 2017 23:49:17 +0000 (16:49 -0700)]

Merge branch 'net-zerocopy-fixes'

Willem de Bruijn says:

====================
net: zerocopy fixes

Fix two issues introduced in the msg_zerocopy patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Willem de Bruijn [Wed, 9 Aug 2017 23:09:44 +0000 (19:09 -0400)]

sock: fix zerocopy_success regression with msg_zerocopy

Do not use uarg->zerocopy outside msg_zerocopy. In other paths the
field is not explicitly initialized and aliases another field.

Those paths have only one reference so do not need this intermediate
variable. Call uarg->callback directly.

Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Willem de Bruijn [Wed, 9 Aug 2017 23:09:43 +0000 (19:09 -0400)]

sock: fix zerocopy panic in mem accounting

Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
that has initialized its field uarg->mmp.

Before this patch, a vhost-net with experimental_zcopytx can crash in

  mm_unaccount_pinned_pages
  sock_zerocopy_put
  skb_zcopy_clear
  skb_release_data

Only sock_zerocopy_alloc initializes this field. Move the unaccount
call from generic sock_zerocopy_put to its specific callback
sock_zerocopy_callback.

Fixes: a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 9 Aug 2017 23:47:19 +0000 (16:47 -0700)]

Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2017-08-08

This series contains updates to e1000e and igb/igbvf.

Gangfeng Huang fixes an issue with receive network flow classification,
where igb_nfc_filter_exit() was not being called in igb_down() which
would cause the filter tables to "fill up" if a user where to change
the adapter settings (such as speed) which requires a reset of the
adapter.

Cliff Spradlin fixes a timestamping issue, where igb was allowing requests
for hardware timestamping even if it was not configured for hardware
transmit timestamping.

Corinna Vinschen removes the error message that there was an "unexpected
SYS WRAP", when it is actually expected.  So remove the message to not
confuse users.

Greg Edwards provides several patches for the mailbox interface between
the PF and VF drivers.  Added a mailbox unlock method to be used to unlock
the PF/VF mailbox by the PF.  Added a lock around the VF mailbox ops to
prevent the VF from sending another message while the PF is still
processing the previous message.  Fixed a "scheduling while atomic" issue
by changing msleep() to mdelay().

Sasha adds support for the next LOM generations i219 (v8 & v9) which
will be available in the next Intel client platform IceLake.

John Linville adds support for a Broadcom PHY to the igb driver, since
there are designs out in the world which use the igb MAC and a third
party PHY.  This allows the driver to load and function as expected on
these designs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 9 Aug 2017 23:28:45 +0000 (16:28 -0700)]

Merge git://git./linux/kernel/git/davem/net

The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Linus Torvalds [Wed, 9 Aug 2017 21:30:34 +0000 (14:30 -0700)]

Merge tag 'pinctrl-v4.13-2' of git://git./linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
"These are the pin control fixes I have gathered since the return from
  my vacation. They boiled in -next a while so let's get them in.

  Apart from the documentation build it is purely driver fixes. Which is
  nice. The Intel fixes seem kind of important.

   - Fix the documentation build as the docs were moved

   - Correct the UART pin list on the Intel Merrifield

   - Fix pin assignment and number of pins on the Marvell Armada 37xx
     pin controller

   - Cover the Setzer models in the Chromebook DMI quirk in the Intel
     cheryview driver so they start working

   - Add the missing "sim" function to the sunxi driver

   - Fix USB pin definitions on Uniphier Pro4

   - Smatch fix for invalid reference in the zx pin control driver"

* tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: generic: update references to Documentation/pinctrl.txt
  pinctrl: intel: merrifield: Correct UART pin lists
  pinctrl: armada-37xx: Fix number of pin in south bridge
  pinctrl: armada-37xx: Fix the pin 23 on south bridge
  pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk
  pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
  pinctrl: uniphier: fix USB3 pin assignment for Pro4
  pinctrl: zte: fix dereference of 'data' in zx_set_mux()

commit | commitdiff | tree

Mel Gorman [Wed, 9 Aug 2017 07:27:11 +0000 (08:27 +0100)]

futex: Remove unnecessary warning from get_futex_key

Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in
get_futex_key()") removed an unnecessary lock_page() with the
side-effect that page->mapping needed to be treated very carefully.

Two defensive warnings were added in case any assumption was missed and
the first warning assumed a correct application would not alter a
mapping backing a futex key.  Since merging, it has not triggered for
any unexpected case but Mark Rutland reported the following bug
triggering due to the first warning.

  kernel BUG at kernel/futex.c:679!
  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
  Hardware name: linux,dummy-virt (DT)
  task: ffff80001e271780 task.stack: ffff000010908000
  PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
  LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
  pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145

The fact that it's a bug instead of a warning was due to an unrelated
arm64 problem, but the warning itself triggered because the underlying
mapping changed.

This is an application issue but from a kernel perspective it's a
recoverable situation and the warning is unnecessary so this patch
removes the warning.  The warning may potentially be triggered with the
following test program from Mark although it may be necessary to adjust
NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
system.

    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define NR_FUTEX_THREADS 16
    pthread_t threads[NR_FUTEX_THREADS];

    void *mem;

    #define MEM_PROT  (PROT_READ | PROT_WRITE)
    #define MEM_SIZE  65536

    static int futex_wrapper(int *uaddr, int op, int val,
                             const struct timespec *timeout,
                             int *uaddr2, int val3)
    {
        syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    void *poll_futex(void *unused)
    {
        for (;;) {
            futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
        }
    }

    int main(int argc, char *argv[])
    {
        int i;

        mem = mmap(NULL, MEM_SIZE, MEM_PROT,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        printf("Mapping @ %p\n", mem);

        printf("Creating futex threads...\n");

        for (i = 0; i < NR_FUTEX_THREADS; i++)
            pthread_create(&threads[i], NULL, poll_futex, NULL);

        printf("Flipping mapping...\n");
        for (;;) {
            mmap(mem, MEM_SIZE, MEM_PROT,
                 MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        }

        return 0;
    }

Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Wed, 9 Aug 2017 20:21:28 +0000 (13:21 -0700)]

Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
"The main thing is to allow empty id_tables for ACPI to make some
  drivers get probed again. It looks a bit bigger than usual because it
  needs some internal renaming, too.

  Other than that, there is a fix for broken DSTDs, a super simple
  enablement for ARM MPS, and two documentation fixes which I'd like to
  see in v4.13 already"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: rephrase explanation of I2C_CLASS_DEPRECATED
  i2c: allow i2c-versatile for ARM MPS platforms
  i2c: designware: Some broken DSTDs use 1MiHz instead of 1MHz
  i2c: designware: Print clock freq on invalid clock freq error
  i2c: core: Allow empty id_table in ACPI case as well
  i2c: mux: pinctrl: mention correct module name in Kconfig help text

commit | commitdiff | tree

Linus Torvalds [Wed, 9 Aug 2017 17:37:35 +0000 (10:37 -0700)]

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Three patches that should go into this release.

  Two of them are from Paolo and fix up some corner cases with BFQ, and
  the last patch is from Ming and fixes up a potential usage count
  imbalance regression due to the recent NOWAIT work"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed
  block, bfq: consider also in_service_entity to state whether an entity is active
  block, bfq: reset in_service_entity if it becomes idle

commit | commitdiff | tree

Linus Torvalds [Wed, 9 Aug 2017 17:33:49 +0000 (10:33 -0700)]

Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
"Fix two regressions in the inside-secure driver with respect to
  hmac(sha1)"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: inside-secure - fix the sha state length in hmac_sha1_setkey
  crypto: inside-secure - fix invalidation check in hmac_sha1_setkey

commit | commitdiff | tree

Linus Torvalds [Wed, 9 Aug 2017 17:14:04 +0000 (10:14 -0700)]

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"The pull requests are getting smaller, that's progress I suppose :-)

   1) Fix infinite loop in CIPSO option parsing, from Yujuan Qi.

   2) Fix remote checksum handling in VXLAN and GUE tunneling drivers,
      from Koichiro Den.

   3) Missing u64_stats_init() calls in several drivers, from Florian
      Fainelli.

   4) TCP can set the congestion window to an invalid ssthresh value
      after congestion window reductions, from Yuchung Cheng.

   5) Fix BPF jit branch generation on s390, from Daniel Borkmann.

   6) Correct MIPS ebpf JIT merge, from David Daney.

   7) Correct byte order test in BPF test_verifier.c, from Daniel
      Borkmann.

   8) Fix various crashes and leaks in ASIX driver, from Dean Jenkins.

   9) Handle SCTP checksums properly in mlx4 driver, from Davide
      Caratti.

  10) We can potentially enter tcp_connect() with a cached route
      already, due to fastopen, so we have to explicitly invalidate it.

  11) skb_warn_bad_offload() can bark in legitimate situations, fix from
      Willem de Bruijn"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  net: avoid skb_warn_bad_offload false positives on UFO
  qmi_wwan: fix NULL deref on disconnect
  ppp: fix xmit recursion detection on ppp channels
  rds: Reintroduce statistics counting
  tcp: fastopen: tcp_connect() must refresh the route
  net: sched: set xt_tgchk_param par.net properly in ipt_init_target
  net: dsa: mediatek: add adjust link support for user ports
  net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
  qed: Fix a memory allocation failure test in 'qed_mcp_cmd_init()'
  hysdn: fix to a race condition in put_log_buffer
  s390/qeth: fix L3 next-hop in xmit qeth hdr
  asix: Fix small memory leak in ax88772_unbind()
  asix: Ensure asix_rx_fixup_info members are all reset
  asix: Add rx->ax_skb = NULL after usbnet_skb_return()
  bpf: fix selftest/bpf/test_pkt_md_access on s390x
  netvsc: fix race on sub channel creation
  bpf: fix byte order test in test_verifier
  xgene: Always get clk source, but ignore if it's missing for SGMII ports
  MIPS: Add missing file for eBPF JIT.
  bpf, s390: fix build for libbpf and selftest suite
  ...

commit | commitdiff | tree

Vincent Bernat [Tue, 8 Aug 2017 18:23:49 +0000 (20:23 +0200)]

net: ipv6: avoid overhead when no custom FIB rules are installed

If the user hasn't installed any custom rules, don't go through the
whole FIB rules layer. This is pretty similar to f4530fa574df (ipv4:
Avoid overhead when no custom FIB rules are installed).

Using a micro-benchmark module [1], timing ip6_route_output() with
get_cycles(), with 40,000 routes in the main routing table, before this
patch:

    min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 mad=821
    table=254 avgdepth=21.8 maxdepth=39
    value │                         ┊                            count
      600 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                                         199
      880 │▒▒▒░░░░░░░░░░░░░░░░                                      43
     1160 │▒▒▒░░░░░░░░░░░░░░░░░░░░                                  48
     1440 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░                               43
     1720 │▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░                          59
     2000 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                      50
     2280 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                    26
     2560 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                  31
     2840 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░               28
     3120 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              17
     3400 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             17
     3680 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             8
     3960 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           11
     4240 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░            6
     4520 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           6
     4800 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           9

After:

    min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 mad=565
    table=254 avgdepth=21.8 maxdepth=39
    value │                         ┊                            count
      540 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                                        201
      800 │▒▒▒▒▒░░░░░░░░░░░░░░░░                                    63
     1060 │▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░                               68
     1320 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░                            39
     1580 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         32
     1840 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                       32
     2100 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                    34
     2360 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                 33
     2620 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░               26
     2880 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              22
     3140 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              9
     3400 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             8
     3660 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             9
     3920 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░            8
     4180 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           8
     4440 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           8

At the frequency of the host during the bench (~ 3.7 GHz), this is
about a 100 ns difference on the median value.

A next step would be to collapse local and main tables, as in
0ddcf43d5d4a (ipv4: FIB Local/MAIN table collapse).

[1]: https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

John Crispins staging tree

RSS Atom