Ursula Braun [Mon, 23 Jul 2018 11:53:11 +0000 (13:53 +0200)]
net/smc: use DECLARE_BITMAP for rtokens_used_mask
Link group field tokens_used_mask is a bitmap. Use macro
DECLARE_BITMAP for its definition.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jul 2018 11:53:10 +0000 (13:53 +0200)]
net/smc: add function to get link group from link
Replace a frequently used construct with a more readable variant,
reducing the code. Also might come handy when we start to support
more than a single per link group.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 23 Jul 2018 11:53:09 +0000 (13:53 +0200)]
net/smc: eliminate cursor read and write calls
The functions to read and write cursors are exclusively used to copy
cursors. Therefore switch to a respective function instead.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Mon, 23 Jul 2018 11:53:08 +0000 (13:53 +0200)]
net/smc: provide smc mode in smc_diag.c
Rename field diag_fallback into diag_mode and set the smc mode of a
connection explicitly.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Mon, 23 Jul 2018 10:33:08 +0000 (12:33 +0200)]
selftests: forwarding: gre_multipath: Drop IPv6 tests
Support for device-only IPv6 multipath next hops was dropped in
commit
33bd5ac54dc4 ("net/ipv6: Revert attempt to simplify route replace
and append") and as of commit
b5d2d75e079a ("net/ipv6: Do not allow
device only routes via the multipath API"), attempts to add a next hop
like that yield an explicit diagnostic.
Correspondingly, drop the IPv6 parts of GRE multipath test that are
supposed to test that code.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Mon, 23 Jul 2018 08:33:19 +0000 (16:33 +0800)]
ipv6: sr: Use kmemdup instead of duplicating it in parse_nla_srh
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 23 Jul 2018 16:32:15 +0000 (09:32 -0700)]
Merge branch 'net-bridge-add-support-for-backup-port'
Nikolay Aleksandrov says:
====================
net: bridge: add support for backup port
This set introduces a new bridge port option that allows any port to have
any other port (in the same bridge of course) as its backup and traffic
will be forwarded to the backup port when the primary goes down. This is
mainly used in MLAG and EVPN setups where we have peerlink path which is
a backup of many (or even all) ports and is a participating bridge port
itself. There's more detailed information in patch 02. Patch 01 just
prepares the port sysfs code for options that take raw value. The main
issues that this set solves are scalability and fallback latency.
We have used similar code for over 6 months now to bring the fallback
latency of the backup peerlink down and avoid fdb notification storms.
Also due to the nature of master devices such setup is currently not
possible, and last but not least having tens of thousands of fdbs require
thousands of calls to switch.
I've also CCed our MLAG experts that have been using similar option.
Roopa also adds:
"Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
switch-peerlink becomes the failover/backport port when say team0 to
the server goes down.
Today, when team0 goes down, control plane has to withdraw all the fdb
entries pointing to team0
and re-install the fdb entries pointing to switch-peerlink...and
restore the fdb entries when team0 comes back up again.
and this is the problem we are trying to solve.
This also becomes necessary when multihoming is implemented by a
standard like E-VPN https://tools.ietf.org/html/rfc8365#section-8
where the 'switch-peerlink' is an overlay vxlan port (like nikolay
mentions in his patch commit). In these implementations, the fdb scale
can be much larger.
On why bond failover cannot be used here ?: the point that nikolay was
alluding to is, switch-peerlink in the above example is a bridge port
and is a failover/backport port for more than one or all ports in the
bridge br0. And you cannot enslave switch-peerlink into a second level
team
with other bridge ports. Hence a multi layered team device is not an
option (FWIW, switch-peerlink is also a teamed interface to the peer
switch)."
v3: Added Roopa's explanation and diagram
v2: In patch 01 use kstrdup/kfree to avoid casting the const buf. In order
to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside
each branch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 23 Jul 2018 08:16:59 +0000 (11:16 +0300)]
net: bridge: add support for backup port
This patch adds a new port attribute - IFLA_BRPORT_BACKUP_PORT, which
allows to set a backup port to be used for known unicast traffic if the
port has gone carrier down. The backup pointer is rcu protected and set
only under RTNL, a counter is maintained so when deleting a port we know
how many other ports reference it as a backup and we remove it from all.
Also the pointer is in the first cache line which is hot at the time of
the check and thus in the common case we only add one more test.
The backup port will be used only for the non-flooding case since
it's a part of the bridge and the flooded packets will be forwarded to it
anyway. To remove the forwarding just send a 0/non-existing backup port.
This is used to avoid numerous scalability problems when using MLAG most
notably if we have thousands of fdbs one would need to change all of them
on port carrier going down which takes too long and causes a storm of fdb
notifications (and again when the port comes back up). In a Multi-chassis
Link Aggregation setup usually hosts are connected to two different
switches which act as a single logical switch. Those switches usually have
a control and backup link between them called peerlink which might be used
for communication in case a host loses connectivity to one of them.
We need a fast way to failover in case a host port goes down and currently
none of the solutions (like bond) cannot fulfill the requirements because
the participating ports are actually the "master" devices and must have the
same peerlink as their backup interface and at the same time all of them
must participate in the bridge device. As Roopa noted it's normal practice
in routing called fast re-route where a precalculated backup path is used
when the main one is down.
Another use case of this is with EVPN, having a single vxlan device which
is backup of every port. Due to the nature of master devices it's not
currently possible to use one device as a backup for many and still have
all of them participate in the bridge (which is master itself).
More detailed information about MLAG is available at the link below.
https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
Further explanation and a diagram by Roopa:
Two switches acting in a MLAG pair are connected by the peerlink
interface which is a bridge port.
the config on one of the switches looks like the below. The other
switch also has a similar config.
eth0 is connected to one port on the server. And the server is
connected to both switches.
br0 -- team0---eth0
|
-- switch-peerlink
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 23 Jul 2018 08:16:58 +0000 (11:16 +0300)]
net: bridge: add support for raw sysfs port options
This patch adds a new alternative store callback for port sysfs options
which takes a raw value (buf) and can use it directly. It is needed for the
backup port sysfs support since we have to pass the device by its name.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Mon, 23 Jul 2018 03:16:47 +0000 (11:16 +0800)]
net: mediatek: use dma_zalloc_coherent instead of allocator/memset
Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 21 Jul 2018 04:14:39 +0000 (21:14 -0700)]
nfp: avoid buffer leak when FW communication fails
After device is stopped we reset the rings by moving all free buffers
to positions [0, cnt - 2], and clear the position cnt - 1 in the ring.
We then proceed to clear the read/write pointers. This means that if
we try to reset the ring again the code will assume that the next to
fill buffer is at position 0 and swap it with cnt - 1. Since we
previously cleared position cnt - 1 it will lead to leaking the first
buffer and leaving ring in a bad state.
This scenario can only happen if FW communication fails, in which case
the ring will never be used again, so the fact it's in a bad state will
not be noticed. Buffer leak is the only problem. Don't try to move
buffers in the ring if the read/write pointers indicate the ring was
never used or have already been reset.
nfp_net_clear_config_and_disable() is now fully idempotent.
Found by code inspection, FW communication failures are very rare,
and reconfiguring a live device is not common either, so it's unlikely
anyone has ever noticed the leak.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 21 Jul 2018 04:14:38 +0000 (21:14 -0700)]
nfp: bring back support for offloading shared blocks
Now that we have offload replay infrastructure added by
commit
326367427cc0 ("net: sched: call reoffload op on block callback reg")
and flows are guaranteed to be removed correctly, we can revert
commit
951a8ee6def3 ("nfp: reject binding to shared blocks").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitaly Kuznetsov [Fri, 20 Jul 2018 16:33:59 +0000 (18:33 +0200)]
xen-netfront: fix queue name setting
Commit
f599c64fdf7d ("xen-netfront: Fix race between device setup and
open") changed the initialization order: xennet_create_queues() now
happens before we do register_netdev() so using netdev->name in
xennet_init_queue() is incorrect, we end up with the following in
/proc/interrupts:
60: 139 0 xen-dyn -event eth%d-q0-tx
61: 265 0 xen-dyn -event eth%d-q0-rx
62: 234 0 xen-dyn -event eth%d-q1-tx
63: 1 0 xen-dyn -event eth%d-q1-rx
and this looks ugly. Actually, using early netdev name (even when it's
already set) is also not ideal: nowadays we tend to rename eth devices
and queue name may end up not corresponding to the netdev name.
Use nodename from xenbus device for queue naming: this can't change in VM's
lifetime. Now /proc/interrupts looks like
62: 202 0 xen-dyn -event device/vif/0-q0-tx
63: 317 0 xen-dyn -event device/vif/0-q0-rx
64: 262 0 xen-dyn -event device/vif/0-q1-tx
65: 17 0 xen-dyn -event device/vif/0-q1-rx
Fixes: f599c64fdf7d ("xen-netfront: Fix race between device setup and open")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 20 Jul 2018 16:16:02 +0000 (09:16 -0700)]
net/dsa/realtek: add MODULE_LICENSE()
Add MODULE_LICENSE() to net/dsa/realtek.o to fix build warning message.
WARNING: modpost: missing MODULE_LICENSE() in drivers/net/dsa/realtek.o
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Sun, 22 Jul 2018 08:37:31 +0000 (11:37 +0300)]
bonding: don't cast const buf in sysfs store
As was recently discussed [1], let's avoid casting the const buf in
bonding_sysfs_store_option and use kstrndup/kfree instead.
[1] http://lists.openwall.net/netdev/2018/07/22/25
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 22 Jul 2018 16:43:31 +0000 (09:43 -0700)]
Merge branch 'TX-used-ring-batched-updating-for-vhost'
Jason Wang says:
====================
TX used ring batched updating for vhost
This series implement batch updating of used ring for TX. This help to
reduce the cache contention on used ring. The idea is first split
datacopy path from zerocopy, and do only batching for datacopy. This
is because zercopy had already supported its own batching.
TX PPS was increased 25.8% and Netperf TCP does not show obvious
differences.
The split of datapath will also be helpful for future implementation
like in order completion.
====================
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:21 +0000 (08:15 +0800)]
vhost_net: batch update used ring for datacopy TX
Like commit
e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
this patches implements batch used ring update for datacopy TX
(zerocopy has already done some kind of batching).
Testpmd transmission from guest to host (XDP_DROP on tap) shows 25.8%
improvement (from ~3.1Mpps to ~3.9Mpps) on Broadwell i7-5600U CPU @
2.60GHz machine. Netperf TCP tests does not show obvious differences.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:20 +0000 (08:15 +0800)]
vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
A more generic name which could be used for TX as well.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:19 +0000 (08:15 +0800)]
vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()
Rename for reusing this for TX.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:18 +0000 (08:15 +0800)]
vhost_net: split out datacopy logic
Instead of mixing zerocopy and datacopy logics, this patch tries to
split datacopy logic out. This results for a more compact code and
ad-hoc optimization could be done on top more easily.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:17 +0000 (08:15 +0800)]
vhost_net: introduce tx_can_batch()
Introduce tx_can_batch() to determine whether TX could be
batched. This will help to reduce the code duplication in the future.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:16 +0000 (08:15 +0800)]
vhost_net: introduce get_tx_bufs()
Factor out logic of getting tx buffer and iov iter
initialization. This will be used for reducing codes duplication in
the future.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:15 +0000 (08:15 +0800)]
vhost_net: introduce vhost_exceeds_weight()
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:14 +0000 (08:15 +0800)]
vhost_net: introduce helper to initialize tx iov iter
Introduce init_iov_iter() in order to be reused by future patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 20 Jul 2018 00:15:13 +0000 (08:15 +0800)]
vhost_net: drop unnecessary parameter
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Fri, 20 Jul 2018 06:07:42 +0000 (14:07 +0800)]
multicast: remove useless parameter for group add
Remove the mode parameter for igmp/igmp6_group_added as we can get it
from first parameter.
Fixes: 6e2059b53f988 (ipv4/igmp: init group mode as INCLUDE when join source group)
Fixes: c7ea20c9da5b9 (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Railton [Thu, 19 Jul 2018 23:11:46 +0000 (00:11 +0100)]
net: wimax: stack: fixed multi line comment issue
Moved end of comment to it's own line per guide
Signed-off-by: Mark Railton <mark@markrailton.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guenter Roeck [Thu, 19 Jul 2018 16:41:39 +0000 (09:41 -0700)]
net: phy: sfp: Do not use "imply HWMON"
"imply HWMON" was supposed to ensure that the SFP phy code can be built
with HWMON enabled or disabled while at the same time ensuring that
HWMON is not built as module if SFP is built into the kernel.
Unfortunately, that does not work as intended. With "allmodconfig", it
results in several unrelated HWMON drivers to be disabled instead of
being built as module as expected.
Let's use the old "depends on HWMON || HWMON=n" instead. This is slightly
different (it enforces SFP to be built as module if HWMON is built as
module), but it is better than the alternative of using "IS_REACHABLE()"
in the driver since that would disable sensor support if HWMON is built
as module and SFP is built into the kernel.
Fixes: 1323061a018a ("net: phy: sfp: Add HWMON support for module sensors")
Cc: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Thu, 19 Jul 2018 14:18:27 +0000 (22:18 +0800)]
libcxgb: replace vmalloc and memset with vzalloc
Use vzalloc instead of the vmalloc, memset combo
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Thu, 19 Jul 2018 13:57:11 +0000 (21:57 +0800)]
net: hix5hd2_gmac: use dma_zalloc_coherent instead of allocator/memset
Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Thu, 19 Jul 2018 09:16:59 +0000 (17:16 +0800)]
tipc: make some functions static
Fixes the following sparse warnings:
net/tipc/link.c:376:5: warning: symbol 'link_bc_rcv_gap' was not declared. Should it be static?
net/tipc/link.c:823:6: warning: symbol 'link_prepare_wakeup' was not declared. Should it be static?
net/tipc/link.c:959:6: warning: symbol 'tipc_link_advance_backlog' was not declared. Should it be static?
net/tipc/link.c:1009:5: warning: symbol 'tipc_link_retrans' was not declared. Should it be static?
net/tipc/monitor.c:687:5: warning: symbol '__tipc_nl_add_monitor_peer' was not declared. Should it be static?
net/tipc/group.c:230:20: warning: symbol 'tipc_group_find_member' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 19 Jul 2018 04:14:17 +0000 (23:14 -0500)]
net: sched: use PTR_ERR_OR_ZERO macro in tcf_block_cb_register
This line makes up what macro PTR_ERR_OR_ZERO already does. So,
make use of PTR_ERR_OR_ZERO rather than an open-code version.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 17:28:55 +0000 (10:28 -0700)]
Merge branch 'tcp-improve-setsockopt-TCP_USER_TIMEOUT-accuracy'
Jon Maxwell says:
====================
tcp: improve setsockopt() TCP_USER_TIMEOUT accuracy
The patch was becoming bigger based on feedback therefore I have
implemented a series of 3 commits instead in V4.
This series is a continuation based on V3 here and associated feedback:
https://patchwork.kernel.org/patch/
10516195/
Suggestions by Neal Cardwell:
1) Fix up units mismatch regarding msec/jiffies.
2) Address possiblility of time_remaining being negative.
3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto
calculation.
4) Move start_ts logic into helper routine tcp_retrans_stamp() to
validate tcp_sk(sk)->retrans_stamp.
5) Some u32 declation and return refactoring.
6) Return 0 instead of false in tcp_retransmit_stamp(), it's not a bool.
Suggestions by David Laight:
1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
Suggestions by Eric Dumazet:
1) Make u32 declartions consistent.
2) Use patch series for easier review.
3) Convert icsk->icsk_user_timeout to millisconds to avoid jiffie to
msec dance.
4) Use seperate titles for each commit in the series.
5) Fix fuzzy indentation and line wrap issues.
6) Make commit titles descriptive.
Changes:
1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to
inet_csk_reset_xmit_timer() to save on rto declaration.
Every time the TCP retransmission timer fires. It checks to see if
there is a timeout before scheduling the next retransmit timer. The
retransmit interval between each retransmission increases
exponentially. The issue is that in order for the timeout to occur the
retransmit timer needs to fire again. If the user timeout check happens
after the 9th retransmit for example. It needs to wait for the 10th
retransmit timer to fire in order to evaluate whether a timeout has
occurred or not. If the interval is large enough then the timeout will
be inaccurate.
For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
1st retransmit:
22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
Last retransmit:
22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
Timeout:
send: Connection timed out
Sun Jul 1 22:25:34 EDT 2018
We can see that last retransmit took ~7 seconds. Which pushed the total
timeout to ~15 seconds instead of the expected 10 seconds. This gets
more inaccurate the larger the TCP_USER_TIMEOUT value. As the interval
increases.
Add tcp_clamp_rto_to_user_timeout() to determine if the user rto has
expired. Or whether the rto interval needs to be recalculated. Use the
original interval if user rto is not set.
Test results with the patch is the expected 10 second timeout:
1st retransmit:
01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
Last retransmit:
01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
Timeout:
send: Connection timed out
Mon Jul 2 01:38:09 EDT 2018
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maxwell [Thu, 19 Jul 2018 01:14:44 +0000 (11:14 +1000)]
tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy
Create the tcp_clamp_rto_to_user_timeout() helper routine. To calculate
the correct rto, so that the TCP_USER_TIMEOUT socket option is more
accurate. Taking suggestions and feedback into account from
Eric Dumazet, Neal Cardwell and David Laight. Due to the 1st commit we
can avoid the msecs_to_jiffies() and jiffies_to_msecs() dance.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maxwell [Thu, 19 Jul 2018 01:14:43 +0000 (11:14 +1000)]
tcp: Add tcp_retransmit_stamp() helper routine
Create a seperate helper routine as per Neal Cardwells suggestion. To
be used by the final commit in this series and retransmits_timed_out().
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maxwell [Thu, 19 Jul 2018 01:14:42 +0000 (11:14 +1000)]
tcp: convert icsk_user_timeout from jiffies to msecs
This is a preparatory commit. Part of this series that improves the
socket TCP_USER_TIMEOUT option accuracy. Implement Eric Dumazets idea
to convert icsk->icsk_user_timeout from jiffies to msecs. To eliminate
the msecs_to_jiffies() and jiffies_to_msecs() dance in future.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 17:12:30 +0000 (10:12 -0700)]
Merge branch 's390-qeth-updates'
Julian Wiedmann says:
====================
s390/qeth: updates 2018-07-19
please apply one more round of qeth patches to net-next.
This brings additional performance improvements for the transmit code,
and some refactoring to pave the way for using netdev_priv.
Also, two minor fixes for rare corner cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:58 +0000 (12:43 +0200)]
s390/qeth: speed up L2 IQD xmit
Modify the L2 OSA xmit path so that it also supports L2 IQD devices
(in particular, their HW header requirements). This allows IQD devices
to advertise NETIF_F_SG support, and eliminates the allocation overhead
for the HW header.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:57 +0000 (12:43 +0200)]
s390/qeth: add support for constrained HW headers
Some transmit modes require that the HW header is located in the same
page as the initial protocol headers in skb->data. Let callers specify
the size of this contiguous header range, and enforce it when building
the HW header.
While at it, apply some gentle renaming to the relevant L2 code so that
it matches the L3 code.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:56 +0000 (12:43 +0200)]
s390/qeth: merge linearize-check into HW header construction
When checking whether an skb needs to be linearized to fit into an IO
buffer, it's desirable to consider the skb's final size and layout
(ie. after the HW header was added). But a subsequent linearization can
then cause the re-positioned HW header to violate its alignment
restrictions.
Dealing with this situation in two different code paths is quite tricky.
This patch integrates a) linearize-check and b) HW header construction
into one 3 step-sequence:
1. evaluate how the HW header needs to be added (to identify if it takes
up an additional buffer element), then
2. check if the required buffer elements exceed the device's limit.
Linearize when necessary and re-evaluate the HW header placement.
3. Add the HW header in the best-possible way:
a) push, without taking up an additional buffer element
b) push, but consume another buffer element
c) allocate a header object from the cache.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:55 +0000 (12:43 +0200)]
s390/qeth: add statistics for consumed buffer elements
Nowadays an skb fragment typically spans over multiple pages. So replace
the obsolete, SG-only 'fragments' counter with one that tracks the
consumed buffer elements. This is what actually matters for performance.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:54 +0000 (12:43 +0200)]
s390/qeth: use core MTU range checking
qeth's ndo_change_mtu() only applies some trivial bounds checking. Set
up dev->min_mtu properly, so that dev_set_mtu() can do this for us.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:53 +0000 (12:43 +0200)]
s390/qeth: simplify max MTU handling
When the MPC initialization code discovers the HW-specific max MTU,
apply the resulting changes straight to the netdevice.
If this is the device's first initialization, also set its MTU
(HiperSockets: the max MTU; else: a layer-specific default value).
Then cap the current MTU by the new max MTU.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:52 +0000 (12:43 +0200)]
s390/qeth: don't cache HW port number
The netdevice is always available now, so get the portno from there.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:51 +0000 (12:43 +0200)]
s390/qeth: allocate netdevice early
Allocation of the netdevice is currently delayed until a qeth card first
goes online. This complicates matters in several places, where we need
to cache values instead of applying them straight to the netdevice.
Improve on this by moving the allocation up to where the qeth card
itself is created. This is also one step in direction of eventually
placing the qeth card into netdev_priv().
In all subsequent code, remove the now redundant checks whether
card->dev is valid.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:50 +0000 (12:43 +0200)]
s390/qeth: remove redundant netif_carrier_ok() checks
netif_carrier_off() does its own checking.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:49 +0000 (12:43 +0200)]
s390/qeth: reset layer2 attribute on layer switch
After the subdriver's remove() routine has completed, the card's layer
mode is undetermined again. Reflect this in the layer2 field.
If qeth_dev_layer2_store() hits an error after remove() was called, the
card _always_ requires a setup(), even if the previous layer mode is
requested again.
But qeth_dev_layer2_store() bails out early if the requested layer mode
still matches the current one. So unless we reset the layer2 field,
re-probing the card back to its previous mode is currently not possible.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Thu, 19 Jul 2018 10:43:48 +0000 (12:43 +0200)]
s390/qeth: fix race in used-buffer accounting
By updating q->used_buffers only _after_ do_QDIO() has completed, there
is a potential race against the buffer's TX completion. In the unlikely
case that the TX completion path wins, qeth_qdio_output_handler() would
decrement the counter before qeth_flush_buffers() even incremented it.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 15:44:24 +0000 (08:44 -0700)]
Merge branch 'hns3-misc-cleanups'
Salil Mehta says:
====================
Misc. cleanups for HNS3 ethernet driver
This patch-set presents some cleanups for HNS3 Ethernet Driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:06 +0000 (15:47 +0100)]
net: hns3: Add SPDX tags to HNS3 PF driver
Add the SPDX identifiers to HNS3 PF driver.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:05 +0000 (15:47 +0100)]
net: hns3: Remove unused struct member and definition
The struct hclge_desc_cb and hclge_desc_cb are never used in
anywhere. This patch removes them.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:04 +0000 (15:47 +0100)]
net: hns3: Fix misleading parameter name
The input parameter "dev" of hns3_irq_handle() is indeed
used as a tqp vector, it is misleadin.
The struct member "flag" is used to indicate ring type,
so rename it.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:03 +0000 (15:47 +0100)]
net: hns3: Modify inconsistent bit mask macros
Use BIT() and GENMASK() to convert the bit mask, modify
the inconsistent ones, and remove useless ones.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:02 +0000 (15:47 +0100)]
net: hns3: Use decimal for bit offset macros
Using hex for bit offsets is inconsistent with the rest
of the file. Change them to decimal.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:01 +0000 (15:47 +0100)]
net: hns3: Correct unreasonable code comments
This patch fixes some comment spelling errors, removes
redundant comments, rewrites misleading comments, and
adds some necessary comments.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:47:00 +0000 (15:47 +0100)]
net: hns3: Remove extra space and brackets
Remove extra space and brackets.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:46:59 +0000 (15:46 +0100)]
net: hns3: Standardize the handle of return value
Apply the standard minor cleanup by returning ret outside
the brackets.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Thu, 19 Jul 2018 14:46:58 +0000 (15:46 +0100)]
net: hns3: Remove some redundant assignments
Remove some redundant assignments, because they have
been set to zero when allocate hdev.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 06:58:30 +0000 (23:58 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-20
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add sharing of BPF objects within one ASIC: this allows for reuse of
the same program on multiple ports of a device, and therefore gains
better code store utilization. On top of that, this now also enables
sharing of maps between programs attached to different ports of a
device, from Jakub.
2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
detections and unused variable exports, also from Jakub.
3) First batch of RCU annotation fixes in prog array handling, i.e.
there are several __rcu markers which are not correct as well as
some of the RCU handling, from Roman.
4) Two fixes in BPF sample files related to checking of the prog_cnt
upper limit from sample loader, from Dan.
5) Minor cleanup in sockmap to remove a set but not used variable,
from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 06:44:36 +0000 (23:44 -0700)]
Merge branch 'Make-sys-class-net-per-net-namespace-objects-belong-to-container'
Tyler Hicks says:
====================
Make /sys/class/net per net namespace objects belong to container
This is a revival of an older patch set from Dmitry Torokhov:
https://lore.kernel.org/lkml/
1471386795-32918-1-git-send-email-dmitry.torokhov@gmail.com/
My submission of v2 is here:
https://lore.kernel.org/lkml/
1531497949-1766-1-git-send-email-tyhicks@canonical.com/
Here's Dmitry's description:
There are objects in /sys hierarchy (/sys/class/net/) that logically
belong to a namespace/container. Unfortunately all sysfs objects start
their life belonging to global root, and while we could change
ownership manually, keeping tracks of all objects that come and go is
cumbersome. It would be better if kernel created them using correct
uid/gid from the beginning.
This series changes kernfs to allow creating object's with arbitrary
uid/gid, adds get_ownership() callback to ktype structure so subsystems
could supply their own logic (likely tied to namespace support) for
determining ownership of kobjects, and adjusts sysfs code to make use
of this information. Lastly net-sysfs is adjusted to make sure that
objects in net namespace are owned by the root user from the owning
user namespace.
Note that we do not adjust ownership of objects moved into a new
namespace (as when moving a network device into a container) as
userspace can easily do it.
I'm reviving this patch set because we would like this feature for
system containers. One specific use case that we have is that libvirt is
unable to configure its bridge device inside of a system container due
to the bridge files in /sys/class/net/ being owned by init root instead
of container root. The last two patches in this set are patches that
I've added to Dmitry's original set to allow such configuration of the
bridge device.
Eric had previously provided feedback that he didn't favor these changes
affecting all layers of the stack and that most of the changes could
remain local to drivers/base/core.c. That feedback is certainly sensible
but I wanted to send out v2 of the patch set without making that large
of a change since quite a bit of time has passed and the bridge changes
in the last patch of this set shows that not all of the changes will be
local to drivers/base/core.c. I'm happy to make the changes if the
original request still stands.
* Changes since v2:
- Added my Co-Developed-by and Signed-off-by tags to all of Dmitry's
patches that I've modified
- Patch 1 received build failure fixes in
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
- Patch 2 was updated to drop the declaration of sysfs_add_file() from
sysfs.h since the patch removed all other uses of the function
- Patch 5 is a new patch that prevents tx_maxrate from being written
to from inside of a container
+ Maybe I'm being too cautious here but the restriction can always
be loosened up later
- Patches 6 and 7 were updated to make net_ns_get_ownership() always
initialize uid and gid, even when the network namespace is NULL, so
that it isn't a dangerous function to reuse
+ Requested by Christian Brauner
- I've looked at all sysfs attributes affected by this patch set and
feel comfortable about the changes. There are quite a few affected
attributes that don't have any capable()/ns_capable() checks in
their store operations (per_bond_attrs, at91_sysfs_attrs,
sysfs_grcan_attrs, ican3_sysfs_attrs, cdc_ncm_sysfs_attrs,
qmi_wwan_sysfs_attrs) but I think this is acceptable. It means that
container root, rather than specifically CAP_NET_ADMIN inside of the
network namespace that the device belongs to, can write to those
device attributes. It's the same situation that those devices have
today in that init root is able to write to the attributes without
necessarily having CAP_NET_ADMIN. I think that this should probably
be fixed in order to be consistent with what netdev_store() does by
verifying CAP_NET_ADMIN in the network namespace but that it doesn't
need to happen in this patch set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tyler Hicks [Fri, 20 Jul 2018 21:56:54 +0000 (21:56 +0000)]
bridge: make sure objects belong to container's owner
When creating various bridge objects in /sys/class/net/... make sure
that they belong to the container's owner instead of global root (if
they belong to a container/namespace).
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tyler Hicks [Fri, 20 Jul 2018 21:56:53 +0000 (21:56 +0000)]
net: create reusable function for getting ownership info of sysfs inodes
Make net_ns_get_ownership() reusable by networking code outside of core.
This is useful, for example, to allow bridge related sysfs files to be
owned by container root.
Add a function comment since this is a potentially dangerous function to
use given the way that kobject_get_ownership() works by initializing uid
and gid before calling .get_ownership().
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Fri, 20 Jul 2018 21:56:52 +0000 (21:56 +0000)]
net-sysfs: make sure objects belong to container's owner
When creating various objects in /sys/class/net/... make sure that they
belong to container's owner instead of global root (if they belong to a
container/namespace).
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tyler Hicks [Fri, 20 Jul 2018 21:56:51 +0000 (21:56 +0000)]
net-sysfs: require net admin in the init ns for setting tx_maxrate
An upcoming change will allow container root to open some /sys/class/net
files for writing. The tx_maxrate attribute can result in changes
to actual hardware devices so err on the side of caution by requiring
CAP_NET_ADMIN in the init namespace in the corresponding attribute store
operation.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Fri, 20 Jul 2018 21:56:50 +0000 (21:56 +0000)]
driver core: set up ownership of class devices in sysfs
Plumb in get_ownership() callback for devices belonging to a class so that
they can be created with uid/gid different from global root. This will
allow network devices in a container to belong to container's root and not
global root.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Fri, 20 Jul 2018 21:56:49 +0000 (21:56 +0000)]
kobject: kset_create_and_add() - fetch ownership info from parent
This change implements get_ownership() for ksets created with
kset_create_and_add() call by fetching ownership data from parent kobject.
This is done mostly for benefit of "queues" attribute of net devices so
that corresponding directory belongs to container's root instead of global
root for network devices in a container.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Fri, 20 Jul 2018 21:56:48 +0000 (21:56 +0000)]
sysfs, kobject: allow creating kobject belonging to arbitrary users
Normally kobjects and their sysfs representation belong to global root,
however it is not necessarily the case for objects in separate namespaces.
For example, objects in separate network namespace logically belong to the
container's root and not global root.
This change lays groundwork for allowing network namespace objects
ownership to be transferred to container's root user by defining
get_ownership() callback in ktype structure and using it in sysfs code to
retrieve desired uid/gid when creating sysfs objects for given kobject.
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Fri, 20 Jul 2018 21:56:47 +0000 (21:56 +0000)]
kernfs: allow creating kernfs objects with arbitrary uid/gid
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.
When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.
Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 06:37:55 +0000 (23:37 -0700)]
net: Init backlog NAPI's gro_hash.
Based upon a patch by Sean Tranchetti.
Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.")
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 21 Jul 2018 05:28:28 +0000 (22:28 -0700)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree:
1) No need to set ttl from reject action for the bridge family, from
Taehee Yoo.
2) Use a fixed timeout for flow that are passed up from the flowtable
to conntrack, from Florian Westphal.
3) More preparation patches for tproxy support for nf_tables, from Mate
Eckl.
4) Remove unnecessary indirection in core IPv6 checksum function, from
Florian Westphal.
5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it.
From Florian Westphal.
6) socket match now selects socket infrastructure, instead of depending
on it. From Mate Eckl.
7) Patch series to simplify conntrack tuple building/parsing from packet
path and ctnetlink, from Florian Westphal.
8) Fetch timeout policy from protocol helpers, instead of doing it from
core, from Florian Westphal.
9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from
Florian Westphal.
10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES
respectively, instead of IPV6. Patch from Mate Eckl.
11) Add specific function for garbage collection in conncount,
from Yi-Hung Wei.
12) Catch number of elements in the connlimit list, from Yi-Hung Wei.
13) Move locking to nf_conncount, from Yi-Hung Wei.
14) Series of patches to add lockless tree traversal in nf_conncount,
from Yi-Hung Wei.
15) Resolve clash in matching conntracks when race happens, from
Martynas Pumputis.
16) If connection entry times out, remove template entry from the
ip_vs_conn_tab table to improve behaviour under flood, from
Julian Anastasov.
17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng.
18) Call abort from 2-phase commit protocol before requesting modules,
make sure this is done under the mutex, from Florian Westphal.
19) Grab module reference when starting transaction, also from Florian.
20) Dynamically allocate expression info array for pre-parsing, from
Florian.
21) Add per netns mutex for nf_tables, from Florian Westphal.
22) A couple of patches to simplify and refactor nf_osf code to prepare
for nft_osf support.
23) Break evaluation on missing socket, from Mate Eckl.
24) Allow to match socket mark from nft_socket, from Mate Eckl.
25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is
built-in into nf_conntrack. From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jul 2018 21:45:10 +0000 (14:45 -0700)]
Merge ra./pub/scm/linux/kernel/git/torvalds/linux
All conflicts were trivial overlapping changes, so reasonably
easy to resolve.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 20 Jul 2018 21:27:02 +0000 (14:27 -0700)]
Merge tag 'vfio-v4.18-rc6' of git://github.com/awilliam/linux-vfio
Pull VFIO fix from Alex Williamson:
"Harden potential Spectre v1 issue (Gustavo A. R. Silva)"
* tag 'vfio-v4.18-rc6' of git://github.com/awilliam/linux-vfio:
vfio/pci: Fix potential Spectre v1
Linus Torvalds [Fri, 20 Jul 2018 21:24:17 +0000 (14:24 -0700)]
Merge tag 'for-4.18/dm-fixes-2' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mike Snitzer:
"Fix DM writecache target to allow an optional offset to the start of
the data and metadata area.
This allows userspace tools (e.g. LVM2) to place a header and metadata
at the front of the writecache device for its use"
* tag 'for-4.18/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm writecache: support optional offset for start of device
Jon Maloy [Wed, 18 Jul 2018 17:50:06 +0000 (19:50 +0200)]
tipc: make link capability update thread safe
The commit referred to below introduced an update of the link
capabilities field that is not safe. Given the recently added
feature to remove idle node and link items after 5 minutes, there
is a small risk that the update will happen at the very moment the
targeted link is being removed. To avoid this we have to perform
the update inside the node item's write lock protection.
Fixes: 9012de508956 ("tipc: add sequence number check for link STATE messages")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jul 2018 19:33:38 +0000 (12:33 -0700)]
Merge branch 'constify-nla_policy'
Stephen Hemminger says:
====================
constify nla_policy
Almost all places that use nla_policy declare it const.
A couple of drivers didn't but that is fixable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Wed, 18 Jul 2018 16:32:44 +0000 (09:32 -0700)]
gtp: constify nla_policy
The netlink policy structure can be constant like other
drivers.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger [Wed, 18 Jul 2018 16:32:43 +0000 (09:32 -0700)]
nbd: constify nla_policy
The netlink policy should be const like other drivers.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Wed, 18 Jul 2018 13:27:41 +0000 (08:27 -0500)]
tls: Fix copy-paste error in tls_device_reencrypt
It seems that the proper structure to use in this particular
case is *skb_iter* instead of skb.
Addresses-Coverity-ID:
1471906 ("Copy-paste error")
Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 20 Jul 2018 18:47:08 +0000 (11:47 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"A set of 8 obvious fixes.
Three (2 qla2xxx and the cxlflash oopses) are regressions, two from
4.17 and one from the merge window. The hpsa change is user visible,
but it fixes an error users have complained about"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: cxlflash: fix assignment of the backend operations
scsi: qedi: Send driver state to MFW
scsi: qedf: Send the driver state to MFW
scsi: hpsa: correct enclosure sas address
scsi: sd_zbc: Fix variable type and bogus comment
scsi: qla2xxx: Fix NULL pointer dereference for fcport search
scsi: qla2xxx: Fix kernel crash due to late workqueue allocation
scsi: qla2xxx: Fix inconsistent DMA mem alloc/free
Linus Torvalds [Fri, 20 Jul 2018 18:43:21 +0000 (11:43 -0700)]
Merge tag 'iommu-fixes-v4.18-rc5' of git://git./linux/kernel/git/joro/iommu
Pull IOMMU fix from Joerg Roedel:
"Only one revert, for an an Intel VT-d patch that caused issues with
the i915 GPU driver"
* tag 'iommu-fixes-v4.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
Revert "iommu/vt-d: Clean up pasid quirk for pre-production devices"
Linus Torvalds [Fri, 20 Jul 2018 18:37:30 +0000 (11:37 -0700)]
Merge tag 'platform-drivers-x86-v4.18-2' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Andy Shevchenko:
"The Dell laptop ACPI video brightness control is now back after fixing
a regression brought by SMM refactoring"
* tag 'platform-drivers-x86-v4.18-2' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: dell-laptop: Fix backlight detection
Linus Torvalds [Fri, 20 Jul 2018 18:33:22 +0000 (11:33 -0700)]
Merge tag 'arc-4.18-rc6' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
"ARC is back after radio silence in 4.17:
- Fix CONFIG_SWAP [Alexey]
- Robustify cmpxchg emulation for systems w/o atomics [Alexey /
PeterZ]
- Allow mprotext(PROT_EXEC) for stack mappings [Vineet]
- HSDK platform enable PCIe, APG GPIO [Gustavo]
- miscll other fixes, config updates etc"
* tag 'arc-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARCv2: [plat-hsdk]: Save accl reg pair by default
ARC: mm: allow mprotect to make stack mappings executable
ARC: Fix CONFIG_SWAP
ARC: [arcompact] entry.S: minor code movement
ARC: configs: Remove CONFIG_INITRAMFS_SOURCE from defconfigs
ARC: configs: remove no longer needed CONFIG_DEVPTS_MULTIPLE_INSTANCES
ARC: Improve cmpxchg syscall implementation
ARC: [plat-hsdk]: Configure APB GPIO controller on ARC HSDK platform
ARC: [plat-hsdk] Add PCIe support
ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
ARC: Explicitly add -mmedium-calls to CFLAGS
Linus Torvalds [Fri, 20 Jul 2018 18:18:33 +0000 (11:18 -0700)]
Merge tag 'nds32-for-linus-4.18' of git://git./linux/kernel/git/greentime/linux
Pull nds32 updates from Greentime Hu:
"Bug fixes and build ixes for nds32"
* tag 'nds32-for-linus-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/greentime/linux:
nds32: fix build error "relocation truncated to fit: R_NDS32_25_PCREL_RELA" when make allyesconfig
nds32: To simplify the implementation of update_mmu_cache()
nds32: Fix the dts pointer is not passed correctly issue.
nds32: To implement these icache invalidation APIs since nds32 cores don't snoop data cache. This issue is found by Guo Ren. Based on the Documentation/core-api/cachetlb.rst and it says:
nds32: Fix build error caused by configuration flag rename
nds32: define __NDS32_E[BL]__ for sparse
Linus Torvalds [Fri, 20 Jul 2018 18:12:27 +0000 (11:12 -0700)]
Merge tag 'pm-4.18-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a relatively old initialization issue in intel_pstate causing the
pcc-cpufreq driver to be used instead of it on some HP Proliant
systems.
This turned into a functional regression during the 4.17 cycle,
because pcc-cpufreq is a scalability disaster and that was amplified
by the idle loop rework done at that time (Rafael Wysocki).
* tag 'pm-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Register when ACPI PCCH is present
Linus Torvalds [Fri, 20 Jul 2018 18:09:10 +0000 (11:09 -0700)]
Merge tag 'acpi-4.18-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Extend the recently added suspend-to-idle quirk for Thinkpad X1 Carbon
6th to other systems from that familiy which turned out to need it too
(Robin Johnson)"
* tag 'acpi-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / EC: Use ec_no_wakeup on more Thinkpad X1 Carbon 6th systems
Damien Thébault [Wed, 18 Jul 2018 10:06:01 +0000 (12:06 +0200)]
platform/x86: dell-laptop: Fix backlight detection
Fix return code check for "max brightness" ACPI call.
The Dell laptop ACPI video brightness control is not present on dell
laptops anymore, but was present in older kernel versions.
The code that checks the return value is incorrect since the SMM
refactoring.
The old code was:
if (buffer->output[0] == 0)
Which was changed to:
ret = dell_send_request(...)
if (ret)
However, dell_send_request() will return 0 if buffer->output[0] == 0,
so we must change the check to:
if (ret == 0)
This issue was found on a Dell M4800 laptop, and the fix tested on it
as well.
Fixes: 549b4930f057 ("dell-smbios: Introduce dispatcher for SMM calls")
Signed-off-by: Damien Thébault <damien@dtbo.net>
Tested-by: Damien Thébault <damien@dtbo.net>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Reviewed-by: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Lu Baolu [Sun, 8 Jul 2018 06:23:21 +0000 (14:23 +0800)]
Revert "iommu/vt-d: Clean up pasid quirk for pre-production devices"
This reverts commit
ab96746aaa344fb720a198245a837e266fad3b62.
The commit
ab96746aaa34 ("iommu/vt-d: Clean up pasid quirk for
pre-production devices") triggers ECS mode on some platforms
which have broken ECS support. As the result, graphic device
will be inoperable on boot.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107017
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
David S. Miller [Fri, 20 Jul 2018 06:35:38 +0000 (23:35 -0700)]
Merge branch 'qed-Add-support-for-phy-module-query'
Sudarsana Reddy Kalluru says:
====================
qed*: Add support for phy module query.
The patch series adds driver support for querying the PHY module's
eeprom data.
Please consider applying it to 'net-next'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Wed, 18 Jul 2018 13:27:23 +0000 (06:27 -0700)]
qede: Add driver callbacks for eeprom module query.
This patch implements the ethtool callbacks for querying sfp/eeprom module.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Wed, 18 Jul 2018 13:27:22 +0000 (06:27 -0700)]
qed: Add qed APIs for PHY module query.
This patch adds qed APIs for reading the PHY module.
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jul 2018 06:26:02 +0000 (23:26 -0700)]
Merge branch 'tc-tunnel-ttl-tos'
Or Gerlitz says
====================
set/match the tos/ttl fields of TC based IP tunnels
This series comes to address the case to set (encap) and match (decap)
also the tos and ttl fields of TC based IP tunnels.
Example encap (1st one) and decap (2nd) that use the new fields
tc filter add dev eth0_0 protocol ip parent ffff: prio 10 flower \
src_mac e4:11:22:33:44:50 dst_mac e4:11:22:33:44:70 \
action tunnel_key set src_ip 192.168.10.1 dst_ip 192.168.10.2 id 100 dst_port 4789 tos 0x30 \
action mirred egress redirect dev vxlan_sys_4789
tc filter add dev vxlan_sys_4789 protocol ip parent ffff: prio 10 flower \
enc_src_ip 192.168.10.2 enc_dst_ip 192.168.10.1 enc_key_id 100 enc_dst_port 4789 enc_tos 0x30 \
src_mac e4:11:22:33:44:70 dst_mac e4:11:22:33:44:50 \
action tunnel_key unset \
action mirred egress redirect dev eth0_0
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 17 Jul 2018 16:27:18 +0000 (19:27 +0300)]
net/sched: cls_flower: Support matching on ip tos and ttl for tunnels
Allow users to set rules matching on ipv4 tos and ttl or
ipv6 traffic-class and hoplimit of tunnel headers.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 17 Jul 2018 16:27:17 +0000 (19:27 +0300)]
flow_dissector: Dissect tos and ttl from the tunnel info
Add dissection of the tos and ttl from the ip tunnel headers
fields in case a match is needed on them.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Tue, 17 Jul 2018 16:27:16 +0000 (19:27 +0300)]
net/sched: tunnel_key: Allow to set tos and ttl for tc based ip tunnels
Allow user-space to provide tos and ttl to be set for the tunnel headers.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 20 Jul 2018 03:17:47 +0000 (20:17 -0700)]
Merge tag 'drm-fixes-2018-07-20' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Just two sets of driver fixes this week to follow up on the set from
earlier in the week and hopefully get me realigned schedule wise.
amdgpu:
- ACP fix for boards with multiple I2S instances
- DP fix for CZ, vega
- hybrid laptop fixes
- Resume regression fix
nouveau:
- large memory systems and Pascal fix
- MST race fixes
- runtime PM fix"
* tag 'drm-fixes-2018-07-20' of git://anongit.freedesktop.org/drm/drm:
drm/nouveau/fb/gp100-: disable address remapper
drm/amd/amdgpu: creating two I2S instances for stoney/cz (v2)
drm/amdgpu: add another ATPX quirk for TOPAZ
drm/amd/display: Fix DP HBR2 Eye Diagram Pattern on Carrizo
drm/amdgpu: Make sure IB tests flushed after IP resume
drm/nouveau: Set DRIVER_ATOMIC cap earlier to fix debugfs
drm/nouveau: Remove bogus crtc check in pmops_runtime_idle
drm/nouveau/drm/nouveau: Fix runtime PM leak in nv50_disp_atomic_commit()
drm/nouveau: Avoid looping through fake MST connectors
drm/nouveau: Use drm_connector_list_iter_* for iterating connectors
drm/nouveau/gem: off by one bugs in nouveau_gem_pushbuf_reloc_apply()
drm/nouveau/kms/nv50-: ensure window updates are submitted when flushing mst disables
Dave Airlie [Fri, 20 Jul 2018 00:27:41 +0000 (10:27 +1000)]
Merge branch 'linux-4.18' of git://github.com/skeggsb/linux into drm-fixes
- fix problem with pascal and large memory systems
- fix a bunch of MST problems
- fix a runtime PM interaction with MST
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CACAvsv79O8deSts2fxJ_oS6=q8yA+OgwBSEpp5R=BQBmWa+oyg@mail.gmail.com
Dave Airlie [Fri, 20 Jul 2018 00:23:37 +0000 (10:23 +1000)]
Merge branch 'drm-fixes-4.18' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Fixes for 4.18. The ACP patch is a bit bigger than I would like
at this point, but it should have gone in long ago, it just fell
through the cracks. The others are pretty small and straight-forward.
- ACP fix for boards with 2 I2S instances
- DP fix for CZ, vega
- Fix for a hybrid graphics laptop
- Fix a resume regression
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180718162603.2747-1-alexander.deucher@amd.com
Linus Torvalds [Thu, 19 Jul 2018 18:54:04 +0000 (11:54 -0700)]
Merge tag 'pci-v4.18-fixes-3' of git://git./linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- Fix crashes that happen when PHY drivers are left disabled in the V3
Semiconductor, MediaTek, Faraday, Aardvark, DesignWare, Versatile,
and X-Gene host controller drivers (Sergei Shtylyov)
- Fix a NULL pointer dereference in the endpoint library configfs
support (Kishon Vijay Abraham I)
- Fix a race condition in Hyper-V IRQ handling (Dexuan Cui)
* tag 'pci-v4.18-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: v3-semi: Fix I/O space page leak
PCI: mediatek: Fix I/O space page leak
PCI: faraday: Fix I/O space page leak
PCI: aardvark: Fix I/O space page leak
PCI: designware: Fix I/O space page leak
PCI: versatile: Fix I/O space page leak
PCI: xgene: Fix I/O space page leak
PCI: OF: Fix I/O space page leak
PCI: endpoint: Fix NULL pointer dereference error when CONFIGFS is disabled
PCI: hv: Disable/enable IRQs rather than BH in hv_compose_msi_msg()
Vineet Gupta [Tue, 17 Jul 2018 22:21:56 +0000 (15:21 -0700)]
ARCv2: [plat-hsdk]: Save accl reg pair by default
This manifsted as strace segfaulting on HSDK because gcc was targetting
the accumulator registers as GPRs, which kernek was not saving/restoring
by default.
Cc: stable@vger.kernel.org #4.14+
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Linus Torvalds [Thu, 19 Jul 2018 14:43:17 +0000 (07:43 -0700)]
Merge tag 'sound-4.18-rc6' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A rawmidi race fix and three trivial HD-audio quirks"
* tag 'sound-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Yet another Clevo P950 quirk entry
ALSA: rawmidi: Change resized buffers atomically
ALSA: hda/realtek - Add Panasonic CF-SZ6 headset jack quirk
ALSA: hda: add mute led support for HP ProBook 455 G5