Joe Perches [Wed, 12 Mar 2014 17:22:37 +0000 (10:22 -0700)]
lg-vl600: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
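Why the __constant_<foo> spellings are redundant can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel's code: every SKETCH_/sketch_ name is hypothetical, and the sketch assumes a compiler that provides __builtin_constant_p (GCC or Clang). The generic macro already selects a foldable expression when its argument is a compile-time constant, so callers gain nothing from a separate constant-only name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a __constant_<foo> helper: a pure expression
 * that the compiler can fold when x is a compile-time constant. */
#define SKETCH_CONSTANT_SWAB16(x) \
	((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
		    (((uint16_t)(x) & 0xff00u) >> 8)))

/* Hypothetical stand-in for the run-time path. */
static inline uint16_t sketch_runtime_swab16(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

/* Hypothetical stand-in for the plain <foo> spelling: constants take the
 * foldable branch, everything else takes the run-time branch. */
#define SKETCH_SWAB16(x) \
	(__builtin_constant_p(x) ? SKETCH_CONSTANT_SWAB16(x) \
				 : sketch_runtime_swab16((uint16_t)(x)))

/* The constant form is usable where an integer constant expression is
 * required, which is what makes compile-time folding possible. */
static_assert(SKETCH_CONSTANT_SWAB16(0x0800) == 0x0008,
	      "byte swap folded at compile time");
```

Both spellings return the same value for the same input; the only difference in this sketch is which branch the compiler is able to evaluate ahead of time.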
Joe Perches [Wed, 12 Mar 2014 17:22:36 +0000 (10:22 -0700)]
xilinx: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 12 Mar 2014 17:22:30 +0000 (10:22 -0700)]
brocade: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 12 Mar 2014 17:04:20 +0000 (10:04 -0700)]
tipc: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 12 Mar 2014 17:04:18 +0000 (10:04 -0700)]
ieee802154: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 12 Mar 2014 17:04:17 +0000 (10:04 -0700)]
net: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 12 Mar 2014 17:04:15 +0000 (10:04 -0700)]
8021q: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite a while now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Tue, 11 Mar 2014 16:01:24 +0000 (18:01 +0200)]
gianfar: Fix multi-queue support checks @probe()
priv is not instantiated at gfar_of_init() time, when
parsing the DT for info on supported HW queues. Before
the netdev can be allocated, the number of supported
queues must be known. Because the number of supported
queues depends on device type, move the compatibility
checks before netdev allocation. Local vars are used
to hold the operation mode info before netdev allocation.
This fixes the NULL accesses through priv->.. in gfar_of_init().
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 11 Mar 2014 08:24:19 +0000 (16:24 +0800)]
r8152: support dumping the hw counters
Add support for dumping the tally counter via ethtool.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Stilwell [Tue, 11 Mar 2014 00:29:25 +0000 (19:29 -0500)]
ieee802154: at86rf230: add support for rf233 chip
The rf233 and rf231 are sufficiently similar that we can treat
rf233 like rf231.
rf233 is missing some features that rf231 has, but we don't currently
make use of them so there's nothing to handle differently yet.
Should we add support in the future for rf231 *_NOCLK or SLEEP states,
or PAD_IO drive strength, exceptions will need to be made for rf233.
Signed-off-by: Thomas Stilwell <stilwellt@openlabs.co>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 12 Mar 2014 03:54:56 +0000 (23:54 -0400)]
Merge branch 'pkt_sched_cond_resched'
Eric Dumazet says:
====================
pkt_sched: allow scheduling points
We have seen delays of more than 50ms in class or qdisc dumps, in case
device is under high TX stress, even with the prior 4KB per skb limit.
With the new 16KB limit, this could translate to 200ms delays.
Add cond_resched() to give a chance to higher prio tasks to get cpu.
But before doing so, we need to remove the rcu locking from tc_dump_qdisc()
as David spotted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 11 Mar 2014 00:11:43 +0000 (17:11 -0700)]
pkt_sched: add cond_resched() to class and qdisc dump
We have seen delays of more than 50ms in class or qdisc dumps, in case
device is under high TX stress, even with the prior 4KB per skb limit.
Add cond_resched() to give a chance to higher prio tasks to get cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 11 Mar 2014 00:11:42 +0000 (17:11 -0700)]
pkt_sched: do not use rcu in tc_dump_qdisc()
Like all rtnetlink dump operations, we hold RTNL in tc_dump_qdisc(),
so we do not need to use rcu protection to protect list of netdevices.
This will allow preemption to occur, thus reducing latencies.
Following patch adds explicit cond_resched() calls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Mon, 10 Mar 2014 16:48:38 +0000 (09:48 -0700)]
bonding: force cast of IP address in options
The option code is taking IP address and putting it into a generic
container. Force cast to silence sparse warnings.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Mon, 10 Mar 2014 16:41:46 +0000 (09:41 -0700)]
netdev: set __percpu attribute on netdev_alloc_pcpu_stats
This patch fixes sparse warnings in vlan driver.
It propagates the sparse __percpu attribute from alloc_percpu
into netdev_alloc_pcpu_stats. I expect it may trigger additional
sparse warnings from other drivers that are missing the __percpu
attribute.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li RongQing [Tue, 11 Mar 2014 02:40:08 +0000 (10:40 +0800)]
ipv6: ip6_forward: perform skb->pkt_type check at the beginning
Packets which have L2 address different from ours should be
already filtered before entering into ip6_forward().
Perform that check at the beginning to avoid processing such packets.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Tue, 11 Mar 2014 02:20:32 +0000 (10:20 +0800)]
r8152: add skb_cow_head
Call skb_cow_head() before editing the tx packet header. The header
would be reallocated if it is shared.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Mon, 10 Mar 2014 12:12:23 +0000 (13:12 +0100)]
net: eth: cpsw: Use net_device_stats from struct net_device
Instead of keeping its own copy of struct net_device_stats in struct
cpsw_priv, use the stats from struct net_device. Also remove the
.ndo_get_stats function, which is now unnecessary because it just
returns dev->stats, which is the default.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 10 Mar 2014 14:09:07 +0000 (07:09 -0700)]
flowcache: restore a single flow_cache kmem_cache
It is not legal to create multiple kmem_cache having the same name.
flowcache can use a single kmem_cache, no need for a per netns
one.
Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware")
Reported-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gu Zheng [Mon, 10 Mar 2014 01:57:34 +0000 (09:57 +0800)]
net: add a pre-check of net_ns in sk_change_net()
We do not need to switch the net_ns if the target net_ns is the same
as the current one, so add a pre-check of net_ns to avoid
this, as David suggested.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 10 Mar 2014 00:36:02 +0000 (17:36 -0700)]
tcp: timestamp SYN+DATA messages
All skb in socket write queue should be properly timestamped.
In case of FastOpen, we special case the SYN+DATA 'message' as we
queue in the socket write queue the two fallback skbs:
1) SYN message by itself.
2) DATA segment by itself.
We should make sure these skbs have proper timestamps.
Add a WARN_ON_ONCE() to eventually catch future violations.
Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Sun, 9 Mar 2014 23:10:59 +0000 (16:10 -0700)]
hyperv: Change the receive buffer size for legacy hosts
Due to a bug in the Hyper-V host version 2008R2, we need to use a slightly smaller
receive buffer size, otherwise the buffer will not be accepted by the legacy hosts.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Sun, 9 Mar 2014 08:51:40 +0000 (09:51 +0100)]
6lowpan: reassembly: fix access of ctl table entry
The correct offset of the 6lowpanfrag_max_datagram_size value in the proc
entry ctl table is 3, not 2.
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 10 Mar 2014 19:52:17 +0000 (15:52 -0400)]
Merge branch 'hyperv-next'
K. Y. Srinivasan says:
====================
Drivers: net: hyperv: Enable various offloads
This patch set enables both checksum as well as segmentation offload.
As part of this effort I have enabled scatter gather I/O as well.
In version 2 of these patches, I addressed comments from David Miller and
Dan Carpenter.
In this version I have addressed the latest comments from David Miller.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:18 +0000 (19:23 -0800)]
Drivers: net: hyperv: Enable large send offload
Enable segmentation offload.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:17 +0000 (19:23 -0800)]
Drivers: net: hyperv: Enable send side checksum offload
Enable send side checksum offload.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:16 +0000 (19:23 -0800)]
Drivers: net: hyperv: Enable receive side IP checksum offload
Enable receive side checksum offload.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:15 +0000 (19:23 -0800)]
Drivers: net: hyperv: Enable offloads on the host
Prior to enabling guest side offloads, enable the offloads on the host.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:14 +0000 (19:23 -0800)]
Drivers: net: hyperv: Cleanup the send path
In preparation for enabling offloads, cleanup the send path.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Sun, 9 Mar 2014 03:23:13 +0000 (19:23 -0800)]
Drivers: net: hyperv: Enable scatter gather I/O
Cleanup the code and enable scatter gather I/O.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tim Harvey [Sat, 8 Mar 2014 04:59:53 +0000 (20:59 -0800)]
sky2: allow mac to come from dt
The driver reads the mac address from the device registers which would
need to have been programmed by the bootloader. This patch adds
the ability to pull the mac from devicetree via the pci device dt node.
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Cc: netdev@vger.kernel.org
Cc: devicetree@vger.kernel.org
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Changes since v2:
- eliminated use of stack tmpaddr per feedback
Changes since v1:
- simplified based on feedback
- fixed formatting
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 7 Mar 2014 22:57:43 +0000 (14:57 -0800)]
l2tp: fix unused variable warning
net/l2tp/l2tp_core.c:1111:15: warning: unused variable
'sk' [-Wunused-variable]
Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kleber Sacilotto de Souza [Fri, 7 Mar 2014 22:48:25 +0000 (19:48 -0300)]
IB/mlx5_core: remove unreachable function call in module init
The call to mlx5_health_cleanup() in the module init function can never
be reached. Removing it.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 7 Mar 2014 20:02:33 +0000 (12:02 -0800)]
netlink: autosize skb lengths
One known problem with netlink is the fact that NLMSG_GOODSIZE is
really small on PAGE_SIZE==4096 architectures, and it is difficult
to know in advance what buffer size is used by the application.
This patch adds an automatic learning of the size.
First netlink message will still be limited to ~4K, but if user used
bigger buffers, then following messages will be able to use up to 16KB.
This speeds up dump() operations by a large factor and should be safe
for legacy applications.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
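The size learning described in the netlink commit above can be modeled as a small pure-function sketch. It is not the kernel implementation: the function and constant names are hypothetical, 4096 is an assumed stand-in for the "~4K" first message, and only the 16KB ceiling comes from the commit text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stands in for the "~4K" limit on the first message. */
#define SKETCH_FIRST_DUMP_SIZE 4096u
/* The 16KB ceiling mentioned in the commit text. */
#define SKETCH_MAX_DUMP_SIZE 16384u

/* Remember the largest receive buffer the application has used so far,
 * capped at the ceiling. 'learned' is the previously remembered value. */
static inline size_t sketch_learn(size_t learned, size_t user_buf_len)
{
	if (user_buf_len > learned)
		learned = user_buf_len;
	if (learned > SKETCH_MAX_DUMP_SIZE)
		learned = SKETCH_MAX_DUMP_SIZE;
	return learned;
}

/* Size to use for the next dump message: never below the first-message
 * size, and larger only once a bigger user buffer has been observed. */
static inline size_t sketch_dump_alloc_size(size_t learned)
{
	assert(learned <= SKETCH_MAX_DUMP_SIZE);
	return learned > SKETCH_FIRST_DUMP_SIZE ? learned
						: SKETCH_FIRST_DUMP_SIZE;
}
```

In this model an application that always reads with a small buffer keeps getting the original size, which is one way to read the "safe for legacy applications" remark.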
Edward Cree [Fri, 7 Mar 2014 18:27:41 +0000 (18:27 +0000)]
sfc: Use ether_addr_copy and eth_broadcast_addr
Faster than memcpy/memset on some architectures.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 10 Mar 2014 17:17:44 +0000 (13:17 -0400)]
Merge branch 'gianfar-next'
Claudiu Manoil says:
====================
gianfar: Tx timeout issue
There's an older Tx timeout issue showing up on etsec2 devices
with 2 CPUs. I pinned this issue down to processing overhead
incurred by supporting multiple Tx/Rx rings, as explained in
the 2nd patch below. But before this, there's also a concurrency
issue leading to Rx/Tx spurious interrupts, addressed by the
'Tx NAPI' patch below.
The Tx timeout can be triggered with multiple Tx flows,
'iperf -c -P 8' commands, on a 2 CPUs etsec2 based (P1020) board.
Before the patches:
"""
root@p1020rdb-pc:~# iperf -c 172.16.1.3 -n 1000M -P 8 &
[...]
root@p1020rdb-pc:~# NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue 1 timed out
WARNING: at net/sched/sch_generic.c:279
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.0-rc3-03386-g89ea59c #23
task: ed84ef40 ti: ed868000 task.ti: ed868000
NIP: c04627a8 LR: c04627a8 CTR: c02fb270
REGS: ed869d00 TRAP: 0700 Not tainted (3.13.0-rc3-03386-g89ea59c)
MSR: 00029000 <CE,EE,ME> CR: 44000022 XER: 20000000
[...]
root@p1020rdb-pc:~# [ ID] Interval Transfer Bandwidth
[ 5] 0.0-19.3 sec 1000 MBytes 434 Mbits/sec
[ 8] 0.0-39.7 sec 1000 MBytes 211 Mbits/sec
[ 9] 0.0-40.1 sec 1000 MBytes 209 Mbits/sec
[ 3] 0.0-40.2 sec 1000 MBytes 209 Mbits/sec
[ 10] 0.0-59.0 sec 1000 MBytes 142 Mbits/sec
[ 7] 0.0-74.6 sec 1000 MBytes 112 Mbits/sec
[ 6] 0.0-74.7 sec 1000 MBytes 112 Mbits/sec
[ 4] 0.0-74.7 sec 1000 MBytes 112 Mbits/sec
[SUM] 0.0-74.7 sec 7.81 GBytes 898 Mbits/sec
root@p1020rdb-pc:~# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:04:9f:00:13:01
inet addr:172.16.1.1 Bcast:172.16.255.255 Mask:255.255.0.0
inet6 addr: fe80::204:9fff:fe00:1301/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:708722 errors:0 dropped:0 overruns:0 frame:0
TX packets:8717849 errors:6 dropped:0 overruns:1470 carrier:0
collisions:0 txqueuelen:1000
RX bytes:58118018 (55.4 MiB) TX bytes:274069482 (261.3 MiB)
Base address:0xa000
"""
After applying the patches:
"""
root@p1020rdb-pc:~# iperf -c 172.16.1.3 -n 1000M -P 8 &
[...]
root@p1020rdb-pc:~# [ ID] Interval Transfer Bandwidth
[ 9] 0.0-70.5 sec 1000 MBytes 119 Mbits/sec
[ 5] 0.0-70.5 sec 1000 MBytes 119 Mbits/sec
[ 6] 0.0-70.7 sec 1000 MBytes 119 Mbits/sec
[ 4] 0.0-71.0 sec 1000 MBytes 118 Mbits/sec
[ 8] 0.0-71.1 sec 1000 MBytes 118 Mbits/sec
[ 3] 0.0-71.2 sec 1000 MBytes 118 Mbits/sec
[ 10] 0.0-71.3 sec 1000 MBytes 118 Mbits/sec
[ 7] 0.0-71.3 sec 1000 MBytes 118 Mbits/sec
[SUM] 0.0-71.3 sec 7.81 GBytes 942 Mbits/sec
root@p1020rdb-pc:~# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:04:9f:00:13:01
inet addr:172.16.1.1 Bcast:172.16.255.255 Mask:255.255.0.0
inet6 addr: fe80::204:9fff:fe00:1301/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:728446 errors:0 dropped:0 overruns:0 frame:0
TX packets:8690057 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:59732650 (56.9 MiB) TX bytes:271554306 (258.9 MiB)
Base address:0xa000
"""
v2: PATCH 2:
Replaced CPP check with run-time condition to
limit the number of queues. Updated comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Fri, 7 Mar 2014 12:42:46 +0000 (14:42 +0200)]
gianfar: Use Single-Queue polling for "fsl,etsec2"
For the "fsl,etsec2" compatible models the driver currently
supports 8 Tx and Rx DMA rings (aka HW queues). However, there
are only 2 pairs of Rx/Tx interrupt lines, as these controllers
are integrated in low power SoCs with 2 CPUs at most. As a result,
there are at most 2 NAPI instances that have to service multiple
Tx and Rx queues for these devices. This complicates the NAPI
polling routine having to iterate over the multiple Rx/Tx queues
hooked to the same interrupt lines. And there's also an overhead
at HW level, as the controller needs to service all the 8 Tx rings
in a round robin manner. The combined overhead shows up for multi
parallel Tx flows transmitted by the kernel stack, when the driver
usually starts returning NETDEV_TX_BUSY leading to NETDEV WATCHDOG
Tx timeout triggering if the Tx path is congested for too long.
As an alternative, this patch makes the driver support only one
Tx/Rx DMA ring per NAPI instance (per interrupt group or pair
of Tx/Rx interrupt lines) by default. The simplified single queue
polling routine (gfar_poll_sq) will be the default napi poll routine
for the etsec2 devices too. Some adjustments needed to be made to
link the Tx/Rx HW queues with each NAPI instance (2 in this case).
The gfar_poll_sq() is already successfully used by older SQ_SG_MODE
(single interrupt group) controllers.
This patch fixes Tx timeout triggering under heavy Tx traffic load
(i.e. iperf -c -P 8) for the "fsl,etsec2" (currently the only
MQ_MG_MODE devices). There's also a significant memory footprint
reduction by supporting 2 Rx/Tx DMA rings (at most), instead of 8,
for these devices.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Fri, 7 Mar 2014 12:42:45 +0000 (14:42 +0200)]
gianfar: Separate out the Tx interrupt handling (Tx NAPI)
There are some concurrency issues on devices w/ 2 CPUs related
to the handling of Rx and Tx interrupts. eTSEC has separate
interrupt lines for Rx and Tx but a single imask register
to mask these interrupts and a single NAPI instance to handle
both Rx and Tx work. As a result, the Rx and Tx ISRs are
identical, both are invoking gfar_schedule_cleanup(), however
both handlers can be entered at the same time when the Rx and
Tx interrupts are taken by different CPUs. In this case
spurious interrupts (SPU) show up (in /proc/interrupts)
indicating a concurrency issue. Also, Tx overruns followed
by Tx timeout have been observed under heavy Tx traffic load.
To address these issues, the schedule cleanup ISR part has
been changed to handle the Rx and Tx interrupts independently.
The patch adds a separate NAPI poll routine for Tx cleanup to
be triggered independently by the Tx confirmation interrupts
only. Existing poll functions are modified to handle only
the Rx path processing. The Tx poll routine does not need a
budget, since Tx processing doesn't consume NAPI budget, and
hence it is registered with minimum NAPI weight.
NAPI scheduling does not require locking since there are
different NAPI instances between the Rx and Tx confirmation
paths now.
So, the patch fixes the occurrence of spurious Rx/Tx interrupts.
Tx overruns also occur less frequently now.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 7 Mar 2014 10:45:31 +0000 (18:45 +0800)]
vlan: use ether_addr_equal_64bits instead of ether_addr_equal
ether_addr_equal_64bits is more efficient than ether_addr_equal, and
can be used when each argument is an array within a structure that
contains at least two bytes of data beyond the array, so it is safe
to use it for vlan, and it makes sense for the fast path.
Cc: Joe Perches <joe@perches.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 7 Mar 2014 10:45:30 +0000 (18:45 +0800)]
vlan: slight optimization for vlan_do_receive()
Following Joe's suggestion, it may be faster to add an unlikely() to
the test for PACKET_OTHERHOST, so I added it to see whether performance
improves. The difference is small and negligible, but a lower device
rarely sets the skb type to PACKET_OTHERHOST, so I think it makes
sense to add unlikely() to the test.
Cc: Joe Perches <joe@perches.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
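As a rough userspace illustration of the unlikely() hint discussed in the vlan_do_receive() commit above: the hint only tells the compiler which way the branch usually goes and never changes the result of the test. The names below are hypothetical, the macro mirrors the common __builtin_expect idiom (GCC or Clang assumed), and the value 3 for PACKET_OTHERHOST is restated locally rather than taken from a header.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical local copy of the usual branch-hint idiom. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Assumption: local restatement of PACKET_OTHERHOST from if_packet.h. */
#define SKETCH_PACKET_OTHERHOST 3u

/* Hypothetical helper: reports whether the rarely taken OTHERHOST path
 * applies. The hint affects code layout only, not the returned value. */
static inline bool sketch_is_otherhost(unsigned int pkt_type)
{
	/* Assumption: pkt_type is a small field holding values 0..7. */
	assert(pkt_type < 8);

	if (sketch_unlikely(pkt_type == SKETCH_PACKET_OTHERHOST))
		return true;	/* cold path */
	return false;		/* hot path */
}
```

Whether the hint yields a measurable gain depends on the compiler and workload, which matches the commit's own remark that the difference is negligible.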
Eric Dumazet [Fri, 7 Mar 2014 06:57:52 +0000 (22:57 -0800)]
pkt_sched: fq: do not hold qdisc lock while allocating memory
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.
This is definitely not good, as allocation might sleep.
We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 8 Mar 2014 23:49:29 +0000 (18:49 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to e1000e, ixgbevf and igb.
Majority of this series contains fixes and cleanups to e1000e,
most notably are:
Todd provides a fix to PTP in e1000e which adds a lock in e1000e_phc_adjfreq
to prevent concurrent changes to TIMINCA and SYSTIMH/L. Then provides an
igb fix to use ARRAY_SIZE for array size calculation.
David provides the remaining e1000e patches, which contain:
- cleanup of pointer references that are no longer used
- fix an issue on systems with Management Engine enabled with the
ethernet cable unplugged
- fix an issue on 82579 where enabling EEE LPI sooner than one second
after link up causes link issues on some switches
- refactor the power management flows to prevent the suspend path from
being executed twice when hibernating
- refactor the runtime power management to stop it interfering with the
functionality of Energy Efficient Ethernet when enabled and to stop
the device from repeatedly flipping between suspend and resume with the
interface administratively downed
- enable the feature PHY Ultra Low Power Mode which is a power saving
feature that reduces the power consumption of the PHY when a cable is
not connected
- fix the ethtool offline tests for 82579 parts
- fix an SHRA register access error for 82579 parts that was introduced by
previous commit c3a0dce35af0 "e1000e: fix overrun of PHY RAR array"
Florian provides a fix for ixgbevf where skb->pkt_type was being checked
like a bitmask, but it is not a bitmask.
Fix an issue reported by Stephen Hemminger where there was a warning
about code defined but never used if IGB_HWMON is not defined.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher [Thu, 6 Mar 2014 05:28:06 +0000 (05:28 +0000)]
igb: fix warning if !CONFIG_IGB_HWMON
Fix warning about code defined but never used if IGB_HWMON not defined.
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Todd Fujinaka [Tue, 4 Mar 2014 02:25:22 +0000 (02:25 +0000)]
igb: fix array size calculation
Use ARRAY_SIZE for array size calculation.
Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
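The idea behind the igb ARRAY_SIZE change above can be shown with a minimal userspace sketch. The commit text does not show the code it replaced, so the table and names here are purely hypothetical; the point is only that deriving the element count from the array itself keeps loops correct when entries are added or removed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace equivalent of an element-count macro. */
#define SKETCH_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table; its length is never written down by hand. */
static const unsigned short sketch_table[] = { 10, 20, 30, 40, 50 };

/* The count is a compile-time constant derived from the array type. */
static_assert(SKETCH_ARRAY_SIZE(sketch_table) == 5,
	      "element count follows the initializer");

/* Sum every entry; the loop bound tracks the array automatically. */
static inline unsigned int sketch_table_sum(void)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < SKETCH_ARRAY_SIZE(sketch_table); i++)
		sum += sketch_table[i];
	return sum;
}
```

Note that this form only works on true arrays; applied to a pointer it silently computes the wrong count.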
Florian Fainelli [Fri, 28 Feb 2014 23:46:49 +0000 (15:46 -0800)]
ixgbevf: fix skb->pkt_type checks
skb->pkt_type is not a bitmask, but contains only one value at a time from
the range defined in include/uapi/linux/if_packet.h.
Checking it as if it were a bitmask of values would also cause
PACKET_OTHERHOST, PACKET_LOOPBACK and PACKET_FASTROUTE to be matched by
this check, since at least one of their lower 2 bits is also set. Although
this does not fix a real bug, the check is still potentially confusing.
This bogus check was introduced in commit 815cccbf ("ixgbe: add setlink,
getlink support to ixgbe and ixgbevf").
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
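The difference between the two styles of check in the ixgbevf commit above can be sketched as follows. The constants are restated locally under hypothetical SKETCH_ names so the block does not depend on Linux headers; their values are as recalled from include/uapi/linux/if_packet.h at the time and should be treated as an assumption, as should the helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: local copies of the PACKET_* values from if_packet.h. */
enum sketch_pkt_type {
	SKETCH_PACKET_HOST      = 0,
	SKETCH_PACKET_BROADCAST = 1,
	SKETCH_PACKET_MULTICAST = 2,
	SKETCH_PACKET_OTHERHOST = 3,
	SKETCH_PACKET_OUTGOING  = 4,
	SKETCH_PACKET_LOOPBACK  = 5,
	SKETCH_PACKET_FASTROUTE = 6,
};

/* OR-ing the two values yields a mask covering the low two bits. */
static_assert((SKETCH_PACKET_BROADCAST | SKETCH_PACKET_MULTICAST) == 3,
	      "mask covers bits 0 and 1");

/* Bitmask-style check: also matches OTHERHOST, LOOPBACK and FASTROUTE. */
static inline bool sketch_bcast_or_mcast_bitmask(unsigned int pkt_type)
{
	return (pkt_type &
		(SKETCH_PACKET_BROADCAST | SKETCH_PACKET_MULTICAST)) != 0;
}

/* Value-style check: matches exactly the two intended packet types. */
static inline bool sketch_bcast_or_mcast_exact(unsigned int pkt_type)
{
	return pkt_type == SKETCH_PACKET_BROADCAST ||
	       pkt_type == SKETCH_PACKET_MULTICAST;
}
```

For example, OTHERHOST (3) passes the bitmask-style test but not the value-style one, while OUTGOING (4) fails both.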
David Ertman [Wed, 5 Mar 2014 07:50:46 +0000 (07:50 +0000)]
e1000e: Fix SHRA register access for 82579
Previous commit c3a0dce35af0 fixed an overrun for the RAR on i218 devices.
This commit also attempted to homogenize the RAR/SHRA access for all parts
accessed by the e1000e driver. This change introduced an error for
assigning MAC addresses to guest OS's for 82579 devices.
Only RAR[0] is accessible to the driver for 82579 parts, and additional
addresses must be placed into the SHRA[L|H] registers. The rar_entry_count
was changed in the previous commit to an inaccurate value that accounted
for all RAR and SHRA registers, not just the ones usable by the driver.
This patch fixes the count to the correct value and adjusts the
e1000_rar_set_pch2lan() function to use the correct index.
Cc: John Greene <jogreene@redhat.com>
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Wed, 5 Mar 2014 07:54:19 +0000 (07:54 +0000)]
e1000e: Fix ethtool offline tests for 82579 parts
Changes to the rar_entry_count value require a change to the indexing
used to access the SHRA[H|L] registers when testing them with
'ethtool -t <iface> offline'
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Wed, 26 Feb 2014 00:02:24 +0000 (00:02 +0000)]
e1000e: Fix not generating an error on invalid load parameter
Valid values for InterruptThrottleRate are 10-100000, or one of
0, 1, 3, 4. '2' is not valid. This is a legacy of e1000e having
branched from the e1000 driver code.
Prior to this patch, if the e1000e driver was loaded with a forced
invalid InterruptThrottleRate of '2', then no throttle rate would be
set and no error message generated.
Now, a message will be generated that an invalid value was used and the
value for InterruptThrottleRate will be set to the default value.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
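The validation rule stated in the InterruptThrottleRate commit above can be written out as a small sketch. Only the valid ranges come from the commit text; the function names are hypothetical, and the default value is passed in by the caller because the commit text does not state what it is.

```c
#include <assert.h>
#include <stdbool.h>

/* Valid per the commit text: 10-100000, or one of 0, 1, 3, 4. */
static inline bool sketch_itr_is_valid(unsigned int itr)
{
	if (itr >= 10 && itr <= 100000)
		return true;
	return itr == 0 || itr == 1 || itr == 3 || itr == 4;
}

/* Return the requested value if it is valid, otherwise the fallback.
 * 'fallback' is a caller-supplied placeholder for the driver default,
 * which is an assumption of this sketch. A real driver would also log
 * a message when the requested value is rejected. */
static inline unsigned int sketch_itr_sanitize(unsigned int itr,
					       unsigned int fallback)
{
	assert(sketch_itr_is_valid(fallback));
	return sketch_itr_is_valid(itr) ? itr : fallback;
}
```

With this rule a request of 2 is replaced by the fallback instead of being silently ignored, which is the behavior change the commit describes.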
David Ertman [Sat, 22 Feb 2014 03:15:17 +0000 (03:15 +0000)]
e1000e: Feature Enable PHY Ultra Low Power Mode (ULP)
ULP is a power saving feature that reduces the power consumption of the
PHY when a cable is not connected.
ULP is gated on the following conditions:
1) The hardware must support ULP. Currently this is only I218
devices from Intel
2) ULP is initiated by the driver, so, no driver results in no ULP.
3) ULP's implementation utilizes Runtime Power Management to toggle its
execution. ULP is enabled/disabled based on the state of Runtime PM.
4) ULP is not active when wake-on-unicast, multicast or broadcast is active
as these features are mutually-exclusive.
Since the PHY is in an unavailable state while ULP is active, any access
of the PHY registers will fail. This is resolved by utilizing kernel
calls that cause the device to exit Runtime PM (e.g. pm_runtime_get_sync)
and then, after PHY access is complete, allow the device to resume
Runtime PM (e.g. pm_runtime_put_sync).
Under certain conditions, toggling the LANPHYPC is necessary to disable
ULP mode. Break out existing code to toggle LANPHYPC to a new function
to avoid code duplication.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Fri, 14 Feb 2014 07:16:46 +0000 (07:16 +0000)]
e1000e: Refactor of Runtime Power Management
Fix issues with:
- RuntimePM causing the device to repeatedly flip between suspend and resume
with the interface administratively downed.
- Having RuntimePM enabled interfering with the functionality of Energy
Efficient Ethernet.
Add checks to disallow functions that should not be executed if the
device is currently runtime suspended.
Make the runtime_idle callback use the same deterministic behavior as the igb
driver.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Acked-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Fri, 14 Feb 2014 07:16:41 +0000 (07:16 +0000)]
e1000e: Refactor PM flows
Refactor the system power management flows to prevent the suspend path from
being executed twice when hibernating since both the freeze and
poweroff callbacks were set to e1000_suspend() via SET_SYSTEM_SLEEP_PM_OPS.
There are HW workarounds that are performed during this flow and calling
them twice was causing erroneous behavior.
Re-arrange the code to take advantage of common code paths and explicitly
set the individual dev_pm_ops callbacks for suspend, resume, freeze,
thaw, poweroff and restore.
Add a boolean parameter (reset) to the e1000e_down function to allow
for cases when the HW should not be reset when downed during a PM event.
Now that all suspend/shutdown paths result in a call to __e1000_shutdown()
that checks Wake on Lan status, remove the redundant check for WoL in
e1000_power_down_phy().
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Acked-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Wed, 5 Feb 2014 01:09:54 +0000 (01:09 +0000)]
e1000e: Add missing branding strings in ich8lan.c
Branding strings from recently released and soon to be released
hardware configurations that are supported by e1000e.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Acked-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Tue, 4 Feb 2014 01:56:06 +0000 (01:56 +0000)]
e1000e: Cleanup - Update GPL header and Copyright
This patch is to update the GPL header by removing the portion that
refers to the Free Software Foundation address.
Change the copyright date for 2014.
Reformat the header comments to conform to kernel networking coding norms
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Fri, 24 Jan 2014 23:07:48 +0000 (23:07 +0000)]
e1000e: Fix 82579 sets LPI too early.
Enabling EEE LPI sooner than one second after link up on 82579 causes link
issues with some switches.
Remove EEE enablement for 82579 parts from the link initialization flow to
avoid initializing too early. EEE initialization for 82579 will be done
in e1000e_update_phy_task.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Acked-by: Bruce W Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Thu, 23 Jan 2014 06:29:13 +0000 (06:29 +0000)]
e1000e: Resolve issues with Management Engine (ME) briefly blocking PHY resets
On a ME enabled system with the cable out, the driver init flow would
generate an erroneous message indicating that resets were being blocked
by an active ME session. Cause was ME clearing the semaphore bit to
block further PHY resets for up to 50 msec during power-on/cycle. After
this interval, ME would re-set the bit and allow PHY resets.
To resolve this, change the flow of e1000e_phy_hw_reset_generic() to
utilize a delay and retry method. Poll the FWSM register to minimize
any extra time added to the flow. If the delay times out at 100ms
(checked in 10msec increments), then return the value E1000_BLK_PHY_RESET,
as this is the accurate state of the PHY. Attempting to alter just the
call to e1000e_phy_hw_reset_generic() in e1000_init_phy_workarounds_pchlan()
just caused the problem to move further down the flow.
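The delay and retry method can be sketched as follows. This is a user-space
Python model under stated assumptions, not the driver's implementation: the
is_blocked callable stands in for reading the blocking bit from the FWSM
register, and the function name and parameters are hypothetical. Only the
10 msec step and the 100 ms limit come from the description above.

```python
import time

def wait_for_phy_reset_unblocked(is_blocked, step_s=0.010, max_tries=10,
                                 sleep=time.sleep):
    """Poll until PHY resets are allowed.

    Returns True as soon as is_blocked() reports False, or False if resets
    are still blocked after max_tries * step_s seconds (100 ms by default).
    """
    for _ in range(max_tries):
        if not is_blocked():
            return True          # unblocked: no further delay is added
        sleep(step_s)            # wait one 10 msec increment and retry
    return not is_blocked()      # final check after the full timeout
```

A False result corresponds to the case where the caller would report
E1000_BLK_PHY_RESET.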
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Acked-by: Bruce W. Allan <bruce.w.allan@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Wed, 22 Jan 2014 00:21:41 +0000 (00:21 +0000)]
e1000e: Cleanup unnecessary references
Cleaning up some pointer references that are no longer necessary.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Todd Fujinaka [Sat, 18 Jan 2014 06:09:33 +0000 (06:09 +0000)]
e1000e: PTP lock in e1000e_phc_adjustfreq
Add lock in e1000e_phc_adjfreq to prevent concurrent changes to TIMINCA
and SYSTIMH/L.
Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Aring [Fri, 7 Mar 2014 10:06:54 +0000 (11:06 +0100)]
6lowpan: reassembly: fix return of init function
This patch adds a missing return after fragmentation init. Otherwise we
register a sysctl interface and deregister it afterwards which makes no
sense.
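The effect of the missing return can be illustrated with a small Python
model. It is a sketch only: frag_init() and its log argument are
hypothetical, and the real init function performs more steps than the single
registration shown here.

```python
def frag_init(log, missing_return):
    """Model an init function whose success path must not reach the unwind code."""
    log.append("register sysctl")
    if not missing_return:
        return 0  # success: leave the registration in place
    # Error-unwind code; without the return above, success falls through here.
    log.append("unregister sysctl")
    return 0
```

Without the return the model registers and then immediately unregisters the
interface, which is the behaviour the patch removes.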
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Mar 2014 22:08:57 +0000 (17:08 -0500)]
Merge tag 'linux-can-next-for-3.15-20140307' of git://gitorious.org/linux-can/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2014-02-12
this is a pull request of twelve patches for net-next/master.
Alexander Shiyan contributes two patches for the mcp251x, one making
the driver more quiet and the other one improves the compile time
coverage by removing the #ifdef CONFIG_PM_SLEEP. Then two patches for
the flexcan driver by me, one removing the #ifdef CONFIG_PM_SLEEP, too,
the other one making use of platform_get_device_id(). Another patch by
me which converts the janz-ican3 driver to use netdev_<level>(). The
remaining 7 patches are by Oliver Hartkopp, they add CAN FD support to
the netlink configuration interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Mar 2014 21:24:54 +0000 (16:24 -0500)]
Merge branch 'r8152'
Hayes Wang says:
====================
r8152: tx/rx improvement
- Select the suitable spin lock for each function.
- Add additional check to reduce the spin lock.
- Up the priority of the tx to avoid being interrupted by rx.
- Support rx checksum, large send, and IPv6 hw checksum.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:40 +0000 (11:04 +0800)]
r8152: support IPv6
Support hw IPv6 checksum for TCP and UDP packets.
Note that the hw limits the range of the transport offset. In addition,
the TCP pseudo header that the hw uses for IPv6 TSO is based on the
Microsoft document, which excludes the packet length.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:39 +0000 (11:04 +0800)]
r8152: support TSO
Support scatter gather and TSO.
Adjust the tx checksum function and set the max gso size to fix the
size of the tx aggregation buffer.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:38 +0000 (11:04 +0800)]
r8152: support rx checksum
Support hw rx checksum for TCP and UDP packets.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:37 +0000 (11:04 +0800)]
r8152: calculate the dropped packets for rx
Continue processing the remaining rx packets even if the allocation
of the skb fails. This allows the dropped packets to be counted correctly.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:36 +0000 (11:04 +0800)]
r8152: up the priority of the transmission
Move tx_bottom() from delayed_work to a tasklet. This balances rx
and tx. If the device is in runtime suspend when a tx packet arrives,
wake up the device before transmitting.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:35 +0000 (11:04 +0800)]
r8152: check tx agg list before spin lock
Check the tx agg list before taking the spin lock, to avoid taking the
lock every time.
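The pattern of checking before locking can be sketched in user-space Python.
This is an illustrative model only: TxAggPool and its fields are
hypothetical, threading.Lock stands in for the spin lock, and the
lock_acquisitions counter exists purely to make the saving visible.

```python
import threading
from collections import deque

class TxAggPool:
    """Hypothetical pool of free tx aggregation buffers."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free = deque()
        self.lock_acquisitions = 0  # for illustration only

    def get(self):
        # Cheap unlocked check first: skip the lock when nothing is queued.
        if not self.free:
            return None
        with self.lock:
            self.lock_acquisitions += 1
            # Re-check under the lock; another thread may have emptied it.
            return self.free.popleft() if self.free else None
```

When the list is empty, get() returns without touching the lock at all.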
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Fri, 7 Mar 2014 03:04:34 +0000 (11:04 +0800)]
r8152: replace spin_lock_irqsave and spin_unlock_irqrestore
Use spin_lock and spin_unlock in interrupt context.
ndo_start_xmit is not called in interrupt context, so replace the
related spin_lock_irqsave and spin_unlock_irqrestore calls
with spin_lock_bh and spin_unlock_bh.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Mar 2014 21:12:43 +0000 (16:12 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e and i40evf.
Most notable are:
Joseph completes the implementation of the ethtool ntuple rule
management interface by adding the get, update and delete interface
reset.
Akeem provides a fix to prevent a possible overflow in the multiplication
of number and size passed to kzalloc, by using kcalloc instead.
Jesse provides an implementation for skb_set_hash() and adds the L4 type
return when we know it is an L4 hash. He also adds a counter to
statistics for Tx timeouts to help users. Lastly he provides a change
to stay away from the cache line where the done bit may be getting
written back for the transmit ring since the hardware may be writing the
whole cache line for a partial update.
Shannon cleans up code comments.
Anjali removes a firmware workaround for newer firmware since the number
of MSIx vectors is being reported correctly.
v2:
- dropped patch 01 of the series based on feedback from the author
Joe Perches and Shannon Nelson.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Mar 2014 21:08:59 +0000 (16:08 -0500)]
Merge tag 'rxrpc-devel-20140304' of git://git./linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
net-next: AF_RXRPC fixes and development
Here are some AF_RXRPC fixes:
(1) Fix to remove incorrect checksum calculation made during recvmsg(). It's
unnecessary to try to do this there since we check the checksum before
reading the RxRPC header from the packet.
(2) Fix to prevent the sending of an ABORT packet in response to another
ABORT packet and inducing a storm.
(3) Fix UDP MTU calculation from parsing ICMP_FRAG_NEEDED packets where we
don't handle the ICMP packet not specifying an MTU size.
And development patches:
(4) Add sysctls for configuring RxRPC parameters, specifically various delays
pertaining to ACK generation, the time before we resend a packet for
which we don't receive an ACK, the maximum time a call is permitted to
live and the amount of time transport, connection and dead call
information is cached.
(5) Improve ACK packet production by adjusting the handling of ACK_REQUESTED
packets, ignoring the MORE_PACKETS flag, delaying the production of
otherwise immediate ACK_IDLE packets and delaying all ACK_IDLE production
(barring the call termination) to half a second.
(6) Add more sysctl parameters to expose the Rx window size, the maximum
packet size that we're willing to receive and the number of jumbo rxrpc
packets we're willing to handle in a single UDP packet.
(7) Request ACKs on alternate DATA packets so that the other side doesn't
wait till we fill up the Tx window.
(8) Use a RCU hash table to look up the rxrpc_call for an incoming packet
rather than stepping through a hierarchy involving several spinlocks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Mar 2014 20:57:47 +0000 (15:57 -0500)]
Merge branch 'xen-netback-next'
Zoltan Kiss says:
====================
xen-netback: TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy
A long known problem of the upstream netback implementation is that on the TX
path (from guest to Dom0) it copies the whole packet from guest memory into
Dom0. That simply became a bottleneck with 10Gb NICs, and generally it's a
huge performance penalty. The classic kernel version of netback used grant
mapping, and to get notified when the page can be unmapped, it used page
destructors. Unfortunately that destructor is not an upstreamable solution.
Ian Campbell's skb fragment destructor patch series [1] tried to solve this
problem, however it seems to be very invasive on the network stack's code,
and therefore hasn't progressed very well.
This patch series uses SKBTX_DEV_ZEROCOPY flags to tell the stack it needs to
know when the skb is freed up. That is the way KVM solved the same problem,
and based on my initial tests it can do the same for us. Avoiding the extra
copy boosted up TX throughput from 6.8 Gbps to 7.9 (I used a slower AMD
Interlagos box, both Dom0 and guest on upstream kernel, on the same NUMA node,
running iperf 2.0.5, and the remote end was a bare metal box on the same 10Gb
switch)
Based on my investigations the packet only gets copied if it is delivered to
Dom0 IP stack through deliver_skb, which is due to this [2] patch. This affects
DomU->Dom0 IP traffic and when Dom0 does routing/NAT for the guest. That's a bit
unfortunate, but luckily it doesn't cause a major regression for this usecase.
In the future we should try to eliminate that copy somehow.
There are a few spinoff tasks which will be addressed in separate patches:
- grant copy the header directly instead of map and memcpy. This should help
us avoid TLB flushing
- use something other than ballooned pages
- fix grant map to use page->index properly
I've tried to break it down into smaller patches, with mixed results, so I
welcome suggestions on that part as well:
1: Use skb->cb to store pending_idx
2: Some refactoring
3: Change RX path for mapped SKB fragments (moved here to keep bisectability,
review it after #4)
4: Introduce TX grant mapping
5: Remove old TX grant copy definitions and fix indentations
6: Add stat counters for zerocopy
7: Handle guests with too many frags
8: Timeout packets in RX path
9: Aggregate TX unmap operations
v2: I've fixed some smaller things, see the individual patches. I've added a
few new stat counters, and handling of the important use case when an older
guest sends lots of slots. Instead of delayed copy we now time out packets on
the RX path, based on the assumption that otherwise packets should not get
stuck anywhere else. Finally some unmap batching to avoid too many TLB flushes.
v3: Apart from fixing a few things mentioned in responses, the important change
is the use of the hypercall directly for grant [un]mapping, therefore we can
avoid m2p override.
v4: Now we are using a new grant mapping API to avoid m2p_override. The RX queue
timeout logic changed also.
v5: Only minor fixes based on Wei's comments
v6: Important bugfixes for xenvif_poll exit path and zerocopy callback, see
first 2 patches. Also rework of handling packets with too many slots, and
reorder the series a bit.
v7: Small fixes in comments/log messages/error paths, and merging the frag
overflow stats patch into its parent.
[1] http://lwn.net/Articles/491522/
[2] https://lkml.org/lkml/2012/7/20/363
====================
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:31 +0000 (21:48 +0000)]
xen-netback: Aggregate TX unmap operations
Unmapping causes TLB flushing, therefore we should do it in the largest
possible batches. However we shouldn't starve the guest for too long. So if
the guest has space for at least two big packets and we don't have at least a
quarter ring to unmap, delay it for at most 1 millisecond.
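The batching rule can be written as a small predicate. The Python below is
a sketch under assumptions: the function and parameter names are
hypothetical, and how many slots a "big packet" needs is left to the caller.
Only the two thresholds and the 1 millisecond cap come from the text above.

```python
MAX_UNMAP_DELAY_S = 0.001  # delay for at most 1 millisecond

def should_delay_unmap(free_slots, pending_unmaps, ring_size,
                       slots_per_big_packet):
    """Return True when unmapping may be postponed to build a larger batch."""
    # The guest is not starved: it still has room for two big packets.
    guest_has_room = free_slots >= 2 * slots_per_big_packet
    # The batch is still small: less than a quarter of the ring is pending.
    batch_is_small = pending_unmaps < ring_size // 4
    return guest_has_room and batch_is_small
```

For example, with a 256-slot ring a batch of 64 pending unmaps is already a
quarter ring, so it is unmapped without further delay.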
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:30 +0000 (21:48 +0000)]
xen-netback: Timeout packets in RX path
A malicious or buggy guest can leave its queue filled indefinitely, in which
case qdisc starts to queue packets for that VIF. If those packets came from
another guest, this can block that guest's slots and prevent shutdown. To avoid
that, we make sure the queue is drained every 10 seconds.
In the worst case the QDisc queue usually takes 3 rounds to flush.
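A minimal sketch of the timeout idea, in user-space Python. The names are
hypothetical and the model deliberately ignores how the real driver detects
a stalled queue; it only shows that queued packets are dropped once the
stall has lasted the full 10 seconds.

```python
RX_DRAIN_TIMEOUT_S = 10.0  # taken from the description above

def drain_if_stalled(queue, stalled_since, now):
    """Drop everything queued once the stall has lasted the full timeout.

    queue is a list of pending packets, stalled_since is the time the guest
    stopped consuming (or None if it is not stalled). Returns the number of
    packets dropped.
    """
    if stalled_since is not None and now - stalled_since >= RX_DRAIN_TIMEOUT_S:
        dropped = len(queue)
        queue.clear()  # frees the slots held on behalf of other guests
        return dropped
    return 0
```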
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:29 +0000 (21:48 +0000)]
xen-netback: Handle guests with too many frags
Xen network protocol had implicit dependency on MAX_SKB_FRAGS. Netback has to
handle guests sending up to XEN_NETBK_LEGACY_SLOTS_MAX slots. To achieve that:
- create a new skb
- map the leftover slots to its frags (no linear buffer here!)
- chain it to the previous through skb_shinfo(skb)->frag_list
- map them
- copy and coalesce the frags into a brand new one and send it to the stack
- unmap the 2 old skb's pages
It also introduces new stat counters, which help determine how often the guest
sends a packet with more than MAX_SKB_FRAGS frags.
NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise malicious guests can block
other guests by not releasing their sent packets.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:28 +0000 (21:48 +0000)]
xen-netback: Add stat counters for zerocopy
These counters help determine how often the buffers had to be copied. Also
they help find out if packets are leaked, as if "sent != success + fail",
there are probably packets never freed up properly.
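The leak check is plain arithmetic on the three counters. A hedged Python
sketch, with a hypothetical function name; note that packets still in flight
also show up as a non-zero difference, so the result is only meaningful once
traffic has settled.

```python
def zerocopy_unaccounted(sent, success, fail):
    """Packets sent with zerocopy that were reported neither as success nor fail."""
    return sent - (success + fail)
```

A result of 0 means every zerocopy packet has been accounted for.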
NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise Windows guests can't work
properly and malicious guests can block other guests by not releasing their sent
packets.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:27 +0000 (21:48 +0000)]
xen-netback: Remove old TX grant copy definitions and fix indentations
These became obsolete with grant mapping. I had intentionally left the
indentation as it was, to improve the readability of the previous patches.
NOTE: if bisect brought you here, you should apply the series up until
"xen-netback: Timeout packets in RX path", otherwise Windows guests can't work
properly and malicious guests can block other guests by not releasing their sent
packets.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:26 +0000 (21:48 +0000)]
xen-netback: Introduce TX grant mapping
This patch introduces grant mapping on netback TX path. It replaces grant copy
operations, ditching grant copy coalescing along the way. Another solution for
copy coalescing is introduced in "xen-netback: Handle guests with too many
frags"; older guests and Windows guests can break until that patch is applied.
There is a callback (xenvif_zerocopy_callback) from the core stack to release
the slots back to the guests when kfree_skb or skb_orphan_frags is called. It
feeds a separate dealloc thread, as scheduling the NAPI instance from there is
inefficient, therefore we can't do dealloc from the instance.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:25 +0000 (21:48 +0000)]
xen-netback: Handle foreign mapped pages on the guest RX path
The RX path needs to know if the SKB fragments are stored on pages from another
domain.
Logically this patch should be after introducing the grant mapping itself, as
it makes sense only after that. But to keep bisectability, I moved it here. It
shouldn't change any functionality here. xenvif_zerocopy_callback and
ubuf_to_vif are just stubs here, they will be introduced properly later on.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:24 +0000 (21:48 +0000)]
xen-netback: Minor refactoring of netback code
This patch contains a few bits of refactoring before introducing the grant
mapping changes:
- introducing xenvif_tx_pending_slots_available(), as this is used several
times, and will be used more often
- rename the thread to vifX.Y-guest-rx, to signify it does RX work from the
guest point of view
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zoltan Kiss [Thu, 6 Mar 2014 21:48:23 +0000 (21:48 +0000)]
xen-netback: Use skb->cb for pending_idx
Storing the pending_idx at the first byte of the linear buffer never looked
good; skb->cb is a more proper place for it. It also prevents the header from
being directly grant copied there, and we don't have the pending_idx after we
have copied the header here, so it's time to change it.
It also introduces helpers for the RX side.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 6 Mar 2014 02:19:34 +0000 (18:19 -0800)]
l2tp: keep original skb ownership
There is no reason to orphan skb in l2tp.
This breaks things like per socket memory limits, TCP Small queues...
Fix this before more people copy/paste it.
This is very similar to commit 8f646c922d550
("vxlan: keep original skb ownership")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 5 Mar 2014 22:08:38 +0000 (14:08 -0800)]
tcp: do not leak non zero tstamp in output packets
Usage of skb->tstamp should remain private to TCP stack
(only set on packets on write queue, not on cloned ones)
Otherwise, packets given to loopback interface with a non null tstamp
can confuse netif_rx() / net_timestamp_check()
Other possibility would be to clear tstamp in loopback_xmit(),
as done in skb_scrub_packet()
Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:25 +0000 (16:36 +0100)]
can: add bittiming check at interface open for CAN FD
In addition to having the second (data) bitrate available, the data bitrate
has to be greater than or equal to the arbitration bitrate in CAN FD.
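The check amounts to two conditions. A minimal Python sketch, assuming that
an unset bitrate is represented as 0; the function name is hypothetical and
this is not the driver code.

```python
def canfd_bitrates_valid(arbitration_bitrate, data_bitrate):
    """Return True if a CAN FD interface may be opened with these bitrates."""
    # The data bitrate must be configured at all ...
    if data_bitrate == 0:
        return False
    # ... and must not be slower than the arbitration bitrate.
    return data_bitrate >= arbitration_bitrate
```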
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:24 +0000 (16:36 +0100)]
can: allow to change the device mtu for CAN FD capable devices
The configuration for CAN FD depends on CAN_CTRLMODE_FD enabled in the driver
specific ctrlmode_supported capabilities.
The configuration can be done either with the 'fd { on | off }' option in the
'ip' tool from iproute2 or by setting the CAN netdevice MTU to CAN_MTU (16) or
to CANFD_MTU (72).
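The MTU-based selection can be modelled as below. This Python sketch uses
the two MTU values given above; the function name, the fd_supported flag
(standing in for CAN_CTRLMODE_FD being present in ctrlmode_supported) and
the string results are assumptions made for illustration.

```python
CAN_MTU = 16    # classic CAN frame size, as stated above
CANFD_MTU = 72  # CAN FD frame size, as stated above

def select_mode_by_mtu(new_mtu, fd_supported):
    """Map a requested MTU to a mode, or None if the request is rejected."""
    if new_mtu == CAN_MTU:
        return "classic"          # FD switched off
    if new_mtu == CANFD_MTU and fd_supported:
        return "fd"               # FD switched on
    return None                   # unknown MTU, or FD not supported
```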
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:23 +0000 (16:36 +0100)]
can: introduce the data bitrate configuration for CAN FD
As CAN FD offers a second bitrate for the data section of the CAN frame the
infrastructure for storing and configuring this second bitrate is introduced.
Improved the readability of the if-statement by inserting some newlines.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:22 +0000 (16:36 +0100)]
can: provide a separate bittiming_const parameter to bittiming functions
As the bittiming calculation functions are to be used with different
bittiming_const structures for CAN and CAN FD the direct reference to
priv->bittiming_const inside these functions has to be removed.
Also moved the check for existing bittiming const to one place.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:21 +0000 (16:36 +0100)]
can: move sanity check for bitrate and tq into can_get_bittiming
This patch moves a sanity check in order to have a second user for CAN FD.
Also simplify the return value generation in can_get_bittiming() as only
correct return values of can_[calc|fixup]_bittiming() lead to a return value of
zero.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:20 +0000 (16:36 +0100)]
can: only send bitrate data via netlink when available
When setting the bitrate, both can_calc_bittiming() and can_fixup_bittiming()
set the bitrate variable when a proper bit timing is available.
Only then is the bitrate configuration stored for the device, so checking
priv->bittiming.bitrate is always sufficient.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Fri, 28 Feb 2014 15:36:19 +0000 (16:36 +0100)]
can: preserve skbuff protocol in can_put_echo_skb
The skbuff protocol value was formerly fixed/sanitized to ETH_P_CAN in
can_put_echo_skb(). With CAN FD this value has to be preserved.
This patch changes the hard assignment of the protocol value to a check of
valid protocol values for CAN and CAN FD.
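The difference between the hard assignment and the check can be sketched in
Python. The two EtherType constants are the values defined in
linux/if_ether.h; the function names are hypothetical, and what the driver
does with a rejected skb is not modelled.

```python
ETH_P_CAN = 0x000C    # classic CAN frames
ETH_P_CANFD = 0x000D  # CAN FD frames

def old_echo_protocol(protocol):
    """Former behaviour: overwrite whatever was set with ETH_P_CAN."""
    return ETH_P_CAN

def echo_protocol_valid(protocol):
    """New behaviour: keep the value, but accept only CAN or CAN FD."""
    return protocol in (ETH_P_CAN, ETH_P_CANFD)
```

With the old behaviour a CAN FD frame would come back labelled as classic
CAN, which is why the value has to be preserved.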
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Marc Kleine-Budde [Tue, 4 Mar 2014 21:00:47 +0000 (22:00 +0100)]
can: janz-ican3: convert dev_<level> printing to netdev_<level>
This patch converts the dev_<level> printing to netdev_<level>, this makes it
possible to remove the "struct device *dev" pointer from the "struct
ican3_dev".
Cc: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Catherine Sullivan [Thu, 6 Feb 2014 05:51:14 +0000 (05:51 +0000)]
i40e/i40evf: Bump pf&vf build versions
Bump i40e to 0.3.34 and i40evf to 0.9.14.
Change-ID: I6b3fb8ccf55b128d2baa4bdc20d3911ec81d4a5b
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Thu, 6 Feb 2014 05:51:13 +0000 (05:51 +0000)]
i40e/i40evf: carefully fill tx ring
We need to make sure that we stay away from the cache line
where the DD bit (done) may be getting written back for
the transmit ring since the hardware may be writing the
whole cache line for a partial update.
Change-ID: Id0b6dfc01f654def6a2a021af185803be1915d7e
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Thu, 6 Feb 2014 05:51:12 +0000 (05:51 +0000)]
i40e: fix nvm version and remove firmware report
The driver needs to use the format that the current NVM
uses when printing the version of the NVM. It should remain
this way from now on forward.
The driver was reporting when firmware was less than
an expected version number, but this is not a requirement
for the product and we print the firmware number at
init and in ethtool -i output. Just remove the print.
Change-ID: Ide0b856cd454ebf867610ef9a0d639bb358a4a60
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Thu, 6 Feb 2014 05:51:11 +0000 (05:51 +0000)]
i40e: Fix static checker warning
This patch fixes the following static checker warning:
drivers/net/ethernet/intel/i40e/i40e_dcb.c:342
i40e_lldp_to_dcb_config() warn: 'tlv' can't be NULL.
The exit criterion for the while loop is encountering the LLDP END
TLV or the TLV length going beyond the buffer length.
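A loop with these two exit conditions can be sketched in Python. This is
not the i40e code: walk_tlvs() is a hypothetical helper that relies only on
the standard LLDP TLV header layout (a 7-bit type followed by a 9-bit
length, with type 0 marking the end of the LLDPDU).

```python
def walk_tlvs(buf):
    """Yield (type, value) pairs until the End TLV or the buffer runs out."""
    offset = 0
    while offset + 2 <= len(buf):
        header = int.from_bytes(buf[offset:offset + 2], "big")
        tlv_type, length = header >> 9, header & 0x1FF
        if tlv_type == 0:                     # End of LLDPDU TLV
            break
        if offset + 2 + length > len(buf):    # value would overrun the buffer
            break
        yield tlv_type, buf[offset + 2:offset + 2 + length]
        offset += 2 + length
```

Because the loop condition is based on offsets rather than on a pointer
comparison against NULL, no separate NULL check is needed.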
Change-ID: I7548b16db90230ec2ba0fa791b0343ca8b7dd5bb
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Tested-by: Jack Morgan <jack.morgan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Thu, 6 Feb 2014 05:51:10 +0000 (05:51 +0000)]
i40e: Remove a redundant filter addition
Remove a redundant filter addition to stop FW complaints about a redundant
filter removal.
Change-ID: I22bef6b682bd8d43432557e6e2b3e73ffb27b985
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Thu, 6 Feb 2014 05:51:09 +0000 (05:51 +0000)]
i40e: count timeout events
The ethtool -S statistics should have a counter for
tx timeouts in order to better help inform the masses.
Change-ID: Ice4b20ed4a151509f366719ab105be49c9e7b2b4
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Thu, 6 Feb 2014 05:51:08 +0000 (05:51 +0000)]
i40e: Remove a FW workaround for Number of MSIX vectors
The Number of MSIX vectors being reported is correct and hence
we need a check to do the right thing for FWs before and after.
Change-ID: I50902d1c848adcb960ea49ac73f7865ca871a1c3
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 6 Feb 2014 05:51:06 +0000 (05:51 +0000)]
i40e: clean up comment style
Lots of trivial changes to remove double spaces in function headers,
unnecessary periods in short comments, and adjust the English usage here
and there.
No actual code was harmed in the making of this patch.
Change-ID: I6e756c500756945e81a61ffb10221753eb7923ea
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Wed, 12 Feb 2014 01:45:33 +0000 (01:45 +0000)]
i40e/i40evf: i40e implementation for skb_set_hash
Original comment from Tom Herbert <therbert@google.com>
Drivers should call skb_set_hash to set the hash and its type
in an skbuff.
This patch builds upon Tom's original implementation and adds
the L4 type return when we know it is an L4 hash.
This requires use of the ptype decoder ring, so enable it.
Change-ID: I2f9fa86d1a6add58cff13386f7f4238b1abcc468
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Akeem G Abodunrin [Thu, 6 Feb 2014 05:51:02 +0000 (05:51 +0000)]
i40e: Prevent overflow due to kzalloc
To prevent the possibility of overflow due to the multiplication of number and
size, use kcalloc instead of kzalloc.
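Why the open-coded multiplication is risky can be shown with a small Python
model of size_t arithmetic. This is an illustration only: a 64-bit size_t
is assumed, both helper names are hypothetical, and the real allocators
return pointers rather than sizes.

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t

def kzalloc_request(n, size):
    """Size that reaches the allocator for kzalloc(n * size): wraps silently."""
    return (n * size) & SIZE_MAX

def kcalloc_request(n, size):
    """Model of an array allocator's check: None when n * size would overflow."""
    if size != 0 and n > SIZE_MAX // size:
        return None
    return n * size
```

In the model, a wrapped product yields a much smaller request than intended,
while the checked variant refuses the allocation.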
Change-ID: Ibe4d81ed7d9738d3bbe66ee4844ff9be817e8080
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Joseph Gasparakis [Wed, 12 Feb 2014 01:45:30 +0000 (01:45 +0000)]
i40e: Flow Director sideband accounting
This patch completes implementation of the ethtool ntuple
rule management interface. It adds the get, update and delete
interface reset.
Change-ID: Ida7f481d9ee4e405ed91340b858eabb18a52fdb5
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>