Jack Morgenstein [Sun, 8 Dec 2013 14:50:17 +0000 (16:50 +0200)]
mlx4_core: Roll back round robin bitmap allocation commit for CQs, SRQs, and MPTs
Commit
f4ec9e9 "mlx4_core: Change bitmap allocator to work in round-robin fashion"
introduced round-robin allocation (via bitmap) for all resources which allocate
via a bitmap.
Round robin allocation is desirable for mcgs, counters, pd's, UARs, and xrcds.
These are simply numbers, with no involvement of ICM memory mapping.
Round robin is required for QPs, since we had a problem with immediate
reuse of a 24-bit QP number (commit
f4ec9e9).
However, for other resources which use the bitmap allocator and involve
mapping ICM memory -- MPTs, CQs, SRQs -- round-robin is not desirable.
What happens in these cases is the following:
ICM memory is allocated and mapped in chunks of 256K.
Since the resource allocation index goes up monotonically, the allocator
will eventually require mapping a new chunk. Now, chunks are also unmapped
when their reference count goes back to zero. Thus, if a single app is
running and starts/exits frequently we will have the following situation:
When the app starts, a new chunk must be allocated and mapped.
When the app exits, the chunk reference count goes back to zero, and the
chunk is unmapped and freed. Therefore, the app must pay the cost of allocation
and mapping of ICM memory each time it runs (although the price is paid only when
allocating the initial entry in the new chunk).
For apps which allocate MPTs/SRQs/CQs and which operate as described above,
this presented a performance problem.
We therefore roll back the round-robin allocator modification for MPTs, CQs, SRQs.
Reported-by: Matthew Finlay <matt@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florent Fourcot [Sun, 8 Dec 2013 14:47:01 +0000 (15:47 +0100)]
ipv6: use ip6_flowinfo helper
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florent Fourcot [Sun, 8 Dec 2013 14:47:00 +0000 (15:47 +0100)]
ipv6: add ip6_flowlabel helper
And use it if possible.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florent Fourcot [Sun, 8 Dec 2013 14:46:59 +0000 (15:46 +0100)]
ipv6: remove rcv_tclass of ipv6_pinfo
tclass information in now already stored in rcv_flowinfo
We do not need to store the same information twice.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florent Fourcot [Sun, 8 Dec 2013 14:46:58 +0000 (15:46 +0100)]
ipv6: move IPV6_TCLASS_MASK definition in ipv6.h
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florent Fourcot [Sun, 8 Dec 2013 14:46:57 +0000 (15:46 +0100)]
ipv6: add flowinfo for tcp6 pkt_options for all cases
The current implementation of IPV6_FLOWINFO only gives a
result if pktoptions is available (thanks to the
ip6_datagram_recv_ctl function).
It gives inconsistent results to user space, sometimes
there is a result for getsockopt(IPV6_FLOWINFO), sometimes
not.
This patch add rcv_flowinfo to store it, and return it to
the userspace in the same way than other pkt_options.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafał Miłecki [Fri, 6 Dec 2013 23:53:55 +0000 (00:53 +0100)]
bgmac: connect to PHY and make use of PHY device
We were already registering MDIO bus, but we were not connecting bgmac
to the PHY. Add proper call and implement adjust link function to switch
MAC into requested state.
At the same time it's possible to drop our internal PHY management.
This is a "standard" PHY, so the "Generic PHY" driver works perfectly
fine with this. Don't duplicate the code.
Finally make use of phy_ethtool_[gs]set functions instead implementing
them from scratch.
This change was successfully tested on BCM5357. I was able to
autonegotiate 1000Mb/s full duplex, as well as force any of the
10/100/1000 half/full modes.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Fri, 6 Dec 2013 23:44:21 +0000 (15:44 -0800)]
etherdevice: Optimize a few is_<foo>_ether_addr functions
If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set,
several is_<foo>_ether_addr functions can be slightly
improved by using u32 dereferences.
I believe all current uses of is_zero_ether_addr and
is_broadcast_ether_addr are u16 aligned, so always use
u16 references to improve those functions performance.
Document the u16 alignment requirements.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Fri, 6 Dec 2013 22:39:46 +0000 (14:39 -0800)]
batadv: Slight optimization of batadv_compare_eth
Use the newly added generic routine ether_addr_equal_unaligned
to test if possibly unaligned to u16 Ethernet addresses are equal.
This slightly improves comparison time for systems with
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Fri, 6 Dec 2013 22:21:01 +0000 (14:21 -0800)]
etherdevice: Add ether_addr_equal_unaligned
Add a generic routine to test if possibly unaligned
to u16 Ethernet addresses are equal.
If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set,
this uses the slightly faster generic routine
ether_addr_equal, otherwise this uses memcmp.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Dec 2013 01:56:27 +0000 (20:56 -0500)]
Merge branch 'neigh'
Jiri Pirko says:
====================
neigh: respect default parms values
This is a long standing regression. But since the patchset is bigger and
the regression happened in 2007, I'm proposing this to net-next instead.
Basically the problem is that if user wants to use /etc/sysctl.conf to specify
default values of neigh related params, he is not able to do that.
The reason is that the default values are copied to dev instance right after
netdev is registered. And that is way to early. The original behaviour
for ipv4 was that this happened after first address was assigned to device.
For ipv6 this was apparently from the very beginning.
So this patchset basically reverts the behaviour back to what it was in 2007 for
ipv4 and changes the behaviour for ipv6 so they are both the same.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 7 Dec 2013 18:26:57 +0000 (19:26 +0100)]
neigh: ipv6: respect default values set before an address is assigned to device
Make the behaviour similar to ipv4. This will allow user to set sysctl
default neigh param values and these values will be respected even by
devices registered before (that ones what do not have address set yet).
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 7 Dec 2013 18:26:56 +0000 (19:26 +0100)]
neigh: restore old behaviour of default parms values
Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.
Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).
This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.
Introduced by:
commit
8030f54499925d073a88c09f30d5d844fb1b3190
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu Feb 22 01:53:47 2007 +0900
[IPV4] devinet: Register inetdev earlier.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 7 Dec 2013 18:26:55 +0000 (19:26 +0100)]
neigh: use tbl->family to distinguish ipv4 from ipv6
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 7 Dec 2013 18:26:54 +0000 (19:26 +0100)]
neigh: wrap proc dointvec functions
This will be needed later on to provide better management of default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 7 Dec 2013 18:26:53 +0000 (19:26 +0100)]
neigh: convert parms to an array
This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Dec 2013 01:39:05 +0000 (20:39 -0500)]
Merge branch 'phy_reset'
Florian Fainelli says:
====================
net: phy: consolidate PHY reset
This patchset consolidates the PHY reset through the MII BMCR
register by using a central place were this is done.
This patchset resumes the work Kyle Moffett started here:
https://lkml.org/lkml/2011/10/20/301
Note that at this point, drivers doing funky things after issuing
a PHY reset using phy_init_hw() will still suffer from PHY state
machine problems, this will be taken care of later on.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:38 +0000 (13:01 -0800)]
net: sh_eth: do not issue a wild PHY reset through BMCR
The sh_eth driver issues an uncontrolled PHY reset through the MII
register BMCR but fails to wait for the reset to complete, and will also
implicitely wipe out all possible PHY fixups applied. Use phy_init_hw()
which remedies both problems.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:37 +0000 (13:01 -0800)]
net: tc35815: use phy_init_hw for PHY reset
Instead of open-coding the PHY reset through MII BMCR, use phy_init_hw()
which does that for us and also makes sure that any PHY specific fixups
are applied.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:36 +0000 (13:01 -0800)]
net: pxa168_eth: use phy_init_hw for PHY reset
Instead of open-coding a PHY reset through the MII BMCR register, use
phy_init_hw() which does this for us and ensures that PHY device fixups
are also applied. We also remove a call to ethernet_phy_reset() which is
now unncessary since phy_attach() calls phy_attach_direct() which in
turns calls phy_init_hw().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:35 +0000 (13:01 -0800)]
net: mv643xx_eth: use phy_init_hw to reset PHY
Instead of open-coding a PHY reset through the MII BMCR register, use
phy_init_hw() which does that for us and will also make sure that PHY
fixups are applied if required. We also remove a call to phy_reset()
due to the following sequence of calls in the driver:
phy_scan()
-> phy_connect()
-> phy_connect_direct()
-> phy_attach_direct()
-> phy_init_hw()
and we only have a call to phy_init() after phy_scan().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:34 +0000 (13:01 -0800)]
net: phy: consolidate PHY reset in phy_init_hw()
There are quite a lot of drivers touching a PHY device MII_BMCR
register to reset the PHY without taking care of:
1) ensuring that BMCR_RESET is cleared after a given timeout
2) the PHY state machine resuming to the proper state and re-applying
potentially changed settings such as auto-negotiation
Introduce phy_poll_reset() which will take care of polling the MII_BMCR
for the BMCR_RESET bit to be cleared after a given timeout or return a
timeout error code.
In order to make sure the PHY is in a correct state, phy_init_hw() first
issues a software reset through MII_BMCR and then applies any fixups.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:33 +0000 (13:01 -0800)]
net: bfin_mac: do not reset PHY after phy_start()
The PHY is already reset during driver probing, and this manual reset
after calling phy_start() will wipe out board-specific PHY fixups and
driver specific configuration initialization. Remove that explicit PHY
reset.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:32 +0000 (13:01 -0800)]
net: greth: use phy_read_status()
In case the greth driver is bound to anything but the Generic PHY
driver or the PHY has a special read_status callback implemented,
unexpected things will happen. Make sure we that we use
phy_read_status() which does the proper abstraction of calling the
driver specific read_status() callback for a given PHY.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:31 +0000 (13:01 -0800)]
net: phy: use phy_init_hw instead of open-coding it
Use phy_init_hw() instead of open-coding it in phy_mii_ioctl(), this
improves consistenty and makes sure that we will not duplicate the same
routine somewhere else.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 6 Dec 2013 21:01:30 +0000 (13:01 -0800)]
net: phy: report link partner features through ethtool
The PHY library already reads the MII_STAT1000 and MII_LPA registers in
genphy_read_status(), so extend it to also populate the PHY device link
partner advertised features such that we can feed this back into ethtool
when asked for it in phy_ethtool_gset().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:55:00 +0000 (04:55 +0800)]
tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()
By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless codes in both above functions.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:54:59 +0000 (04:54 +0800)]
macvtap: remove useless codes in macvtap_aio_read() and macvtap_recvmsg()
By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless coeds in both above functions.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Durrant [Fri, 6 Dec 2013 16:36:07 +0000 (16:36 +0000)]
xen-netback: improve guest-receive-side flow control
The way that flow control works without this patch is that, in start_xmit()
the code uses xenvif_count_skb_slots() to predict how many slots
xenvif_gop_skb() will consume and then adds this to a 'req_cons_peek'
counter which it then uses to determine if the shared ring has that amount
of space available by checking whether 'req_prod' has passed that value.
If the ring doesn't have space the tx queue is stopped.
xenvif_gop_skb() will then consume slots and update 'req_cons' and issue
responses, updating 'rsp_prod' as it goes. The frontend will consume those
responses and post new requests, by updating req_prod. So, req_prod chases
req_cons which chases rsp_prod, and can never exceed that value. Thus if
xenvif_count_skb_slots() ever returns a number of slots greater than
xenvif_gop_skb() uses, req_cons_peek will get to a value that req_prod cannot
possibly achieve (since it's limited by the 'real' req_cons) and, if this
happens enough times, req_cons_peek gets more than a ring size ahead of
req_cons and the tx queue then remains stopped forever waiting for an
unachievable amount of space to become available in the ring.
Having two routines trying to calculate the same value is always going to be
fragile, so this patch does away with that. All we essentially need to do is
make sure that we have 'enough stuff' on our internal queue without letting
it build up uncontrollably. So start_xmit() makes a cheap optimistic check
of how much space is needed for an skb and only turns the queue off if that
is unachievable. net_rx_action() is the place where we could do with an
accurate predicition but, since that has proven tricky to calculate, a cheap
worse-case (but not too bad) estimate is all we really need since the only
thing we *must* prevent is xenvif_gop_skb() consuming more slots than are
available.
Without this patch I can trivially stall netback permanently by just doing
a large guest to guest file copy between two Windows Server 2008R2 VMs on a
single host.
Patch tested with frontends in:
- Windows Server 2008R2
- CentOS 6.0
- Debian Squeeze
- Debian Wheezy
- SLES11
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Fri, 6 Dec 2013 15:08:00 +0000 (10:08 -0500)]
tipc: remove interface state mirroring in bearer
struct 'tipc_bearer' is a generic representation of the underlying
media type, and exists in a one-to-one relationship to each interface
TIPC is using. The struct contains a 'blocked' flag that mirrors the
operational and execution state of the represented interface, and is
updated through notification calls from the latter. The users of
tipc_bearer are checking this flag before each attempt to send a
packet via the interface.
This state mirroring serves no purpose in the current code base. TIPC
links will not discover a media failure any faster through this
mechanism, and in reality the flag only adds overhead at packet
sending and reception.
Furthermore, the fact that the flag needs to be protected by a spinlock
aggregated into tipc_bearer has turned out to cause a serious and
completely unnecessary deadlock problem.
CPU0 CPU1
---- ----
Time 0: bearer_disable() link_timeout()
Time 1: spin_lock_bh(&b_ptr->lock) tipc_link_push_queue()
Time 2: tipc_link_delete() tipc_bearer_blocked(b_ptr)
Time 3: k_cancel_timer(&req->timer) spin_lock_bh(&b_ptr->lock)
Time 4: del_timer_sync(&req->timer)
I.e., del_timer_sync() on CPU0 never returns, because the timer handler
on CPU1 is waiting for the bearer lock.
We eliminate the 'blocked' flag from struct tipc_bearer, along with all
tests on this flag. This not only resolves the deadlock, but also
simplifies and speeds up the data path execution of TIPC. It also fits
well into our ongoing effort to make the locking policy simpler and
more manageable.
An effect of this change is that we can get rid of functions such as
tipc_bearer_blocked(), tipc_continue() and tipc_block_bearer().
We replace the latter with a new function, tipc_reset_bearer(), which
resets all links associated to the bearer immediately after an
interface goes down.
A user might notice one slight change in link behaviour after this
change. When an interface goes down, (e.g. through a NETDEV_DOWN
event) all attached links will be reset immediately, instead of
leaving it to each link to detect the failure through a timer-driven
mechanism. We consider this an improvement, and see no obvious risks
with the new behavior.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <Paul.Gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 11:24:33 +0000 (19:24 +0800)]
x25: convert printks to pr_<level>
use pr_<level> instead of printk(LEVEL)
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 6 Dec 2013 10:36:17 +0000 (11:36 +0100)]
packet: introduce PACKET_QDISC_BYPASS socket option
This patch introduces a PACKET_QDISC_BYPASS socket option, that
allows for using a similar xmit() function as in pktgen instead
of taking the dev_queue_xmit() path. This can be very useful when
PF_PACKET applications are required to be used in a similar
scenario as pktgen, but with full, flexible packet payload that
needs to be provided, for example.
On default, nothing changes in behaviour for normal PF_PACKET
TX users, so everything stays as is for applications. New users,
however, can now set PACKET_QDISC_BYPASS if needed to prevent
own packets from i) reentering packet_rcv() and ii) to directly
push the frame to the driver.
In doing so we can increase pps (here 64 byte packets) for
PF_PACKET a bit:
# CPUs -- QDISC_BYPASS -- qdisc path -- qdisc path[**]
1 CPU == 1,509,628 pps -- 1,208,708 -- 1,247,436
2 CPUs == 3,198,659 pps -- 2,536,012 -- 1,605,779
3 CPUs == 4,787,992 pps -- 3,788,740 -- 1,735,610
4 CPUs == 6,173,956 pps -- 4,907,799 -- 1,909,114
5 CPUs == 7,495,676 pps -- 5,956,499 -- 2,014,422
6 CPUs == 9,001,496 pps -- 7,145,064 -- 2,155,261
7 CPUs == 10,229,776 pps -- 8,190,596 -- 2,220,619
8 CPUs == 11,040,732 pps -- 9,188,544 -- 2,241,879
9 CPUs == 12,009,076 pps -- 10,275,936 -- 2,068,447
10 CPUs == 11,380,052 pps -- 11,265,337 -- 1,578,689
11 CPUs == 11,672,676 pps -- 11,845,344 -- 1,297,412
[...]
20 CPUs == 11,363,192 pps -- 11,014,933 -- 1,245,081
[**]: qdisc path with packet_rcv(), how probably most people
seem to use it (hopefully not anymore if not needed)
The test was done using a modified trafgen, sending a simple
static 64 bytes packet, on all CPUs. The trick in the fast
"qdisc path" case, is to avoid reentering packet_rcv() by
setting the RAW socket protocol to zero, like:
socket(PF_PACKET, SOCK_RAW, 0);
Tradeoffs are documented as well in this patch, clearly, if
queues are busy, we will drop more packets, tc disciplines are
ignored, and these packets are not visible to taps anymore. For
a pktgen like scenario, we argue that this is acceptable.
The pointer to the xmit function has been placed in packet
socket structure hole between cached_dev and prot_hook that
is hot anyway as we're working on cached_dev in each send path.
Done in joint work together with Jesper Dangaard Brouer.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 6 Dec 2013 10:36:16 +0000 (11:36 +0100)]
net: dev: move inline skb_needs_linearize helper to header
As we need it elsewhere, move the inline helper function of
skb_needs_linearize() over to skbuff.h include file. While
at it, also convert the return to 'bool' instead of 'int'
and add a proper kernel doc.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Dec 2013 01:20:14 +0000 (20:20 -0500)]
Merge git://git./linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 6 Dec 2013 10:36:15 +0000 (11:36 +0100)]
packet: fix send path when running with proto == 0
Commit
e40526cb20b5 introduced a cached dev pointer, that gets
hooked into register_prot_hook(), __unregister_prot_hook() to
update the device used for the send path.
We need to fix this up, as otherwise this will not work with
sockets created with protocol = 0, plus with sll_protocol = 0
passed via sockaddr_ll when doing the bind.
So instead, assign the pointer directly. The compiler can inline
these helper functions automagically.
While at it, also assume the cached dev fast-path as likely(),
and document this variant of socket creation as it seems it is
not widely used (seems not even the author of TX_RING was aware
of that in his reference example [1]). Tested with reproducer
from
e40526cb20b5.
[1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
Fixes: e40526cb20b5 ("packet: fix use after free race in send path when dev is released")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 5 Dec 2013 19:12:02 +0000 (11:12 -0800)]
pkt_sched: give visibility to mq slave qdiscs
Commit
6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
added the ability to change default qdisc from pfifo_fast to say fq
But as most modern ethernet devices are multiqueue, we cant really
see all the statistics from "tc -s qdisc show", as the default root
qdisc is mq.
This patch adds the calls to qdisc_list_add() to mq and mqprio
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Dec 2013 00:21:31 +0000 (19:21 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only.
Jacob provides a i40e patch to get 1588 work correctly by separating
TSYNVALID and TSYNINDX fields in the receive descriptor.
Jesse provides several i40e patches, first to correct the checking
of the multi-bit state. The hash is reported correctly in the RSS
field if and only if the filter status is 3. Other values of the
filter status mean different things and we should not depend on a
bitwise result. Then provides a patch to enable a couple of
workarounds based on revision ID that allow the driver to work
more fully on early hardware.
Shannon provides several i40e patches as well. First sets the media
type in the hardware structure based on the external connection type.
Then provides a patch to only setup the rings that will be used. Lastly
provides a fix where the TESTING state was still set when exiting the
ethtool diagnostics.
Kevin Scott provides one i40e patch to add a new flag to the i40e_add_veb()
which allows the driver to request the hardware to filter on layer 2
parameters.
Anjali provides four i40e patches, first refactors the reset code in
order to re-size queues and vectors while the interface is still up.
Then provides a patch to enable all PCTYPEs expect FCoE for RSS. Adds
a message to notify the user of how many VFs are initialized on each
port. Lastly adds a new variable to track the number of PF instances,
this is a global counter on purpose so that each PF loaded has a
unique ID.
Catherine bumps the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:32:10 +0000 (12:32 +0900)]
wan: wanxl: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:31:46 +0000 (12:31 +0900)]
wan: pci200syn: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:31:12 +0000 (12:31 +0900)]
wan: pc300too: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:30:41 +0000 (12:30 +0900)]
wan: lmc: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:29:46 +0000 (12:29 +0900)]
wan: dscc4: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:29:08 +0000 (12:29 +0900)]
irda: vlsi_ir: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:28:28 +0000 (12:28 +0900)]
irda: via-ircc: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:27:50 +0000 (12:27 +0900)]
net: forcedeth: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:27:27 +0000 (12:27 +0900)]
net: ns83820: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:26:44 +0000 (12:26 +0900)]
net: bna: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:25:44 +0000 (12:25 +0900)]
net: sis900: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Mon, 9 Dec 2013 03:24:07 +0000 (12:24 +0900)]
net: sfc: remove unnecessary pci_set_drvdata()
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Anjali Singhai Jain [Wed, 20 Nov 2013 10:03:01 +0000 (10:03 +0000)]
i40e: Add a new variable to track number of pf instances
Track the number of physical functions (PFs) found, this is a global counter
on purpose so that each pf loaded has a unique ID.
Change-Id: I74d618520afbce4a774d0235449e3b5f97ff6d4a
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Wed, 20 Nov 2013 10:03:00 +0000 (10:03 +0000)]
i40e: add num_VFs message
Print a message to notify the user of how many VFs are initialized on each
port.
Change-Id: I29ac2acc478ee4e588fd6ffcc35133d4c6607ca9
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 20 Nov 2013 10:02:59 +0000 (10:02 +0000)]
i40e: refactor ethtool tests
Put the print and reset statements in the actual test functions to make
them more self-contained, and only run the reset for tests that need it.
Change-Id: Ic70f49b11bf8bae82e59d8fd25b46215c90c4510
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 20 Nov 2013 10:02:58 +0000 (10:02 +0000)]
i40e: clear test state bit after all ethtool tests
Fix a bug where the TESTING state was still set when
exiting the ethtool diagnostics.
Change-Id: Ic47950d2e86a67167d1d282256d477cecd86d820
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 20 Nov 2013 10:02:57 +0000 (10:02 +0000)]
i40e: only set up the rings to be used
The VSI may be allocated more queues (alloc_queue_pairs) than actually
are to be used (num_queue_pairs), so only allocate rings for the queues
to be used. The numbers will likely be the same for most VSIs, but can
be different based on how TCs are assigned and enabled.
Change-Id: Ie40f7ad0affbc4b45d6f049bcf02ee2fa24edc74
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Wed, 20 Nov 2013 10:02:56 +0000 (10:02 +0000)]
i40e: Enable all PCTYPEs except FCOE for RSS.
RSS can steer packets based on recognition of all
sorts of different headers. Enable some more of them.
Change-Id: I2264dedae66fb0bceca6fb6e772e050e3ca8efc8
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anjali Singhai Jain [Wed, 20 Nov 2013 10:02:55 +0000 (10:02 +0000)]
i40e: refactor reset code
In order to re-size queues and vectors while the interface is
still up, we need to be able to call functions to free and
re-allocate without bringing down the VSI.
We also need to reset the existing setup, update the
configuration and then rebuild again. This requires us to have
the reset flow broken down into two parts.
Change-Id: I374dd25aabf769decda69b676491c7b7730a4635
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Catherine Sullivan [Wed, 20 Nov 2013 10:02:54 +0000 (10:02 +0000)]
i40e: Bump version
Update the driver version to 0.3.12-k
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jeff Kirsher [Wed, 20 Nov 2013 10:02:53 +0000 (10:02 +0000)]
i40e: whitespace
Whitespace fixes
Change-Id: I95f4d02e4a2a92d6b6fca3ae2b7865c4b916a9bb
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Jesse Brandeburg [Tue, 26 Nov 2013 08:56:05 +0000 (08:56 +0000)]
i40e: enable early hardware support
Enable a couple of workarounds based on revision ID that allow the
driver to work more fully on early hardware.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Wed, 20 Nov 2013 10:02:51 +0000 (10:02 +0000)]
i40e: Add flag for L2 VEB filtering
Add a new flag to the add VEB command which allows the
driver to request the hardware to filter on L2 parameters.
This is an implementation of the driver access to a new firmware
feature.
Change-Id: Id61d3cad4125bdc68b8fd9d555c448a10c344b6b
Signed-off-by: Kevin Scott <kevin.c.scott@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Wed, 20 Nov 2013 10:02:50 +0000 (10:02 +0000)]
i40e: get media type during link info
Set the media type in the hardware structure, based
on the external connection type.
Add Direct Attach to the type of media reported by ethtool.
Change-Id: I4ad2f5bf882766d6e737fac4477abf049491b3b3
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Wed, 20 Nov 2013 10:02:49 +0000 (10:02 +0000)]
i40e: check multi-bit state correctly
The hash is reported correctly in the rss field if and only if
the filter status is 3. Other values of filter status mean
different things and we shouldn't depend on a bitwise result.
The issue was that
a & b --> returns true for b={1,2,3}
the fix is
a & b == b
Also refactor this function to use constant operations because we
are in fast path.
Change-Id: I4e29be87439c1cf8b60bc31bea29dff89596c013
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 20 Nov 2013 10:02:48 +0000 (10:02 +0000)]
i40e: separate TSYNVALID and TSYNINDX fields in Rx descriptor
In order to get 1588 to work correctly the defines need a bit
of a tweak.
Change-Id: Ie50ce2a18e1593441f1560411e5a4f51c6d48aaa
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Joe Perches [Thu, 5 Dec 2013 22:54:38 +0000 (14:54 -0800)]
ether_addr_equal: Optimize implementation, remove unused compare_ether_addr
Add a new check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to reduce
the number of or's used in the ether_addr_equal comparison to very
slightly improve function performance.
Simplify the ether_addr_equal_64bits implementation.
Integrate and remove the zap_last_2bytes helper as it's now
used only once.
Remove the now unused compare_ether_addr function.
Update the unaligned-memory-access documentation to remove the
compare_ether_addr description and show how unaligned accesses
could occur with ether_addr_equal.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 10:03:36 +0000 (18:03 +0800)]
unix: convert printks to pr_<level>
use pr_<level> instead of printk(LEVEL)
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 6 Dec 2013 08:45:22 +0000 (09:45 +0100)]
ipv6 addrconf: introduce IFA_F_MANAGETEMPADDR to tell kernel to manage temporary addresses
Creating an address with this flag set will result in kernel taking care
of temporary addresses in the same way as if the address was created by
kernel itself (after RA receive). This allows userspace applications
implementing the autoconfiguration (NetworkManager for example) to
implement ipv6 addresses privacy.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 6 Dec 2013 08:45:21 +0000 (09:45 +0100)]
ipv6 addrconf: extend ifa_flags to u32
There is no more space in u8 ifa_flags. So do what davem suffested and
add another netlink attr called IFA_FLAGS for carry more flags.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Dalton [Thu, 5 Dec 2013 21:14:05 +0000 (13:14 -0800)]
virtio-net: free bufs correctly on invalid packet length
When a packet with invalid length arrives, ensure that the packet
is freed correctly if mergeable packet buffers and big packets
(GUEST_TSO4) are both enabled.
Signed-off-by: Michael Dalton <mwdalton@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ezequiel Garcia [Thu, 5 Dec 2013 16:35:37 +0000 (13:35 -0300)]
net: mvneta: Fix incorrect DMA unmapping size
The current code unmaps the DMA mapping created for rx skb_buff's by
using the data_size as the the mapping size. This is wrong since the
correct size to specify should match the size used to create the mapping.
This commit removes the following DMA_API_DEBUG warning:
------------[ cut here ]------------
WARNING: at lib/dma-debug.c:887 check_unmap+0x3a8/0x860()
mvneta
d0070000.ethernet: DMA-API: device driver frees DMA memory with different size [device address=0x000000002eb80000] [map size=1600 bytes] [unmap size=66 bytes]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted
3.10.21-01444-ga88ae13-dirty #92
[<
c0013600>] (unwind_backtrace+0x0/0xf8) from [<
c0010fb8>] (show_stack+0x10/0x14)
[<
c0010fb8>] (show_stack+0x10/0x14) from [<
c001afa0>] (warn_slowpath_common+0x48/0x68)
[<
c001afa0>] (warn_slowpath_common+0x48/0x68) from [<
c001b01c>] (warn_slowpath_fmt+0x30/0x40)
[<
c001b01c>] (warn_slowpath_fmt+0x30/0x40) from [<
c018d0fc>] (check_unmap+0x3a8/0x860)
[<
c018d0fc>] (check_unmap+0x3a8/0x860) from [<
c018d734>] (debug_dma_unmap_page+0x64/0x70)
[<
c018d734>] (debug_dma_unmap_page+0x64/0x70) from [<
c0233f78>] (mvneta_rx+0xec/0x468)
[<
c0233f78>] (mvneta_rx+0xec/0x468) from [<
c023436c>] (mvneta_poll+0x78/0x16c)
[<
c023436c>] (mvneta_poll+0x78/0x16c) from [<
c02db468>] (net_rx_action+0x94/0x160)
[<
c02db468>] (net_rx_action+0x94/0x160) from [<
c0021e68>] (__do_softirq+0xe8/0x1d0)
[<
c0021e68>] (__do_softirq+0xe8/0x1d0) from [<
c0021ff8>] (do_softirq+0x4c/0x58)
[<
c0021ff8>] (do_softirq+0x4c/0x58) from [<
c0022228>] (irq_exit+0x58/0x90)
[<
c0022228>] (irq_exit+0x58/0x90) from [<
c000e7c8>] (handle_IRQ+0x3c/0x94)
[<
c000e7c8>] (handle_IRQ+0x3c/0x94) from [<
c0008548>] (armada_370_xp_handle_irq+0x4c/0xb4)
[<
c0008548>] (armada_370_xp_handle_irq+0x4c/0xb4) from [<
c000dc20>] (__irq_svc+0x40/0x50)
Exception stack(0xc04f1f70 to 0xc04f1fb8)
1f60:
c1fe46f8 00000000 00001d92 00001d92
1f80:
c04f0000 c04f0000 c04f84a4 c03e081c c05220e7 00000001 c05220e7 c04f0000
1fa0:
00000000 c04f1fb8 c000eaf8 c004c048 60000113 ffffffff
[<
c000dc20>] (__irq_svc+0x40/0x50) from [<
c004c048>] (cpu_startup_entry+0x54/0x128)
[<
c004c048>] (cpu_startup_entry+0x54/0x128) from [<
c04c1a14>] (start_kernel+0x29c/0x2f0)
[<
c04c1a14>] (start_kernel+0x29c/0x2f0) from [<
00008074>] (0x8074)
---[ end trace
d4955f6acd178110 ]---
Mapped at:
[<
c018d600>] debug_dma_map_page+0x4c/0x11c
[<
c0235d6c>] mvneta_setup_rxqs+0x398/0x598
[<
c0236084>] mvneta_open+0x40/0x17c
[<
c02dbbd4>] __dev_open+0x9c/0x100
[<
c02dbe58>] __dev_change_flags+0x7c/0x134
Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 5 Dec 2013 15:27:37 +0000 (16:27 +0100)]
br: fix use of ->rx_handler_data in code executed on non-rx_handler path
br_stp_rcv() is reached by non-rx_handler path. That means there is no
guarantee that dev is bridge port and therefore simple NULL check of
->rx_handler_data is not enough. There is need to check if dev is really
bridge port and since only rcu read lock is held here, do it by checking
->rx_handler pointer.
Note that synchronize_net() in netdev_rx_handler_unregister() ensures
this approach as valid.
Introduced originally by:
commit
f350a0a87374418635689471606454abc7beaa3a
"bridge: use rx_handler_data pointer to store net_bridge_port pointer"
Fixed but not in the best way by:
commit
b5ed54e94d324f17c97852296d61a143f01b227a
"bridge: fix RCU races with bridge port"
Reintroduced by:
commit
716ec052d2280d511e10e90ad54a86f5b5d4dcc2
"bridge: fix NULL pointer deref of br_port_get_rcu"
Please apply to stable trees as well. Thanks.
RH bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=
1025770
Reported-by: Laine Stump <laine@redhat.com>
Debugged-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Thu, 5 Dec 2013 14:36:21 +0000 (18:36 +0400)]
virtio: delete napi structures from netdev before releasing memory
free_netdev calls netif_napi_del too, but it's too late, because napi
structures are placed on vi->rq. netif_napi_add() is called from
virtnet_alloc_queues.
general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables virtio_balloon pcspkr virtio_net(-) i2c_pii
CPU: 1 PID: 347 Comm: rmmod Not tainted 3.13.0-rc2+ #171
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task:
ffff8800b779c420 ti:
ffff8800379e0000 task.ti:
ffff8800379e0000
RIP: 0010:[<
ffffffff81322e19>] [<
ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP: 0018:
ffff8800379e1dd0 EFLAGS:
00010a83
RAX:
6b6b6b6b6b6b6b6b RBX:
ffff8800379c2fd0 RCX:
dead000000200200
RDX:
6b6b6b6b6b6b6b6b RSI:
0000000000000001 RDI:
ffff8800379c2fd0
RBP:
ffff8800379e1dd0 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000001 R12:
ffff8800379c2f90
R13:
ffff880037839160 R14:
0000000000000000 R15:
00000000013352f0
FS:
00007f1400e34740(0000) GS:
ffff8800bfb00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007f464124c763 CR3:
00000000b68cf000 CR4:
00000000000006e0
Stack:
ffff8800379e1df0 ffffffff8155beab 6b6b6b6b6b6b6b2b ffff8800378391c0
ffff8800379e1e18 ffffffff8156499b ffff880037839be0 ffff880037839d20
ffff88003779d3f0 ffff8800379e1e38 ffffffffa003477c ffff88003779d388
Call Trace:
[<
ffffffff8155beab>] netif_napi_del+0x1b/0x80
[<
ffffffff8156499b>] free_netdev+0x8b/0x110
[<
ffffffffa003477c>] virtnet_remove+0x7c/0x90 [virtio_net]
[<
ffffffff813ae323>] virtio_dev_remove+0x23/0x80
[<
ffffffff813f62ef>] __device_release_driver+0x7f/0xf0
[<
ffffffff813f6ca0>] driver_detach+0xc0/0xd0
[<
ffffffff813f5f28>] bus_remove_driver+0x58/0xd0
[<
ffffffff813f72ec>] driver_unregister+0x2c/0x50
[<
ffffffff813ae65e>] unregister_virtio_driver+0xe/0x10
[<
ffffffffa0036942>] virtio_net_driver_exit+0x10/0x6ce [virtio_net]
[<
ffffffff810d7cf2>] SyS_delete_module+0x172/0x220
[<
ffffffff810a732d>] ? trace_hardirqs_on+0xd/0x10
[<
ffffffff810f5d4c>] ? __audit_syscall_entry+0x9c/0xf0
[<
ffffffff81677f69>] system_call_fastpath+0x16/0x1b
Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00
RIP [<
ffffffff81322e19>] __list_del_entry+0x29/0xd0
RSP <
ffff8800379e1dd0>
---[ end trace
d5931cd3f87c9763 ]---
Fixes: 986a4f4d452d (virtio_net: multiqueue support)
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Thu, 5 Dec 2013 14:36:20 +0000 (18:36 +0400)]
virtio-net: determine type of bufs correctly
free_unused_bufs must check vi->mergeable_rx_bufs before
vi->big_packets, because we use this sequence in other places.
Otherwise we allocate buffer of one type, then free it as another
type.
general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables pcspkr virtio_balloon virtio_net(-) i2c_pii
CPU: 0 PID: 400 Comm: rmmod Not tainted 3.13.0-rc2+ #170
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task:
ffff8800b6d2a210 ti:
ffff8800aed32000 task.ti:
ffff8800aed32000
RIP: 0010:[<
ffffffffa00345f3>] [<
ffffffffa00345f3>] free_unused_bufs+0xc3/0x190 [virtio_net]
RSP: 0018:
ffff8800aed33dd8 EFLAGS:
00010202
RAX:
ffff8800b1fe2c00 RBX:
ffff8800b66a7240 RCX:
6b6b6b6b6b6b6b6b
RDX:
6b6b6b6b6b6b6b6b RSI:
ffff8800b8419a68 RDI:
ffff8800b66a1148
RBP:
ffff8800aed33e00 R08:
0000000000000001 R09:
0000000000000000
R10:
0000000000000000 R11:
0000000000000001 R12:
0000000000000000
R13:
ffff8800b66a1148 R14:
0000000000000000 R15:
000077ff80000000
FS:
00007fc4f9c4e740(0000) GS:
ffff8800bfa00000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
CR2:
00007f63f432f000 CR3:
00000000b6538000 CR4:
00000000000006f0
Stack:
ffff8800b66a7240 ffff8800b66a7380 ffff8800377bd3f0 0000000000000000
00000000023302f0 ffff8800aed33e18 ffffffffa00346e2 ffff8800b66a7240
ffff8800aed33e38 ffffffffa003474d ffff8800377bd388 ffff8800377bd390
Call Trace:
[<
ffffffffa00346e2>] remove_vq_common+0x22/0x40 [virtio_net]
[<
ffffffffa003474d>] virtnet_remove+0x4d/0x90 [virtio_net]
[<
ffffffff813ae303>] virtio_dev_remove+0x23/0x80
[<
ffffffff813f62cf>] __device_release_driver+0x7f/0xf0
[<
ffffffff813f6c80>] driver_detach+0xc0/0xd0
[<
ffffffff813f5f08>] bus_remove_driver+0x58/0xd0
[<
ffffffff813f72cc>] driver_unregister+0x2c/0x50
[<
ffffffff813ae63e>] unregister_virtio_driver+0xe/0x10
[<
ffffffffa0036852>] virtio_net_driver_exit+0x10/0x7be [virtio_net]
[<
ffffffff810d7cf2>] SyS_delete_module+0x172/0x220
[<
ffffffff810a732d>] ? trace_hardirqs_on+0xd/0x10
[<
ffffffff810f5d4c>] ? __audit_syscall_entry+0x9c/0xf0
[<
ffffffff81677f69>] system_call_fastpath+0x16/0x1b
Code: c0 74 55 0f 1f 44 00 00 80 7b 30 00 74 7a 48 8b 50 30 4c 89 e6 48 03 73 20 48 85 d2 0f 84 bb 00 00 00 66 0f
RIP [<
ffffffffa00345f3>] free_unused_bufs+0xc3/0x190 [virtio_net]
RSP <
ffff8800aed33dd8>
---[ end trace
edb570ea923cce9c ]---
Fixes: 2613af0ed18a (virtio_net: migrate mergeable rx buffers to page frag allocators)
Cc: Michael Dalton <mwdalton@google.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 5 Dec 2013 12:45:08 +0000 (04:45 -0800)]
net: introduce dev_consume_skb_any()
Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
helpers to free skbs, both for dropped packets and TX completed ones.
We need to separate the two causes to get better diagnostics
given by dropwatch or "perf record -e skb:kfree_skb"
This patch provides two new helpers, dev_consume_skb_any() and
dev_consume_skb_irq() to be used for consumed skbs.
__dev_kfree_skb_irq() is slightly optimized to remove one
atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:13:06 +0000 (04:13 +0800)]
tun: remove unused parameter in tun_do_read()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:13:05 +0000 (04:13 +0800)]
macvtap: remove unused parameter in macvtap_do_read()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:13:04 +0000 (04:13 +0800)]
macvtap: remove the dead branch
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhi Yong Wu [Fri, 6 Dec 2013 20:13:03 +0000 (04:13 +0800)]
vhost: remove the dead branch
Since vhost_dev_init() forever return 0, some branches are never run,
therefore need to be removed.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Thu, 5 Dec 2013 10:36:58 +0000 (11:36 +0100)]
bonding: fix packets_per_slave showing
There's an issue when showing the value of packets_per_slave due to
using signed integer. The value may be < 0 and thus not put through
reciprocal_value() before showing. This patch makes it use unsigned
integer when showing it.
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Fri, 6 Dec 2013 17:53:20 +0000 (09:53 -0800)]
tg3: Update version to 3.135
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Fri, 6 Dec 2013 17:53:19 +0000 (09:53 -0800)]
tg3: Expand multicast drop counter miscounting fix to 5762
commit
4d95847381228639844c7197deb8b2211274ef22 - "tg3: Workaround
rx_discards stat bug", added a workaround for miscounted statistics for
multicast packets. This fix needs to be applied to the 5762.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Fri, 6 Dec 2013 17:53:18 +0000 (09:53 -0800)]
tg3: Fix bit definition for the nvram Auto Power Down setting
The APD bit is 14 and not bit 10.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Fri, 6 Dec 2013 17:53:17 +0000 (09:53 -0800)]
tg3: Add flag to disable 1G Half Duplex advertisement
Some link partners have issues if the non-standard 1G half duplex is
advertised. This patch adds support for an nvram setting to disable the
advertisement.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Fri, 6 Dec 2013 17:53:16 +0000 (09:53 -0800)]
tg3: Don't add rxbds_empty to rx_over_errors
rxbds_empty is an informational statistic signifying that a ring full
condition was observed. It does not mean an overflow has occurred.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somnath Kotur [Thu, 5 Dec 2013 06:38:16 +0000 (12:08 +0530)]
be2net: Free/delete pmacs (in be_clear()) only if they exist
During suspend-resume and lancer error recovery we will cleanup and
re-initialize the resources through be_clear() and be_setup() respectively.
During re-initialisation in be_setup(), if be_get_config() fails, we'll again
call be_clear() which will cause a NULL pointer dereference as adapter->pmac_id is
already freed.
Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somnath Kotur [Thu, 5 Dec 2013 06:37:55 +0000 (12:07 +0530)]
be2net: Fix Lancer error recovery to distinguish FW download
The Firmware update would be detected by looking at the sliport_error1/
sliport_error2 register values(0x02/0x00). If its not a FW reset the current
messaging would take place. If the error is due to FW reset, log a message to
user that "Firmware update in progress" and also do not log sliport_status and
sliport_error register values.
Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 Dec 2013 19:57:26 +0000 (14:57 -0500)]
Merge branch 'of_mdio'
Florian Fainelli says:
====================
net: of_mdio improvements
This patchset contains a few improvements to the MDIO device tree parsing
code such as refactoring and parsing the "max-speed" property which is
defined in the ePAPR specification.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:16 +0000 (14:52 -0800)]
Documentation: update Ethernet PHY devices binding with 'max-speed'
The 'max-speed' property is optional but defined in the ePAPR
specification and now supported by the Linux Device Tree parsing
infrastructure.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:15 +0000 (14:52 -0800)]
arc_emac: remove custom "max-speed" parsing code
The ARC emac driver was the only in-tree to parse a PHY device
'max-speed' property but yet failed to do it correctly because
'max-speed' is supposed to set a PHY device supported features, not the
advertising features as it was done.
Now that of_mdiobus_register() takes care of doing that, remove the
custom 'max-speed' parsing code.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:14 +0000 (14:52 -0800)]
net: of_mdio: parse "max-speed" property to set PHY supported features
The "max-speed" property is defined per the ePAPR specification to
express the maximum speed a PHY supports. Use that property, if present
to set the phydev->supported features which properly restricts the PHY
within the range of defined speeds.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:13 +0000 (14:52 -0800)]
net: phy: breakdown PHY_*_FEATURES defines
Breakdown the PHY_*_FEATURES into per speed defines such that we can
easily re-use them individually.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:12 +0000 (14:52 -0800)]
net: of_mdio: do not overwrite PHY interrupt configuration
If irq_of_parse_and_map fails to find an interrupt line for a given PHY,
we will force the PHY interrupt to be PHY_POLL, completely overriding
the previous value that the MDIO bus may have set for us (e.g:
PHY_IGNORE_INTERRUPT). In case of failure, just restore the previous
value.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:11 +0000 (14:52 -0800)]
net: of_mdio: use PHY_MAX_ADDR constant
Use the PHY_MAX_ADDR constant for checking if a MDIO bus address is
valid instead of using a plain "32".
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 5 Dec 2013 22:52:10 +0000 (14:52 -0800)]
net: of_mdio: factor PHY registration from of_mdiobus_register
Since commit
779d835e ("net: of_mdio: scan mdiobus for PHYs without reg
property") we have two foreach loops which do pretty much the same
thing. Factor the PHY device registration in a function helper:
of_mdiobus_register_phy() which takes care of the details and allows for
future PHY specific extensions.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 01:36:30 +0000 (09:36 +0800)]
sctp: fix some comments in associola.c
fix some typos
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 01:36:29 +0000 (09:36 +0800)]
sctp: convert sctp_peer_needs_update to boolean
sctp_peer_needs_update only return 0 or 1.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 01:36:28 +0000 (09:36 +0800)]
sctp: remove the else path
Make the code more simplification.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Fri, 6 Dec 2013 01:36:27 +0000 (09:36 +0800)]
sctp: remove the duplicate initialize
kzalloc had initialize the allocated memroy. Therefore, remove the
initialize with 0 and the memset.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 Dec 2013 19:48:48 +0000 (14:48 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only.
Christopher Pau provides a patch to set pf_id based on device and
function numbers since NICs with ARI enabled can have function
numbers larger than 8.
Anjali provides 3 i40e patches to update hardware defines to keep
in sync with hardware updates.
Shannon provides the majority of i40e patches, with 7. First patch
clears the admin queue head and tail registers during admin queue
shutdown. Then simplifies the admin queue head-tail-len setups to
use more virtual registers. Provides several patches to cleanup
and fix driver load and reset procedures to make more robust. Lastly,
provides an ethtool test for interrupts using the software interrupt.
Mitch provides some i40e patches which fixes up VF code in the PF
driver, specifically the number of vectors per VF are reported by the
hardware does not include vector 0, so we need to account for this
when checking. In addition, cleans up debugging messages.
Kamil provides an i40e patch to fix the diagnostics test by restricting
the diagnostic test length.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 Dec 2013 19:25:23 +0000 (14:25 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/linville/wireless-next
John W. Linville says:
====================
Please pull this batch of updates intended for the 3.14 stream...
For the mac80211 bits, Johannes says:
"I have various improvements/cleanups/fixes all over, but the shortlog
shows that Luis's regulatory work and mesh work from the cozybit folks
are the biggest ones, along with the CSA fixes."
Along with that, we have big batches of updates to brcmfmac, rtlwifi,
and ath9k. There are updates to wcn36xx, rt2x00, and a handful of
others as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 6 Dec 2013 06:36:05 +0000 (22:36 -0800)]
tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in term of cpu usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.
136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.
130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>