git.openwrt.org Git - openwrt/staging/blogic.git/log

net: Add function for parsing the header length out of linear ethernet frames

This patch updates some of the flow_dissector api so that it can be used to
parse the length of ethernet buffers stored in fragments. Most of the
changes needed were to __skb_get_poff as it needed to be updated to support
sending a linear buffer instead of a skb.

I have split __skb_get_poff into two functions, the first is skb_get_poff
and it retains the functionality of the original __skb_get_poff. The other
function is __skb_get_poff which now works much like __skb_flow_dissect in
relation to skb_flow_dissect in that it provides the same functionality but
works with just a data buffer and hlen instead of needing an skb.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'timestamping'

Alexander Duyck says:

====================
This change makes it so that the core path for the phy timestamping logic
is shared between skb_tx_tstamp and skb_complete_tx_timestamp.  In addition
it provides a means of using the same skb clone type path in non phy
timestamping drivers.

The main motivation for this is to enable non-phy drivers to be able to
manipulate tx timestamp skbs for such things as putting them in lists or
setting aside buffer in the context block.

v2: Incorporated suggested changes from Willem de Bruijn and Eric Dumazet
     dropped uneeded comment
     restored order of hwtstamp vs swtstamp
     added destructor for skb
    Dropped usage of skb_complete_tx_timestamp as a kfree_skb w/ destructor

v3: Updated destructor handling and dealt with socket reference counting issues

v4: Split out combining destructors into a separate patch
====================

net: merge cases where sock_efree and sock_edemux are the same function

Since sock_efree and sock_demux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots. As a result we
can avoid a bit of unnecessary code or code duplication.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: Make the clone operation stand-alone from phy timestamping

The phy timestamping takes a different path than the regular timestamping
does in that it will create a clone first so that the packets needing to be
timestamped can be placed in a queue, or the context block could be used.

In order to support these use cases I am pulling the core of the code out
so it can be used in other drivers beyond just phy devices.

In addition I have added a destructor named sock_efree which is meant to
provide a simple way for dropping the reference to skb exceptions that
aren't part of either the receive or send windows for the socket, and I
have removed some duplication in spots where this destructor could be used
in place of sock_edemux.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: Merge shared code between phy and regular timestamping

This change merges the shared bits that exist between skb_tx_tstamp and
skb_complete_tx_timestamp. By doing this we can avoid the two diverging as
there were already changes pushed into skb_tx_tstamp that hadn't made it
into the other function.

In addition this resolves issues with the fact that
skb_complete_tx_timestamp was included in linux/skbuff.h even though it was
only compiled in if phy timestamping was enabled.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: harden fnhe_hashfun()

Lets make this hash function a bit secure, as ICMP attacks are still
in the wild.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: fix allocation error in test

A buffer is incorrectly zeroed to the length of the pointer. If
cfg_payload_len < sizeof(void *) this can overwrites unrelated memory.
The buffer contents are never read, so no need to zero.

Fixes: 8fe2f761cae9 ("net-timestamp: expand documentation")
Reported-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hyperv: NULL dereference on error

We try to call free_netvsc_device(net_device) when "net_device" is NULL.
It leads to an Oops.

Fixes: f90251c8a6d0 ('hyperv: Increase the buffer length for netvsc_channel_cb()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-09-04

This series contains updates to i40e, i40evf, ixgbe and ixgbevf.

Catherine adds dual speed module support to i40e.  Updates i40e to allow
the user to change link settings when the link is down.

Serey renames i40e_ndo_set_vf_spoofck() to i40e_ndo_set_vf_spookchk()
to be more consistent with what is defined in netdev and removes a
unnecessary variable assignment.

Jesse makes a malicious driver detection warning only print if extended
driver string is enabled for i40e.  Fixes a panic under traffic load when
resetting or if/whenever there was a Tx-timeout because we were enabling
the Tx queue to early.

Anjali fixes an issue when PF reset fails, where we were trying to restart
the admin queue which has not been setup at that point.  This resolves an
occasional kernel panic when PF reset fails for some reason.

Ethan Zhao replaces the use of a local i40e_vfs_are_assigned() with the
global kernel pci_vfs_assigned() for i40e.

Alex cleans up the FDB handling for ixgbe.  This change makes it so that
the behavior for FDB handling is consistent between both the SR-IOV and
non-SR-IOV cases.  The main change is that we perform bounds checking on
the number of SR-IOV addresses regardless of if SR-IOV is enabled or not
as we can only support a certain number of addresses in the hardware.

Emil extends the pending Tx work check to the VF interfaces, where the
driver initiates a reset of the interface on link loss with pending Tx
work in order to clear the rings.  Introduces a delay for 82599 VFs of
at least 500 usecs to make sure the VFLINKS value is correct, since this
bit tends to flap when a DA or SFP+ cable is disconnected.

Jacob adds code comments in ixgbe to make it more obvious that we are
resetting features based on the fact that we do not have MSI-X enabled,
and cannot use the previous settings.  Also resolves a kernel NULL
pointer dereference by limiting the combined total of MACVLAN and
SR-IOV VFs, since the hardware has a limited number of pools available
(64).  Previously, no checks were in place to limit the number of
accelerated MACVLAN devices based on the number of pools, which would
be ok since there was already a limit for these well below the number of
available pools.  However, SR-IOV uses the very same pools, therefore
we need to ensure that the total number of pools does not exceed the
number of pools available in the hardware.

v2:
- clean up code comment in patch 5 by replacing "an" with "auto
   negotiation" based on feedback from Sergei Shtylyov
- removed un-necessary parenthesis around function call in patch 8
   based on feedback from Sergei Shtylyov
====================

net: ethernet: cpsw: improve interrupt lookup logic in cpsw_probe()

Simplify the interrupt resource lookup code in cpsw_probe() by the
following:

* Only look at the first member of the resource. As the driver only
   works for DT-enabled platforms anyway, a resource of type
   IORESOURCE_IRQ will only contain one single entry
   (res->start == res->end), so there is no need for the iteration.

* Add a bounds check to avoid overflows if we are passed more than
   ARRAY_SIZE(priv->irqs_table) resources.

* Assign 'ret' with the return value of devm_request_irq() so that
   cpsw_probe() returns the appropriate error code.

* If devm_request_irq() fails, report the error code in the log
   message.

Signed-off-by: Daniel Mack <zonque@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fix a race in update_or_create_fnhe()

nh_exceptions is effectively used under rcu, but lacks proper
barriers. Between kzalloc() and setting of nh->nh_exceptions(),
we need a proper memory barrier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 4895c771c7f00 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: fix missing line continuation

This syntax error was covered by L2TP_REFCNT_DEBUG not being set by
default.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'amd-xgbe-next'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2014-09-03

The following series of patches includes fixes/updates to the driver.

- Query the device for the actual speed mode (KR/KX) rather than trying
to track it
- Update parallel detection logic to support KR mode
- Fix new warnings from checkpatch in the amd-xgbe and amd-xgbe-phy
driver

This patch series is based on net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe-phy: Checkpatch driver fixes

This patch contains fixes identified by checkpatch when run with the
strict option.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Checkpatch driver fixes

This patch contains fixes identified by checkpatch when run with the
strict option.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe-phy: Enhance parallel detection to support KR speed

Add support to allow parallel detection to work in KR speed. With
both speed modes of KX and KR supported, KX must be checked first.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe-phy: Check device for current speed mode (KR/KX)

Since device resets can change the current mode it's possible to think
the device is in a different mode than it actually is. Rather than
trying to determine every place that is needed to set/save the current
mode, be safe and check the devices actual mode when needed rather than
trying to track it.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'r8152-next'

Hayes Wang says:

====================
r8152: random MAC address

If the interface has invalid MAC address, it couldn't
be used. In order to let it work normally, give a
random one.

v3:
Remove
ether_addr_copy(dev->perm_addr, dev->dev_addr);

v2:
Use "%pM" format specifier for printing a MAC address.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: use eth_hw_addr_random

If the hw doesn't have a valid MAC address, give a random one and
set it to the hw.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: change the location of rtl8152_set_mac_address

Exchange the location of rtl8152_set_mac_address() and
set_ethernet_addr(). Then, the set_ethernet_addr() could
set the MAC address by calling rtl8152_set_mac_address()
later.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rx_copybreak'

Govindarajulu Varadarajan says:

====================
enic: Add support for rx_copybreak

The following series implements rx_copybreak.

dma_map_single()/dma_unmap_single() is more expensive than alloc_skb & memcpy
for smaller packets. By doing this we can reuse the dma buff which is already
mapped. This is very useful when iommu is on. The default skb copybreak value
is 256.

When iommu is on, we can go much higher than 256. All the drivers that supports
rx_copybreak provides module parameter to change this value. Since module
parameter is the least preferred way for changing driver values, this series
adds ethtool support for setting rx_copybreak.

v4:
Validate tunable length in ethtool_get_tunable, not in driver implemented
function.

Loose tunable_ops array for each tunable type. Define one function and let the
driver use switch case for each type.

Use double underscore for data type in UAPI headers.
Use const qualifier where possible.

v3:
Add tunable namespace to ethtool. Use new ethtool cmd ETHTOOL_S/GTUNABLE to
set/get rx_copybreak from userspace.

v2:
Add new ethtool_cmd for DMA buffer parameters, instead of adding new members to
existing ethtool_ringparam.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

enic: Add tunable_ops support for rx_copybreak

This patch adds support for setting/getting rx_copybreak using
generic ethtool tunable.

Defines enic_get_tunable() & enic_set_tunable() to get/set rx_copybreak.
As of now, these two function supports only rx_copybreak.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: Add generic options for tunables

This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
tunable values from driver.

Add get_tunable and set_tunable to ethtool_ops. Driver implements these
functions for getting/setting tunable value.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enic: implement rx_copybreak

Calling dma_map_single()/dma_unmap_single() is quite expensive compared
to copying a small packet. So let's copy short frames and keep the buffers
mapped.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dev_ioctl: remove dev_load() CAP_SYS_MODULE message

Marcel reported to see the following message when autoloading
is being triggered when adding nlmon device:

  Loading kernel module for a network device with
  CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias
  netdev-nlmon instead.

This false-positive happens despite with having correct
capabilities set, e.g. through issuing `ip link del dev nlmon`
more than once on a valid device with name nlmon, but Marcel
has also seen it on creation time when no nlmon module is
previously compiled-in or loaded as module and the device
name equals a link type name (e.g. nlmon, vxlan, team).

Stephen says:

  The netdev module alias is a hold over from the past. For
  normal devices, people used to create a alias eth0 to and
  point it to the type of network device used, that was back
  in the bad old ISA days before real discovery.

  Also, the tunnels create module alias for the control device
  and ip used to use this to autoload the tunnel device.

  The message is bogus and should just be removed, I also see
  it in a couple of other cases where tap devices are renamed
  for other usese.

As mentioned in 8909c9ad8ff0 ("net: don't allow CAP_NET_ADMIN
to load non-netdev kernel modules"), we nevertheless still
might want to leave the old autoloading behaviour in place
as it could break old scripts, so for now, lets just remove
the log message as Stephen suggests.

Reference: http://thread.gmane.org/gmane.linux.kernel/1105168
Reported-by: Marcel Holtmann <marcel@holtmann.org>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bpf: make eBPF interpreter images read-only

With eBPF getting more extended and exposure to user space is on it's way,
hardening the memory range the interpreter uses to steer its command flow
seems appropriate. This patch moves the to be interpreted bytecode to
read-only pages.

In case we execute a corrupted BPF interpreter image for some reason e.g.
caused by an attacker which got past a verifier stage, it would not only
provide arbitrary read/write memory access but arbitrary function calls
as well. After setting up the BPF interpreter image, its contents do not
change until destruction time, thus we can setup the image on immutable
made pages in order to mitigate modifications to that code. The idea
is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks").

This is possible because bpf_prog is not part of sk_filter anymore.
After setup bpf_prog cannot be altered during its life-time. This prevents
any modifications to the entire bpf_prog structure (incl. function/JIT
image pointer).

Every eBPF program (including classic BPF that are migrated) have to call
bpf_prog_select_runtime() to select either interpreter or a JIT image
as a last setup step, and they all are being freed via bpf_prog_free(),
including non-JIT. Therefore, we can easily integrate this into the
eBPF life-time, plus since we directly allocate a bpf_prog, we have no
performance penalty.

Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
inspection of kernel_page_tables. Brad Spengler proposed the same idea
via Twitter during development of this patch.

Joint work with Hannes Frederic Sowa.

Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: systemport: update UMAC_CMD only when link is detected

When we bring the interface down, phy_stop() will schedule the PHY
state machine to call our link adjustment callback. By the time we do so,
we may have clock gated off the SYSTEMPORT hardware block, and this will
cause bus errors to happen in bcm_sysport_adj_link():

Make sure that we only touch the UMAC_CMD register when there is an
actual link. This is safe to do for two reasons:

- updating the Ethernet MAC registers only make sense when a physical
link is present
- the PHY library state machine first set phydev->link = 0 before
invoking phydev->adjust_link in the PHY_HALTED case

This is a similar fix to the GENET one:
c677ba8b3c47650358572091ed8a6af50bfca877 ("net: bcmgenet: update
UMAC_CMD only when link is detected").

Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: implement igmp_qrv sysctl to tune igmp robustness variable

As in IPv6 people might increase the igmp query robustness variable to
make sure unsolicited state change reports aren't lost on the network. Add
and document this new knob to igmp code.

RFCs allow tuning this parameter back to first IGMP RFC, so we also use
this setting for all counters, including source specific multicast.

Also take over sysctl value when upping the interface and don't reuse
the last one seen on the interface.

Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: add sysctl_mld_qrv to configure query robustness variable

This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.

Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.

IPv6 certification requests this knob to be available.

I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.

Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: limit combined total of macvlan and SR-IOV VFs

Hardware has a limited number of pools available (64). Previously, no
checks were in place to limit the number of accelerated macvlan devices
based on the number of pools. Normally this would be ok, because there
was already a limit for these well below the number of available pools.
However, SR-IOV uses the very same pools. Therefor, we need to ensure
that the total number of pools (number of VFs plus the number of non-VF
pools in use for accelerated macvlans) does not exceed the number of
pools available in hardware.

This patch resolves a kernel NULL pointer dereference caused by the following commands:

$modprobe ixgbe max_vfs=63

$ethtool -K eth2 l2-fwd-offload on

$ip link add link eth2 macvlan0 type macvlan

$ip link set dev macvlan0 up

[  992.950080] BUG: unable to handle kernel NULL pointer dereference at 0000000000000056
[  992.951109] IP: [<ffffffffa003b71e>] ixgbe_disable_fwd_ring+0x1e/0xf0 [ixgbe]
[  992.951684] PGD 22a80e067 PUD 232e9b067 PMD 0
[  992.952389] Oops: 0000 [#1] SMP
[  992.953014] Modules linked in: nfsd lockd nfs_acl exportfs auth_rpcgss oid_registry sunrpc bridge stp llc vhost_net macvtap macvlan vhost tun kvm_intel kvm ioatdma ixgbe mdio igb dca
[  992.956042] CPU: 2 PID: 11928 Comm: ifconfig Not tainted 3.16.0-rc6-net-next-07-29-2014-FCoE+ #1
[  992.956915] Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
[  992.957791] task: ffff8804341c0000 ti: ffff8801d7dc8000 task.ti: ffff8801d7dc8000
[  992.958660] RIP: 0010:[<ffffffffa003b71e>]  [<ffffffffa003b71e>] ixgbe_disable_fwd_ring+0x1e/0xf0 [ixgbe]
[  992.959613] RSP: 0018:ffff8801d7dcbbb8  EFLAGS: 00010286
[  992.960093] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[  992.960575] RDX: ffff880232eb7000 RSI: 0000000000000000 RDI: ffff88022dc05800
[  992.961059] RBP: ffff8801d7dcbbd8 R08: 0000000000000000 R09: 0000000000000000
[  992.961541] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88022ec20980
[  992.962023] R13: ffff880232eb7000 R14: 0000000000000001 R15: 0000000000000001
[  992.962508] FS:  00007fab264887a0(0000) GS:ffff880237640000(0000) knlGS:0000000000000000
[  992.963378] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  992.963858] CR2: 0000000000000056 CR3: 000000022a939000 CR4: 00000000001427e0
[  992.964340] Stack:
[  992.964806]  ffff88022ec28840 ffff88022ec20980 ffff88022dc05800 ffff880232eb7000
[  992.965976]  ffff8801d7dcbc28 ffffffffa003bae8 ffff8801d7dcbbe8 0000000000000400
[  992.967147]  000000000000000d ffff88022ec20980 ffff88022ec20000 ffff88022dc05800
[  992.968319] Call Trace:
[  992.968795]  [<ffffffffa003bae8>] ixgbe_fwd_ring_up+0x88/0x280 [ixgbe]
[  992.969284]  [<ffffffffa0041d83>] ixgbe_fwd_add+0x173/0x220 [ixgbe]
[  992.969767]  [<ffffffffa015056c>] macvlan_open+0x1bc/0x230 [macvlan]
[  992.970256]  [<ffffffff816b8de7>] __dev_open+0xd7/0x150
[  992.970735]  [<ffffffff816b8bd7>] __dev_change_flags+0xa7/0x170
[  992.971220]  [<ffffffff816b8ccb>] dev_change_flags+0x2b/0x70
[  992.971703]  [<ffffffff817471b2>] devinet_ioctl+0x602/0x6d0
[  992.972184]  [<ffffffff81748168>] inet_ioctl+0x78/0x90
[  992.972666]  [<ffffffff816a143b>] sock_do_ioctl+0x2b/0x70
[  992.973146]  [<ffffffff816a14ed>] sock_ioctl+0x6d/0x260
[  992.973627]  [<ffffffff811ad3b4>] do_vfs_ioctl+0x84/0x540
[  992.974109]  [<ffffffff811a4c81>] ? final_putname+0x21/0x50
[  992.974593]  [<ffffffff818725d5>] ? sysret_check+0x22/0x5d
[  992.975073]  [<ffffffff811ad901>] SyS_ioctl+0x91/0xa0
[  992.975550]  [<ffffffff818725a9>] system_call_fastpath+0x16/0x1b
[  992.976026] Code: ff 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 4c 8b a7 08 02 00 00 <44> 0f b6 6e 56 44 03 af 14 02 00 00 4c 89 e7 e8 5e f2 ff ff be
[  992.982261] RIP  [<ffffffffa003b71e>] ixgbe_disable_fwd_ring+0x1e/0xf0 [ixgbe]
[  992.983212]  RSP <ffff8801d7dcbbb8>
[  992.983681] CR2: 0000000000000056
[  992.984248] ---[ end trace 9f54802b5cc3638b ]---

Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: add comment noting recalculation of queues

Since we previously called ixgbe_set_num_queues just prior to attempting
to set our interrupt scheme, it may be non obvious why we have to call
it again inside the function. Add a comment which helps make it more
obvious that we are resetting features based on the fact that we do not
have MSI-X enabled, and cannot use the previous settings.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: introduce delay for checking VFLINKS on 82599

VFLINKS.LINKUP bit tends to flap when a DA or SFP+ cable is disconnected.
It can take up to 500 usecs for the LINKUP bit to be correct.

This patch resolves the issue by introducing a delay for 82599 VFs of at
least 500 usecs to make sure the VFLINKS value is correct.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: reset interface on link loss with pending Tx work from the VF

ixgbe initiates a reset of the interface on link loss with pending Tx work
in order to clear the rings.

This patch extends the pending Tx work check to the VF interfaces with the
same purpose.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Cleanup FDB handling code

This change makes it so that the behavior for FDB handling is consistent
between both the SR-IOV and non-SR-IOV cases. The main change here is that we
perform bounds checking on the number of SR-IOV addresses regardless of if
SR-IOV is enabled or not as we can only support a certain number of addresses
in the hardware.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: use global pci_vfs_assigned() to replace local i40e_vfs_are_assigned()

There is global funcion pci_vfs_assigned(), so use it instead of composing
local one.

Signed-off-by: Ethan Zhao <ethan.kernel@gmail.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Bump i40e/i40evf versions

Bump i40e version to 1.0.11 and i40evf version to 1.0.5.

Change-ID: I63a60fa2efe82aae87a8a3095f43218db57d46ce
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>

i40e: fix panic due to too-early Tx queue enable

This fixes the panic under traffic load when resetting. This issue
could also show up if/whenever there is a Tx-timeout.

Change-ID: Ie393a1f17fd5d962e56fc3bfe784899ef25402f5
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Fix an issue when PF reset fails

We shouldn't restart Admin queue subtask if PF reset fails since we do
not have the AQ setup at that point. This patch makes sure we disable AQ
clean subtask when PF reset fails.

This will resolve an occasional kernel panic when PF reset fails for
some reason.

Change-ID: I11a747773362a8c5c0ad7a10cd34be0bda8eb9e8
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: make warning less verbose

The driver is un-necessarily printing a warning that is only marginally
useful to the user. Make the warning only print if extended driver
string printing is enabled, other messages related to a reset event
will still continue to print.

Change-ID: I5e8beca6516a2f176cd2e72b0ac2b3b909e6c953
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Tell OS link is going down when calling set_phy_config

Since we don't seem to be getting an LSE telling us link is going down
during set_phy_config (but we do get an LSE telling us we are coming
back up), fake one for the OS and tell them link is going down. Also
do an atomic restart no matter what because there are times the user
may want to end with link up even if they started with link down (like
if they accidentally set it to a speed that can't link and are trying to
fix it).

Change-ID: I0a642af9c1d0feb67bce741aba1a9c33bd349ed6
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Remove unnecessary assignment

Remove unnecessary setting of "ret" variable as it's already set at
the top of the function.

Change-ID: Icaccfc67f335817a23579b7c43625d59ad6c9925
Signed-off-by: Serey Kong <serey.kong@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Change wording to be more consistent

Change "spoofck" to "spoofchk" to be consistent with as defined in netdev.

Change-ID: I9866d6284cb5f92c8d71dc0776c6d1e71dfb62a5
Signed-off-by: Serey Kong <serey.kong@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Allow user to change link settings if link is down

Allow the user to change auto-negotiation and speed settings if
link is down.

Change-ID: I372967c627682b5e1835f623a7cbf41b21b51043
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Add dual speed module support

Now that fw has implemented dual speed module support, we can add ours.
Also, add the phy type for 1G LR/SR and set its media type to fiber.
Lastly, instead of a WARN_ON if the phy type is not recognized just print
a warning.

Change-ID: I2e5227d4a8c2907b0ed423038e5dbce774e466b0
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

mlx4_en: Convert the normal skb free path to dev_consume_skb_any()

It would appear the mlx4_en driver was still making a call to
dev_kfree_skb_any() where dev_consume_skb_any() would be more
appropriate. This should make dropped packet profiling/tracking
easier/better over a NIC driven by mlx4_en.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib/rhashtable: allow user to set the minimum shifts of shrinking

Although rhashtable library allows user to specify a quiet big size
for user's created hash table, the table may be shrunk to a
very small size - HASH_MIN_SIZE(4) after object is removed from
the table at the first time. Subsequently, even if the total amount
of objects saved in the table is quite lower than user's initial
setting in a long time, the hash table size is still dynamically
adjusted by rhashtable_shrink() or rhashtable_expand() each time
object is inserted or removed from the table. However, as
synchronize_rcu() has to be called when table is shrunk or
expanded by the two functions, we should permit user to set the
minimum table size through configuring the minimum number of shifts
according to user specific requirement, avoiding these expensive
actions of shrinking or expanding because of calling synchronize_rcu().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

qdisc: validate frames going through the direct_xmit path

In commit 50cbe9ab5f8d ("net: Validate xmit SKBs right when we
pull them out of the qdisc") the validation code was moved out of
dev_hard_start_xmit and into dequeue_skb.

However this overlooked the fact that we do not always enqueue
the skb onto a qdisc. First situation is if qdisc have flag
TCQ_F_CAN_BYPASS and qdisc is empty. Second situation is if
there is no qdisc on the device, which is a common case for
software devices.

Originally spotted and inital patch by Alexander Duyck.
As a result Alex was seeing issues trying to connect to a
vhost_net interface after commit 50cbe9ab5f8d was applied.

Added a call to validate_xmit_skb() in __dev_xmit_skb(), in the
code path for qdiscs with TCQ_F_CAN_BYPASS flag, and in
__dev_queue_xmit() when no qdisc.

Also handle the error situation where dev_hard_start_xmit() could
return a skb list, and does not return dev_xmit_complete(rc) and
falls through to the kfree_skb(), in that situation it should
call kfree_skb_list().

Fixes: 50cbe9ab5f8d ("net: Validate xmit SKBs right when we pull them out of the qdisc")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qdisc: exit case fixes for skb list handling in qdisc layer

More minor fixes to merge commit 53fda7f7f9e (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.

Fixing exit cases in qdisc_reset() and qdisc_destroy(), where a
leftover requeued SKB (qdisc->gso_skb) can have the potential of
being a skb list, thus use kfree_skb_list().

This is a followup to commit 10770bc2d1 ("qdisc: adjustments for
API allowing skb list xmits").

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qdisc: adjustments for API allowing skb list xmits

Minor adjustments for merge commit 53fda7f7f9e (Merge branch 'xmit_list')
that allows us to work with a list of SKBs.

Update code doc to function sch_direct_xmit().

In handle_dev_cpu_collision() use kfree_skb_list() in error handling.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: arc: remove unused dev

Remove unused 'dev' variable from arc_emac_remove(), since it's
not being used any more.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: nvidia: Remove extra parens

Remove unnecessary double parenthesis around if statement.

Signed-off-by: David Wood <devel@dtwood.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'netdev_modified'

Nicolas Dichtel says:

====================
rtnl: send notification in do_setlink()

This series ensures to call the notifier chain and to send a netlink
message when a change is done by do_setlink().

The three first patches mainly prepare the last one, which do this change.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rtnl/do_setlink(): notify when a netdev is modified

Depending on which parameters were updated, the changes were not propagated via
the notifier chain and netlink.

The new flag has been set only when the change did not cause a call to the
notifier chain and/or to the netlink notification functions.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnl/do_setlink(): last arg is now a set of flags

There is no functional changes with this commit, it only prepares the next one.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnl/do_setlink(): set modified when IFLA_LINKMODE is updated

The only effect of this patch is to print a warning if IFLA_LINKMODE is updated
and a following change fails.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnl/do_setlink(): set modified when IFLA_TXQLEN is updated

The only effect of this patch is to print a warning if IFLA_TXQLEN is updated
and a following change fails.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'be2net-next'

Sathya Perla says:

====================
be2net: patch set

v2 changes: add a new line after variable declaration in patch 12.

***
Patch 1 adds a few new log messages to help debugging in failure cases.

Patch 2 uses new macros for parsing RX/TX completions and TX wrbs to
help shorten the lines.

Patch 3 adds a description for the RX counter rx_input_fifo_overflow_drop.

Patch 4 adds TX completion error statistics reporting via ethtool.

Patch 5 adds a dma_mapping_error counter and its reporting via ethtool.

Patch 6 fixes up log messages in the Lancer FW download path.

Patch 7 replaces gotos with direct return statements.

Patch 8 cleans up be_change_mtu() code by using a new macro BE_MAX_MTU

Patch 9 makes be_cmd_get_regs() routine to return an integer status
similar to other FW cmd routines in be_cmds.c

Patch 10 gets rid of TX budget as enforcing a budget on TX completion
processing in NAPI is neither suggested nor it provides a performance benefit.

Patch 11 defines and uses a new macro for_all_tx_queues_on_eq() similar
to the RX processing code.

Patch 12 queries max_tx_qs from the FW for BE3 super-nic profiles.
For those profiles, the driver cannot assume a constant BE3_MAX_TX_QS value,
as the value may change for each function.

Please consider applying this patch set to the net-next tree. Thanks!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: query max_tx_qs for BE3 super-nic profile from FW

In the BE3 super-nic profile, the max_tx_qs value can vary for each function.
So the driver needs to query this value from FW instead of using the
pre-defined constant BE3_MAX_TX_QS.

Signed-off-by: Suresh Reddy <Suresh.Reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: define macro for_all_tx_queues_on_eq()

Replace the for() loop that traverses all the TX queues on an EQ
with the macro for_all_tx_queues_on_eq(). With this expalnatory
name, the one line comment is not required anymore.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: get rid of TX budget

Enforcing a budget on the TX completion processing in NAPI doesn't
benefit performance in anyway. Just get rid of it.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: make be_cmd_get_regs() return a status

There are a few failure cases in be_cmd_get_regs() that ideally must return
an error value. This style is used across all the routines in be_cmds.c with
this routine being an exception. This patch fixes this.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: define BE_MAX_MTU

This patch defines a new macro BE_MAX_MTU to make the code in be_change_mtu()
more readable.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: remove unncessary gotos

In cases where there is no extra code to handle an error, this patch replaces
gotos with a direct return statement.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix log messages in lancer FW download path

Log messages in the Lancer FW download path have issues such as:
- a single message spanning multiple lines
- the success message is logged even in failure cases
- status codes are already logged in the FW cmd routines
This patch fixes these issues.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Add a dma_mapping_error counter in ethtool

Add a dma_mapping_error counter to count the number of packets dropped
due to DMA mapping errors.

Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: Add TX completion error statistics in ethtool

HW reports TX completion errors in TX completion. This patch adds these
counters to ethtool statistics.

Signed-off-by: Kalesh AP <kalesh.purayil@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: add a description for counter rx_input_fifo_overflow_drop

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: shorten AMAP_GET/SET_BITS() macro calls

The AMAP_GET/SET_BITS() macro calls take structure name as a parameter
and hence are long and span more than one line. Replace these calls
with a wrapper macros for RX/Tx compls and TX wrb. This results in fewer
lines and more readable code in be_main.c

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: add a few log messages

This patch adds the following log messages to help debugging
failure cases:
1) log FW version number: this is useful when driver initialization
fails and the FW version number cannot be queried via ethtool
2) per function resource limits for BEx chips: these values are
currently being printed only for Skyhawk and Lancer
3) PCI BAR mapping failure
4) function_mode/caps queried from FW: this helps catch any FW bugs
that could advertise wrong capabilities to the driver

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: deduplicate errqueue dequeue

sk->sk_error_queue is dequeued in four locations. All share the
exact same logic. Deduplicate.

Also collapse the two critical sections for dequeue (at the top of
the recv handler) and signal (at the bottom).

This moves signal generation for the next packet forward, which should
be harmless.

It also changes the behavior if the recv handler exits early with an
error. Previously, a signal for follow-up packets on the errqueue
would then not be scheduled. The new behavior, to always signal, is
arguably a bug fix.

For rxrpc, the change causes the same function to be called repeatedly
for each queued packet (because the recv handler == sk_error_report).
It is likely that all packets will fail for the same reason (e.g.,
memory exhaustion).

This code runs without sk_lock held, so it is not safe to trust that
sk->sk_err is immutable inbetween releasing q->lock and the subsequent
test. Introduce int err just to avoid this potential race.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: expand documentation

Expand Documentation/networking/timestamping.txt with new
interfaces and bytestream timestamping. Also minor
cleanup of the other text.

Import txtimestamp.c test of the new features.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'csums-next'

Tom Herbert says:

====================
net: Checksum offload changes - Part VI

I am working on overhauling RX checksum offload. Goals of this effort
are:

- Specify what exactly it means when driver returns CHECKSUM_UNNECESSARY
- Preserve CHECKSUM_COMPLETE through encapsulation layers
- Don't do skb_checksum more than once per packet
- Unify GRO and non-GRO csum verification as much as possible
- Unify the checksum functions (checksum_init)
- Simplify code

What is in this seventh patch set:

- Add skb->csum. This allows a device or GRO to indicate that an
invalid checksum was detected.
- Checksum unncessary to checksum complete conversions.

With these changes, I believe that the third goal of the overhaul is
now mostly achieved. In the case of no encapsulation or one layer of
encapsulation, there should only be at most one skb_checksum over
each packet (between GRO and normal path). In the case of two layers
of encapsulation, it is still possible with the right combination of
non-zero and zero UDP checksums to have >1 skb_checksum. For instance:
IP>GRE(with csum)>IP>UDP(zero csum)>VXLAN>IP>UDP(non-zero csum),
would likely necessiate an skb_checksum in GRO and normal path.
This doesn't seem like a common scenario at all so I'm inclined to
not address this now, if multiple layers of encapsulation becomes
popular we can reassess.

Note that checksum conversion shows a nice improvement for RX VXLAN when
outer UDP checksum is enabled (12.65% CPU compared to 20.94%). This
is not only from the fact that we don't need checksum calculation on
the host, but also allows GRO for VXLAN in this case. Checksum
conversion does not help send side (which still needs to perform
a checksum on host). For that we will implement remote checksum offload
in a later patch
(http://tools.ietf.org/html/draft-herbert-remotecsumoffload-00).

Please review carefully and test if possible, mucking with basic
checksum functions is always a little precarious :-)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: Enable checksum unnecessary conversions for l2tp/UDP sockets

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: Enable checksum unnecessary conversions for vxlan/UDP sockets

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: Add support for checksum unnecessary conversions

Call skb_checksum_try_convert and skb_gro_checksum_try_convert
after checksum is found present and validated in the GRE header
for normal and GRO paths respectively.

In GRO path, call skb_gro_checksum_try_convert

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: Add support for doing checksum unnecessary conversion

Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE
conversion in UDP tunneling path.

In the normal UDP path, we call skb_checksum_try_convert after locating
the UDP socket. The check is that checksum conversion is enabled for
the socket (new flag in UDP socket) and that checksum field is
non-zero.

In the UDP GRO path, we call skb_gro_checksum_try_convert after
checksum is validated and checksum field is non-zero. Since this is
already in GRO we assume that checksum conversion is always wanted.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Infrastructure for checksum unnecessary conversions

For normal path, added skb_checksum_try_convert which is called
to attempt to convert CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE. The
primary condition to allow this is that ip_summed is CHECKSUM_NONE
and csum_valid is true, which will be the state after consuming
a CHECKSUM_UNNECESSARY.

For GRO path, added skb_gro_checksum_try_convert which is the GRO
analogue of skb_checksum_try_convert. The primary condition to allow
this is that NAPI_GRO_CB(skb)->csum_cnt == 0 and
NAPI_GRO_CB(skb)->csum_valid is set. This implies that we have consumed
all available CHECKSUM_UNNECESSARY checksums in the GRO path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Support for csum_bad in skbuff

This flag indicates that an invalid checksum was detected in the
packet. __skb_mark_checksum_bad helper function was added to set this.

Checksums can be marked bad from a driver or the GRO path (the latter
is implemented in this patch). csum_bad is checked in
__skb_checksum_validate_complete (i.e. calling that when ip_summed ==
CHECKSUM_NONE).

csum_bad works in conjunction with ip_summed value. In the case that
ip_summed is CHECKSUM_NONE and csum_bad is set, this implies that the
first (or next) checksum encountered in the packet is bad. When
ip_summed is CHECKSUM_UNNECESSARY, the first checksum after the last
one validated is bad. For example, if ip_summed == CHECKSUM_UNNECESSARY,
csum_level == 1, and csum_bad is set-- then the third checksum in the
packet is bad. In the normal path, the packet will be dropped when
processing the protocol layer of the bad checksum:
__skb_decr_checksum_unnecessary called twice for the good checksums
changing ip_summed to CHECKSUM_NONE so that
__skb_checksum_validate_complete is called to validate the third
checksum and that will fail since csum_bad is set.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: rename rx_buf_sz

The variable "rx_buf_sz" is used by both tx and rx buffers. Replace
it with "agg_buf_sz".

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mdio-bcm-unimac: NULL-terminate unimac_mdio_ids

drivers/net/phy/mdio-bcm-unimac.c:195:37-38: unimac_mdio_ids is not NULL
terminated at line 195

Make sure of_device_id tables are NULL terminated
Generated by: scripts/coccinelle/misc/of_table.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: make dsa_pack_type static

net/dsa/dsa.c:624:20: sparse: symbol 'dsa_pack_type' was not declared.
Should it be static?

Fixes: 3e8a72d1dae374 ("net: dsa: reduce number of protocol hooks")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: add slave_changelink support and use it for queue_id

This patch adds support for slave_changelink to the bonding and uses it
to give the ability to change the queue_id of the enslaved devices via
netlink. It sets slave_maxtype and uses bond_changelink as a prototype for
bond_slave_changelink.
Example/test command after the iproute2 patch:
ip link set eth0 type bond_slave queue_id 10

CC: David S. Miller <davem@davemloft.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: whitespace fixes

Fix places where there is space before tab, long lines, and
awkward if(){, double spacing etc. Add blank line after declaration/initialization.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: systemport: tell RXCHK if we are using Broadcom tags

When Broadcom tags are enabled, e.g: when interfaced to an Ethernet
switch, make sure that we tell the RXCHK engine that it should be
expecting a 4-bytes Broadcom tag after the Ethernet MAC Source Address.

Use netdev_uses_dsa() to check for that condition since that will tell
us if a switch is attached to our network interface.

Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: add flag NO_TIMESTAMP to disable timestamping

Then testing the TX limits of the stack, then it is useful to
be-able to disable the do_gettimeofday() timetamping on every packet.

This implements a pktgen flag NO_TIMESTAMP which will disable this
call to do_gettimeofday().

The performance change on (my system E5-2695) with skb_clone=0, goes
from TX 2,423,751 pps to 2,567,165 pps with flag NO_TIMESTAMP. Thus,
the cost of do_gettimeofday() or saving is approx 23 nanosec.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: fix tunneled GSO over IPv6

Set correct bit for packed description.

Introduced in e42780b66aab88d3a82b6087bcd6095b90eecde7
bnx2x: Utilize FW 7.10.51

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: prevent incorrect byte-swap in BE

Fixes incorrectly defined struct in FW HSI for BE platform.
Affects tunneling, tx-switching and anti-spoofing.

Introduced in e42780b66aab88d3a82b6087bcd6095b90eecde7
bnx2x: Utilize FW 7.10.51

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dmitry Kravkov <Dmitry.Kravkov@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: add name distributor resiliency queue

TIPC name table updates are distributed asynchronously in a cluster,
entailing a risk of certain race conditions. E.g., if two nodes
simultaneously issue conflicting (overlapping) publications, this may
not be detected until both publications have reached a third node, in
which case one of the publications will be silently dropped on that
node. Hence, we end up with an inconsistent name table.

In most cases this conflict is just a temporary race, e.g., one
node is issuing a publication under the assumption that a previous,
conflicting, publication has already been withdrawn by the other node.
However, because of the (rtt related) distributed update delay, this
may not yet hold true on all nodes. The symptom of this failure is a
syslog message: "tipc: Cannot publish {%u,%u,%u}, overlap error".

In this commit we add a resiliency queue at the receiving end of
the name table distributor. When insertion of an arriving publication
fails, we retain it in this queue for a short amount of time, assuming
that another update will arrive very soon and clear the conflict. If so
happens, we insert the publication, otherwise we drop it.

The (configurable) retention value defaults to 2000 ms. Knowing from
experience that the situation described above is extremely rare, there
is no risk that the queue will accumulate any large number of items.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: refactor name table updates out of named packet receive routine

We need to perform the same actions when processing deferred name
table updates, so this functionality is moved to a separate
function.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: reduce the number of Tx

Because the Tx has the features of stopping queue and aggregation,
We don't need many tx buffers. Change the tx number from 10 to 4
to reduce the usage of the memory. This could save 16K * 6 bytes
memory.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xmit_list'

David Miller says:

====================
net: Make dev_hard_start_xmit() work fundamentally on lists

After this patch set, dev_hard_start_xmit() will work fundemantally on
any and all SKB lists.

This opens the path for a clean implementation of pulling multiple
packets out during qdisc_restart(), and then passing that blob in one
shot to dev_hard_start_xmit().

There were two main architectural blockers to this:

1) The GSO handling, we kept the original GSO head SKB around simply
   because dev_hard_start_xmit() had no way to communicate to the
   caller how far into the segmented list it was able to go.  Now it
   can, so the head GSO can be liberated immediately.

   All of the special GSO head SKB destructor et al. handling goes
   away too.

2) Validate of VLAN, CSUM, and segmentation characteristics was being
   performed inside of dev_hard_start_xmit().  If want to truly batch,
   we have to let the higher levels to this.  In particular, this is
   now dequeue_skb()'s job.

And with those two issues out of the way, it should now be trivial to
build experiments on top of this patch set, all of the framework
should be there now.  You could do something as simple as:

skb = q->dequeue(q);
if (skb)
skb = validate_xmit_skb(skb, qdisc_dev(q));
if (skb) {
struct sk_buff *new, *head = skb;
int limit = 5;

do {
new = q->dequeue(q);
if (new)
new = validate_xmit_skb(new, qdisc_dev(q));
if (new) {
skb->next = new;
skb = new;
}
} while (new && --limit);
skb = head;
}

inside of the else branch of dequeue_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>

net: xmit_list() becomes dev_hard_start_xmit().

Now fundamentally we can process lists of SKBs as cheaply
as single packets.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Don't keep around original SKB when we software segment GSO frames.

Just maintain the list properly by returning the head of the remaining
SKB list from dev_hard_start_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Validate xmit SKBs right when we pull them out of the qdisc.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Separate out SKB validation logic from transmit path.

dev_hard_start_xmit() does two things, it first validates and
canonicalizes the SKB, then it actually sends it.

Make a set of helper functions for doing the first part.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Have xmit_list() signal more==true when appropriate.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Pass a "more" indication down into netdev_start_xmit() code paths.

For now it will always be false.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Move main gso loop out of dev_hard_start_xmit() into helper.

There is a slight policy change happening here as well.

The previous code would drop the entire rest of the GSO skb if any of
them got, for example, a congestion notification.

That makes no sense, anything NET_XMIT_MASK and below is something
like congestion or policing. And in the congestion case it doesn't
even mean the packet was actually dropped.

Just continue until dev_xmit_complete() evaluates to false.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Create xmit_one() helper for dev_hard_start_xmit()

Hopefully making the code a bit easier to read and digest.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Do txq_trans_update() in netdev_start_xmit()

That way we don't have to audit every call site to make sure it is
doing this properly.

Signed-off-by: David S. Miller <davem@davemloft.net>