git.openwrt.org Git - openwrt/staging/blogic.git/log

pkt_sched: Fix warning false positives.

GCC refuses to recognize that all error control flows do in fact
set err to something.

Add an explicit initialization to shut it up.

net/sched/sch_drr.c: In function ‘drr_enqueue’:
net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Signed-off-by: David S. Miller <davem@davemloft.net>

bna: Fix warning false positive.

GCC can't see that in all non-error-return paths we do in fact
set *using_dac to something.

Add an explicit initialization to remove this warning:

drivers/net/ethernet/brocade/bna/bnad.c: In function ‘bnad_pci_probe’:
drivers/net/ethernet/brocade/bna/bnad.c:3079:5: warning: ‘using_dac’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/net/ethernet/brocade/bna/bnad.c:3233:7: note: ‘using_dac’ was declared here

Signed-off-by: David S. Miller <davem@davemloft.net>

nf_defrag_ipv6: fix oops on module unloading

fix copy-paste error introduced in linux-next commit
"ipv6: add a new namespace for nf_conntrack_reasm"

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix vfs enumeration

Current VFs enumeration algorithm used in be_find_vfs does not take domain
number into the match. The match found in igb/ixgbe is more elegant and
safe.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tunnel: drop packet if ECN present with not-ECT

Linux tunnels were written before RFC6040 and therefore never
implemented the corner case of ECN getting set in the outer header
and the inner header not being ready for it.

Section 4.2.  Default Tunnel Egress Behaviour.
o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

This patch moves the ECN decap logic out of the individual tunnels
into a common place.

It also adds logging to allow detecting broken systems that
set ECN bits incorrectly when tunneling (or an intermediate
router might be changing the header).

Overloads rx_frame_error to keep track of ECN related error.

Thanks to Chris Wright who caught this while reviewing the new VXLAN
tunnel.

This code was tested by injecting faulty logic in other end GRE
to send incorrectly encapsulated packets.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xfrm: remove extranous rcu_read_lock

The handlers for xfrm_tunnel are always invoked with rcu read lock
already.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: remove unnecessary rcu_read_lock/unlock

The gre function pointers for receive and error handling are
always called (from gre.c) with rcu_read_lock already held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gre: fix handling of key 0

GRE driver incorrectly uses zero as a flag value. Zero is a perfectly
valid value for key, and the tunnel should match packets with no key only
with tunnels created without key, and vice versa.

This is a slightly visible change since previously it might be possible to
construct a working tunnel that sent key 0 and received only because
of the key wildcard of zero. I.e the sender sent key of zero, but tunnel
was defined without key.

Note: using gre key 0 requires iproute2 utilities v3.2 or later.
The original utility code was broken as well.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc: bpf_jit_comp: add XOR instruction for BPF JIT JIT

This patch is a follow-up for patch "filter: add XOR instruction for use
with X/K" that implements BPF SPARC JIT parts for the BPF XOR operation.

Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

lxt PHY: Support for the buggy LXT973 rev A2

This patch adds proper handling of the buggy revision A2 of LXT973 phy, adding
precautions linked to ERRATA Item 4:

Revision A2 of LXT973 chip randomly returns the contents of the previous even
register when you read a odd register regularly

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Don't attempt to upgrade T4 firmware when cxgb4 will end up as a slave

This patch adds a new common code routine to upgrade an adapter's
firmware. This routine handles all of the complexities of working with the
the existing adapter firmware in order to quiesce the adapter and uP, etc.
For an automatic upgrade it will send a HELLO command to check if cxgb4
want/can upgrade firmware, i.e. if cxgb4 is MASTER and has newer firmware
that it wants to load and call the new common code routine t4_fw_upgrade.
Note that it should not issue a RESET command after a successful firmware
upgrade.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Inform caller if driver didn't upgrade firmware

If a card had already been initialized, on reloading cxgb4 driver firmware
required an upgrade but the upgrade did not happen. In that case a mailbox
timeout would occur during T4 configuration file stuff. The fix is to let the
caller know the firmware was not upgraded so a reset would be issued before
starting the T4 config stuff.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for T4 hardwired driver configuration settings

In case if user defined configuration file at /lib/firmware/cxgb4/t4-config.txt
location and also factory default configuration file written to FLASH are not
present then driver will use hardwired configuration settings.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for T4 configuration file

Starting with T4 firmware version 1.3.11.0 the firmware now supports device
configuration via a Firmware Configuration File. The Firmware Configuration
File was primarily developed in order to centralize all of the configuration,
resource allocation, etc. for Unified Wire operation where multiple
Physical / Virtual Function Drivers would be using a T4 adapter simultaneously.

The Firmware Configuration file can live in three locations as shown below
in order of precedence.
1) User defined configuration file: /lib/firmware/cxgb4/t4-config.txt
2) Factory Default configuration file written to FLASH within
the manufacturing process.
3) Hardwired driver configuration.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4/cxgb4vf: Code cleanup to enable T4 Configuration File support

This patch adds new enums and macros to enable T4 configuration file support. It
also removes duplicate macro definitions.

It fixes the build failure in cxgb4vf driver introduced because of old macro
definition removal.

It also performs SGE initialization based on T4 configuration file is provided
or not. If it is provided then it uses the parameters provided in it otherwise
it uses hard coded values.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add functions to read memory via PCIE memory window

This patch implements two new functions t4_mem_win_read and t4_memory_read.
These new functions can be used to read memory via the PCIE memory window.
Please note, for proper execution of these functions PCIE_MEM_ACCESS_BASE_WIN
registers must be setup correctly like how setup_memwin in the cxgb4 driver
does it.

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Fix incorrect values for MEMWIN*_APERTURE and MEMWIN*_BASE

Signed-off-by: Jay Hernandez <jay@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipconfig: fix trivial build error

The commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add nameserver
IPs to kernel-parameter ip=") introduces ic_nameservers_predef() that defined
only for BOOTP. However it is used by ip_auto_config_setup() as well. This
patch moves it outside of #ifdef BOOTP.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Christoph Fritz <chf.fritz@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: raw: revert unrelated change

Commit 5640f7685831 ("net: use a per task frag allocator")
accidentally contained an unrelated change to net/ipv4/raw.c,
later committed (without the pr_err() debugging bits) in
net tree as commit ab43ed8b749 (ipv4: raw: fix icmp_filter())

This patch reverts this glitch, noticed by Stephen Rothwell.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

x86: bpf_jit_comp: add XOR instruction for BPF JIT

This patch is a follow-up for patch "filter: add XOR instruction for use
with X/K" that implements BPF x86 JIT parts for the BPF XOR operation.

Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

filter: add XOR instruction for use with X/K

SKF_AD_ALU_XOR_X has been added a while ago, but as an 'ancillary'
operation that is invoked through a negative offset in K within BPF
load operations. Since BPF_MOD has recently been added, BPF_XOR should
also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X
might not be an option since this is exposed to user space.

Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mipsnet: Remove the MIPSsim Ethernet driver.

The MIPSsim platform is no longer supported or used. This patch
deletes the Ethernet driver.

Signed-off-by: Steven J. Hill <sjhill@mips.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: use a per task frag allocator

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gianfar: Change default HW Tx queue scheduling mode

This is primarily to address transmission timeout occurrences, when
multiple H/W Tx queues are being used concurrently. Because in
the priority scheduling mode the controller does not service the
Tx queues equally (but in ascending index order), Tx timeouts are
being triggered rightaway for a basic test with multiple simultaneous
connections like:
iperf -c <server_ip> -n 100M -P 8

resulting in kernel trace:
NETDEV WATCHDOG: eth1 (fsl-gianfar): transmit queue <X> timed out
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255
...
and controller reset during intense traffic, and possibly further
complications.

This patch changes the default H/W Tx scheduling setting (TXSCHED)
for multi-queue devices, from priority scheduling mode to a weighted
round robin mode with equal weights for all H/W Tx queues, and
addresses the issue above.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: loopback: set default mtu to 64K

loopback current mtu of 16436 bytes allows no more than 3 MSS TCP
segments per frame, or 48 Kbytes. Changing mtu to 64K allows TCP
stack to build large frames and significantly reduces stack overhead.

Performance boost on bulk TCP transferts can be up to 30 %, partly
because we now have one ACK message for two 64KB segments, and a lower
probability of hitting /proc/sys/net/ipv4/tcp_reordering default limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove unnecessary NULL check in scm_destroy().

All callers provide a non-NULL scm argument.

Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Improve code around bnx2x_tests_str_arr

This patch changes the definition of bnx2x_tests_str_arr from static char
pointer to static const char bi-directional array. Also the
bnx2x_get_strings function is simplified.

Reported-by: Joe Perches <joe@perches.com>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Merav Sicron <meravs@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ctcm: fix error return code

Convert a nonnegative error return code to a negative one, as returned
elsewhere in the function.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
return ret;
}
// </smpl>

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers/s390/net: removes unnecessary semicolon

removes unnecessary semicolon

Found by Coccinelle: http://coccinelle.lip6.fr/

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qeth: fix possible memory leak in qeth_l3_add_[vipa|rxip]()

ipaddr has been allocated in function qeth_l3_add_vipa() but
does not free before leaving from the error handling cases. The
same problem also exists in function qeth_l3_add_rxip().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

lcs: ensure proper ccw setup

Make sure that all ccws used for writing are initialized with
zeros - especially since the last ccw contains a TIC for which
the unused fields have to be zeros.

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qeth: cleanup channel path descriptor function

Cleanup the qeth_get_channel_path_desc function and rename it
to qeth_update_from_chp_desc. No functional change.

Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Acked-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://1984.lsi.us.es/nf-next

Pablo Neira Ayuso says:

====================
This patchset contains updates for your net-next tree, they are:

* Mostly fixes for the recently pushed IPv6 NAT support:

- Fix crash while removing nf_nat modules from Patrick McHardy.
- Fix unbalanced rcu_read_unlock from Ulrich Weber.
- Merge NETMAP and REDIRECT into one single xt_target module, from
  Jan Engelhardt.
- Fix Kconfig for IPv6 NAT, which allows inconsistent configurations,
  from myself.

* Updates for ipset, all of the from Jozsef Kadlecsik:

- Add the new "nomatch" option to obtain reverse set matching.
- Support for /0 CIDR in hash:net,iface set type.
- One non-critical fix for a rare crash due to pass really
  wrong configuration parameters.
- Coding style cleanups.
- Sparse fixes.
- Add set revision supported via modinfo.i

* One extension for the xt_time match, to support matching during
  the transition between two days with one single rule, from
  Florian Westphal.

* Fix maximum packet length supported by nfnetlink_queue and add
  NFQA_CAP_LEN attribute, from myself.

You can notice that this batch contains a couple of fixes that may
go to 3.6-rc but I don't consider them critical to push them:

* The ipset fix for the /0 cidr case, which is triggered with one
  inconsistent command line invocation of ipset.

* The nfnetlink_queue maximum packet length supported since it requires
  the new NFQA_CAP_LEN attribute to provide a full workaround for the
  described problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nfnetlink_queue: add NFQA_CAP_LEN attribute

This patch adds the NFQA_CAP_LEN attribute that allows us to know
what is the real packet size from user-space (even if we decided
to retrieve just a few bytes from the packet instead of all of it).

Security software that inspects packets should always check for
this new attribute to make sure that it is inspecting the entire
packet.

This also helps to provide a workaround for the problem described
in: http://marc.info/?l=netfilter-devel&m=134519473212536&w=2

Original idea from Florian Westphal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_queue: fix maximum packet length to userspace

The packets that we send via NFQUEUE are encapsulated in the NFQA_PAYLOAD
attribute. The length of the packet in userspace is obtained via
attr->nla_len field. This field contains the size of the Netlink
attribute header plus the packet length.

If the maximum packet length is specified, ie. 65535 bytes, and
packets in the range of (65531,65535] are sent to userspace, the
attr->nla_len overflows and it reports bogus lengths to the
application.

To fix this, this patch limits the maximum packet length to 65531
bytes. If larger packet length is specified, the packet that we
send to user-space is truncated to 65531 bytes.

To support 65535 bytes packets, we have to revisit the idea of
the 32-bits Netlink attribute length.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_ct_ftp: add sequence tracking pickup facility for injected entries

This patch allows the FTP helper to pickup the sequence tracking from
the first packet seen. This is useful to fix the breakage of the first
FTP command after the failover while using conntrackd to synchronize
states.

The seq_aft_nl_num field in struct nf_ct_ftp_info has been shrinked to
16-bits (enough for what it does), so we can use the remaining 16-bits
to store the flags while using the same size for the private FTP helper
data.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xt_time: add support to ignore day transition

Currently, if you want to do something like:
"match Monday, starting 23:00, for two hours"
You need two rules, one for Mon 23:00 to 0:00 and one for Tue 0:00-1:00.

The rule: --weekdays Mo --timestart 23:00 --timestop 01:00

looks correct, but it will first match on monday from midnight to 1 a.m.
and then again for another hour from 23:00 onwards.

This permits userspace to explicitly ignore the day transition and
match for a single, continuous time period instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ixgbevf: Return error on failure to enable VLAN

With recent kernel changes we can now return errors on a failure to setup a
VLAN filter. This patch takes advantage of that opportunity so that we can
return either an EIO error in the case of a mailbox failure, or an EACCESS
error in the case of being denied access to the VLAN filter table by the
PF.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Robert Garrett <robertx.e.garrett@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: Add fix to VF to handle multi-descriptor buffers

This change fixes the ixgbevf driver so that it can correctly drop a frame
should it receive a jumbo frame.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: Fix AIM (Adaptive Interrupt Moderation)

While fixing up a patch from Alex Duyck to use q_vectors in ring containers
to update the ITR I bungled it and missed actually updating the counters
in the ring container q_vectors. This patch fixes my mistake and makes
interrupt moderation actually work.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf - Remove unused parameter in ixgbevf_receive_skb

Remove 'rx_ring' parameter as it is not used in ixgbevf_receive_skb

Signed-off-by: Narendra K <narendra_k@dell.com>
Acked-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Do not read the spoofed packets counter when not in IOV mode

The counter is not valid unless the controller is running in IOV mode.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: Fix code for handling timeout

The VF driver was not designed to correctly handle a message timeout. As
a result it is possible for one bad message to invalidate all messages
following it until the part is reset. Instead we should copy the example
in igbvf of how to handle a mailbox event and message timeout.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

netlink: Rearrange netlink_kernel_cfg to save space on 64-bit.

Suggested by Jan Engelhardt.

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: TCP Fast Open Server - record retransmits after 3WHS

When recording the number of SYNACK retransmits for servers using TCP
Fast Open, fix the code to ensure that we copy over the retransmit
count from the request_sock after we receive the ACK that completes
the 3-way handshake.

The story here is similar to that of SYNACK RTT
measurements. Previously we were always doing this in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must copy the final SYNACK retransmit
count in tcp_rcv_state_process().

Note that copying over the SYNACK retransmit count will give us the
correct count since, as is mentioned in a comment in
tcp_retransmit_timer(), before we receive an ACK for our SYN-ACK a TFO
passive connection does not retransmit anything else (e.g., data or
FIN segments).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make ZNET driver config depend on X86.

We're now using isa_virt_to_bus(), and there really
isn't a generic and consistent test for whether a
platform provides this interface or not.

This driver is also for an x86-only device.

Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: ipset: Support to match elements marked with "nomatch"

Exceptions can now be matched and we can branch according to the
possible cases:

a. match in the set if the element is not flagged as "nomatch"
b. match in the set if the element is flagged with "nomatch"
c. no match

i.e.

iptables ... -m set --match-set ... -j ...
iptables ... -m set --match-set ... --nomatch-entries -j ...
...

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Coding style fixes

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Include supported revisions in module description

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Add /0 network support to hash:net,iface type

Now it is possible to setup a single hash:net,iface type of set and
a single ip6?tables match which covers all egress/ingress filtering.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Rewrite cidr book keeping to handle /0

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

tcp: TCP Fast Open Server - call tcp_validate_incoming() for all packets

A TCP Fast Open (TFO) passive connection must call both
tcp_check_req() and tcp_validate_incoming() for all incoming ACKs that
are attempting to complete the 3WHS.

This is needed to parallel all the action that happens for a non-TFO
connection, where for an ACK that is attempting to complete the 3WHS
we call both tcp_check_req() and tcp_validate_incoming().

For example, upon receiving the ACK that completes the 3WHS, we need
to call tcp_fast_parse_options() and update ts_recent based on the
incoming timestamp value in the ACK.

One symptom of the problem with the previous code was that for passive
TFO connections using TCP timestamps, the outgoing TS ecr values
ignored the incoming TS val value on the ACK that completed the 3WHS.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: TCP Fast Open Server - note timestamps and retransmits for SYNACK RTT

Previously, when using TCP Fast Open a server would return from
tcp_check_req() before updating snt_synack based on TCP timestamp echo
replies and whether or not we've retransmitted the SYNACK. The result
was that (a) for TFO connections using timestamps we used an incorrect
baseline SYNACK send time (tcp_time_stamp of SYNACK send instead of
rcv_tsecr), and (b) for TFO connections that do not have TCP
timestamps but retransmit the SYNACK we took a SYNACK RTT sample when
we should not take a sample.

This fix merely moves the snt_synack update logic a bit earlier in the
function, so that connections using TCP Fast Open will properly do
these updates when the ACK for the SYNACK arrives.

Moving this snt_synack update logic means that with TCP_DEFER_ACCEPT
enabled we do a few instructions of wasted work on each bare ACK, but
that seems OK.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS

When taking SYNACK RTT samples for servers using TCP Fast Open, fix
the code to ensure that we only call tcp_valid_rtt_meas() after we
receive the ACK that completes the 3-way handshake.

Previously we were always taking an RTT sample in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
to take the RTT sample.

To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
already ensures that we only take a SYNACK RTT sample if snt_synack is
non-zero. To be careful, we only take a snt_synack timestamp when
a SYNACK transmit or retransmit succeeds.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: extract code to compute SYNACK RTT

In preparation for adding another spot where we compute the SYNACK
RTT, extract this code so that it can be shared.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: clarify the clock_name sysfs attribute

There has been some confusion among PHC driver authors about the
intended purpose of the clock_name attribute. This patch expands the
documation in order to clarify how the clock_name field should be
understood.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: link the phc device to its parent device

PTP Hardware Clock devices appear as class devices in sysfs. This patch
changes the registration API to use the parent device, clarifying the
clock's relationship to the underlying device.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: provide the clock's adjusted frequency

If the timex.mode field indicates a query, then we provide the value of
the current frequency adjustment.

[ Get rid of extraneous empty lines -DaveM ]

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ptp: remember the adjusted frequency

This patch adds a field to the representation of a PTP hardware clock in
order to remember the frequency adjustment value dialed by the user.

Adding this field will let us answer queries in the manner of adjtimex
in a follow on patch.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
This series contains updates to igb only.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

can: sja1000: Add support for listen-only mode and one-shot mode

One-shot mode uses the TCS bit of the status register to discern
whether a transmission was successful or not. On a failed
transmission, the frame is not echoed back.

Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

igb: Use dma_unmap_addr and dma_unmap_len defines

This change is meant to improve performance on systems that do not require
the DMA unmap calls. On those systems we do not need to make use of the
unmap address for Tx or the unmap length so we can drop both thereby
reducing the size of the Tx buffer info structure.

In addition I have changed the logic to check for unmap length instead of
unmap address when checking to see if a buffer needs to be unmapped from
DMA use. The reasons for this change is that on some platforms it is
possible to receive a valid DMA address of 0 from an IOMMU.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Simplify how we populate the RSS key

Instead of storing the RSS key as a character array we can simplify the
configuration by making it a u32 array. This allows us to just write one
value per register without any unnecessary operations to construct the
value.

This change will produce the same exact key, the only difference is that I
translated the u8 array to a u32 array which will be correctly ordered on
writes to hardware by the cpu_to_le32 operations that are built into the
writel calls.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Change how we populate the RSS indirection table

This patch cleans up our RSS indirection table configuration so that we
generate the same table regardless of CPU endianness. In addition it
changes the table setup so that instead of doing a modulo based setup it is
instead a divisor based setup. The advantage to this is that we should be
able to take the Rx hash and compute the Rx queue with very little CPU
overhead if needed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Change Tx cleanup loop to do/while instead of for

This change makes it so that Tx cleanup is done in a do/while loop instead
of a for loop. The main motivation behind this is the fact that we should
never be invoked with a budget less than 1 so we can skip checking the
budget before processing the first descriptor.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Remove logic that was doing NUMA pseudo-aware allocations

This change removes the code that was doing the NUMA allocations for the
q_vectors, rings, and ring resources. The problem is the logic used assumed
that the NUMA nodes were always interleved and that is not always the case.

At some point I hope to add this functionality back in a more controlled
manner in the future.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Fix stats output on i210/i211 parts.

Due to a hardware issue, on i210 and i211 parts, the TNCRS statistic
provides an invalid value. This patch changes the update stats function
to increment the stat only for non-i210/i211 parts.

Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Change how we check for pre-existing and assigned VFs

Adapt the pre-existing and assigned VFs code to the ixgbe way introduced
in commit 9297127b9cdd8d30c829ef5fd28b7cc0323a7bcd.

Instead of searching the enabled VFs we use pci_num_vf to determine enabled VFs.
By comparing to which PF an assigned VF is owned it's possible to decide
whether to leave it enabled or not.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Acked-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Robert Garrett <robertx.e.garrett@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

can: mscan-mpc5xxx: fix return value check in mpc512x_can_get_clock()

In case of error, the function clk_get() returns ERR_PTR()
and never returns NULL pointer. The NULL test in the error
handling should be replaced with IS_ERR().

dpatch engine is used to auto generated this patch.
(https://github.com/weiyj/dpatch)

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Wolfgang Grandegger <wg@grandegger.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: usb: peak: rename peak_usb dump_mem function

Rename generic-sounding function dump_mem() to pcan_dump_mem()
so that it does not conflict with the dump_mem() function in
arch/sh/include/asm/kdebug.h.

drivers/net/can/usb/peak_usb/pcan_usb_core.c: error: conflicting types for 'dump_mem': => 56:6
drivers/net/can/usb/peak_usb/pcan_usb_core.h: error: conflicting types for 'dump_mem': => 134:6

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Stephane Grosjean <s.grosjean@peak-system.com>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
[mkl: convert all users of dump_mem(), too]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: c_can: Adopt pinctrl support

Adopt pinctrl support to c_can driver based on c_can device
pointer, pinctrl driver configure SoC pins to d_can mode
according to definitions provided in .dts file.

In device specific device tree file 'pinctrl-names = "default";'
and 'pinctrl-0 = <&d_can1_pins>;' needs to add to configure pins
from c_can driver. d_can1_pins node contains the pinmux/config
details of d_can L/H pins.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: c_can: Add d_can suspend resume support

Adds suspend resume support to DCAN driver which enables
DCAN power down mode bit (PDR). Then DCAN will ack the local
power-down mode by setting PDA bit in STATUS register.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: c_can: Add runtime PM support to Bosch C_CAN/D_CAN controller

Add Runtime PM support to C_CAN/D_CAN controller. The runtime PM
APIs control clocks for C_CAN/D_CAN IP and prevent access to the
register of C_CAN/D_CAN IP when clock is turned off.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: c_can: Add device tree support to Bosch C_CAN/D_CAN controller

Add device tree support to C_CAN/D_CAN controller and usage details
are added to device tree documentation. Driver was tested on AM335x
EVM.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
For the of binding doc:
Reviewed-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: c_can: Modify c_can device names

Modify c_can device names from *_CAN_DEVTYPE to BOSCH_*_CAN to make
use of same names for array indexes in c_can_id_table[] as well as
device names.

This patch also add indexes to c_can_id_table array.

Signed-off-by: AnilKumar Ch <anilkumar@ti.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

netfilter: ipset: Check and reject crazy /0 input parameters

bitmap:ip and bitmap:ip,mac type did not reject such a crazy range
when created and using such a set results in a kernel crash.
The hash types just silently ignored such parameters.

Reject invalid /0 input parameters explicitely.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netfilter: ipset: Fix sparse warnings "incorrect type in assignment"

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

netlink: use <linux/export.h> instead of <linux/module.h>

Since (9f00d97 netlink: hide struct module parameter in netlink_kernel_create),
linux/netlink.h includes linux/module.h because of the use of THIS_MODULE.

Use linux/export.h instead, as suggested by Stephen Rothwell, which is
significantly smaller and defines THIS_MODULES.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sunbmac: Remove unused local variable.

Commit eb716c54b1c71ad28ab20461bff831bd481066c4 ("sunbmac: remove
unnecessary setting of skb->dev") caused the local varible 'dev'
in bigmac_init_rings to become unused. And now the compiler
warns about it.

Signed-off-by: David S. Miller <davem@davemloft.net>

team: send port changed when added

On some hw, link is not up during adding iface to team. That causes event
not being sent to userspace and that may cause confusion.
Fix this bug by sending port changed event once it's added to team.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipconfig: add nameserver IPs to kernel-parameter ip=

On small systems (e.g. embedded ones) IP addresses are often configured
by bootloaders and get assigned to kernel via parameter "ip=". If set to
"ip=dhcp", even nameserver entries from DHCP daemons are handled. These
entries exported in /proc/net/pnp are commonly linked by /etc/resolv.conf.

To configure nameservers for networks without DHCP, this patch adds option
<dns0-ip> and <dns1-ip> to kernel-parameter 'ip='.

Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
Tested-by: Jan Weitzel <j.weitzel@phytec.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200

One of the modes of Huawei E367 has this QMI/wwan interface:

I:* If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=01 Prot=07 Driver=(none)
E:  Ad=83(I) Atr=03(Int.) MxPS=  64 Ivl=2ms
E:  Ad=84(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms

Huawei use subclass and protocol to identify vendor specific
functions, so adding a new vendor rule for this combination.

The Pantech devices UML290 (106c:3718) and P4200 (106c:3721) use
the same subclass to identify the QMI/wwan function.  Replace the
existing device specific UML290 entries with generic vendor matching,
adding support for the Pantech P4200.

The ZTE MF683 has 6 vendor specific interfaces, all using
ff/ff/ff for cls/sub/prot.  Adding a match on interface #5 which
is a QMI/wwan interface.

Cc: Fangxiaozhi (Franko) <fangxiaozhi@huawei.com>
Cc: Thomas Schäfer <tschaefer@t-online.de>
Cc: Dan Williams <dcbw@redhat.com>
Cc: Shawn J. Goff <shawn7400@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: fix compile error when CONFIG_IPV6=m and CONFIG_L2TP=y

When CONFIG_IPV6=m and CONFIG_L2TP=y, I got the following compile error:

LD init/built-in.o
net/built-in.o: In function `l2tp_xmit_core':
l2tp_core.c:(.text+0x147781): undefined reference to `inet6_csk_xmit'
net/built-in.o: In function `l2tp_tunnel_create':
(.text+0x149067): undefined reference to `udpv6_encap_enable'
net/built-in.o: In function `l2tp_ip6_recvmsg':
l2tp_ip6.c:(.text+0x14e991): undefined reference to `ipv6_recv_error'
net/built-in.o: In function `l2tp_ip6_sendmsg':
l2tp_ip6.c:(.text+0x14ec64): undefined reference to `fl6_sock_lookup'
l2tp_ip6.c:(.text+0x14ed6b): undefined reference to `datagram_send_ctl'
l2tp_ip6.c:(.text+0x14eda0): undefined reference to `fl6_sock_lookup'
l2tp_ip6.c:(.text+0x14ede5): undefined reference to `fl6_merge_options'
l2tp_ip6.c:(.text+0x14edf4): undefined reference to `ipv6_fixup_options'
l2tp_ip6.c:(.text+0x14ee5d): undefined reference to `fl6_update_dst'
l2tp_ip6.c:(.text+0x14eea3): undefined reference to `ip6_dst_lookup_flow'
l2tp_ip6.c:(.text+0x14eee7): undefined reference to `ip6_dst_hoplimit'
l2tp_ip6.c:(.text+0x14ef8b): undefined reference to `ip6_append_data'
l2tp_ip6.c:(.text+0x14ef9d): undefined reference to `ip6_flush_pending_frames'
l2tp_ip6.c:(.text+0x14efe2): undefined reference to `ip6_push_pending_frames'
net/built-in.o: In function `l2tp_ip6_destroy_sock':
l2tp_ip6.c:(.text+0x14f090): undefined reference to `ip6_flush_pending_frames'
l2tp_ip6.c:(.text+0x14f0a0): undefined reference to `inet6_destroy_sock'
net/built-in.o: In function `l2tp_ip6_connect':
l2tp_ip6.c:(.text+0x14f14d): undefined reference to `ip6_datagram_connect'
net/built-in.o: In function `l2tp_ip6_bind':
l2tp_ip6.c:(.text+0x14f4fe): undefined reference to `ipv6_chk_addr'
net/built-in.o: In function `l2tp_ip6_init':
l2tp_ip6.c:(.init.text+0x73fa): undefined reference to `inet6_add_protocol'
l2tp_ip6.c:(.init.text+0x740c): undefined reference to `inet6_register_protosw'
net/built-in.o: In function `l2tp_ip6_exit':
l2tp_ip6.c:(.exit.text+0x1954): undefined reference to `inet6_unregister_protosw'
l2tp_ip6.c:(.exit.text+0x1965): undefined reference to `inet6_del_protocol'
net/built-in.o:(.rodata+0xf2d0): undefined reference to `inet6_release'
net/built-in.o:(.rodata+0xf2d8): undefined reference to `inet6_bind'
net/built-in.o:(.rodata+0xf308): undefined reference to `inet6_ioctl'
net/built-in.o:(.data+0x1af40): undefined reference to `ipv6_setsockopt'
net/built-in.o:(.data+0x1af48): undefined reference to `ipv6_getsockopt'
net/built-in.o:(.data+0x1af50): undefined reference to `compat_ipv6_setsockopt'
net/built-in.o:(.data+0x1af58): undefined reference to `compat_ipv6_getsockopt'
make: *** [vmlinux] Error 1

This is due to l2tp uses symbols from IPV6, so when IPV6
is a module, l2tp is not allowed to be builtin.

Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: combine ipt_REDIRECT and ip6t_REDIRECT

Combine more modules since the actual code is so small anyway that the
kmod metadata and the module in its loaded state totally outweighs the
combined actual code size.

IP_NF_TARGET_REDIRECT becomes a compat option; IP6_NF_TARGET_REDIRECT
is completely eliminated since it has not see a release yet.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: combine ipt_NETMAP and ip6t_NETMAP

Combine more modules since the actual code is so small anyway that the
kmod metadata and the module in its loaded state totally outweighs the
combined actual code size.

IP_NF_TARGET_NETMAP becomes a compat option; IP6_NF_TARGET_NETMAP
is completely eliminated since it has not see a release yet.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_nat: remove obsolete rcu_read_unlock call

hlist walk in find_appropriate_src() is not protected anymore by rcu_read_lock(),
so rcu_read_unlock() is unnecessary if in_range() matches.

This bug was added in (c7232c9 netfilter: add protocol independent NAT core).

Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_nat: fix oops when unloading protocol modules

When unloading a protocol module nf_ct_iterate_cleanup() is used to
remove all conntracks using the protocol from the bysource hash and
clean their NAT sections. Since the conntrack isn't actually killed,
the NAT callback is invoked twice, once for each direction, which
causes an oops when trying to delete it from the bysource hash for
the second time.

The same oops can also happen when removing both an L3 and L4 protocol
since the cleanup function doesn't check whether the conntrack has
already been cleaned up.

Pid: 4052, comm: modprobe Not tainted 3.6.0-rc3-test-nat-unload-fix+ #32 Red Hat KVM
RIP: 0010:[<ffffffffa002c303>]  [<ffffffffa002c303>] nf_nat_proto_clean+0x73/0xd0 [nf_nat]
RSP: 0018:ffff88007808fe18  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800728550c0 RCX: ffff8800756288b0
RDX: dead000000200200 RSI: ffff88007808fe88 RDI: ffffffffa002f208
RBP: ffff88007808fe28 R08: ffff88007808e000 R09: 0000000000000000
R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
R13: ffff8800787582b8 R14: ffff880078758278 R15: ffff88007808fe88
FS:  00007f515985d700(0000) GS:ffff88007cd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f515986a000 CR3: 000000007867a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 4052, threadinfo ffff88007808e000, task ffff8800756288b0)
Stack:
ffff88007808fe68 ffffffffa002c290 ffff88007808fe78 ffffffff815614e3
ffffffff00000000 00000aeb00000246 ffff88007808fe68 ffffffff81c6dc00
ffff88007808fe88 ffffffffa00358a0 0000000000000000 000000000040f5b0
Call Trace:
[<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
[<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
[<ffffffffa002c55a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
[<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
[<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
...

To fix this,

- check whether the conntrack has already been cleaned up in
  nf_nat_proto_clean

- change nf_ct_iterate_cleanup() to only invoke the callback function
  once for each conntrack (IP_CT_DIR_ORIGINAL).

The second change doesn't affect other callers since when conntracks are
actually killed, both directions are removed from the hash immediately
and the callback is already only invoked once. If it is not killed, the
second callback invocation will always return the same decision not to
kill it.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: fix IPv6 NAT dependencies in Kconfig

* NF_NAT_IPV6 requires IP6_NF_IPTABLES

* IP6_NF_TARGET_MASQUERADE, IP6_NF_TARGET_NETMAP, IP6_NF_TARGET_REDIRECT
and IP6_NF_TARGET_NPT require NF_NAT_IPV6.

This change just mirrors what IPv4 does in Kconfig, for consistency.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
This series contains updates to igb and ixgbevf.

v2: updated patch description in 04 patch (ixgbevf: scheduling while
    atomic in reset hw path)
...
Akeem G. Abodunrin (1):
  igb: Support to enable EEE on all eee_supported devices

Alexander Duyck (2):
  igb: Remove artificial restriction on RQDPC stat reading
  ixgbevf: Add support for VF API negotiation

John Fastabend (1):
  ixgbevf: scheduling while atomic in reset hw path
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net1080: Neaten netdev_dbg use

Remove unnecessary temporary variable and #ifdef DEBUG block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

USB: remove dbg() usage in USB networking drivers

The dbg() USB macro is so old, it predates me. The USB networking drivers are
the last hold-out using this macro, and we want to get rid of it, so replace
the usage of it with the proper netdev_dbg() or dev_dbg() (depending on the
context) calls.

Some places we end up using a local variable for the debug call, so also
convert the other existing dev_* calls to use it as well, to save tiny amounts
of code space.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: Document use of undefined variable.

Both tcp_timewait_state_process and tcp_check_req use the same basic
construct of

struct tcp_options received tmp_opt;
tmp_opt.saw_tstamp = 0;

then call

tcp_parse_options

However if they are fed a frame containing a TCP_SACK then tbe code
behaviour is undefined because opt_rx->sack_ok is undefined data.

This ought to be documented if it is intentional.

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Don't add TCP-code in inet_sock_destruct

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: H.K. Jerry Chu <hkchu@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

IB/ipoib: Add rtnl_link_ops support

Add rtnl_link_ops to IPoIB, with the first usage being child device
create/delete through them. Childs devices are now either legacy ones,
created/deleted through the ipoib sysfs entries, or RTNL ones.

Adding support for RTNL childs involved refactoring of ipoib_vlan_add
which is now used by both the sysfs and the link_ops code.

Also, added ndo_uninit entry to support calling unregister_netdevice_queue
from the rtnl dellink entry. This required removal of calls to
ipoib_dev_cleanup from the driver in flows which use unregister_netdevice,
since the networking core will invoke ipoib_uninit which does exactly that.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-davem' of git://git./linux/kernel/git/bwh/sfc-next

Ben Hutchings says:

====================
1. Extension to PPS/PTP to allow for PHC devices where pulses are
subject to a variable but measurable delay.
2. PPS/PTP/PHC support for Solarflare boards with a timestamping
peripheral.
3. MTD support for updating the timestamping peripheral on those boards.
4. Fix for potential over-length requests to firmware.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbevf: scheduling while atomic in reset hw path

In ixgbevf_reset_hw_vf() msleep is called while holding mbx_lock
resulting in a schedule while atomic bug with trace below.

This patch uses mdelay instead.

BUG: scheduling while atomic: ip/6539/0x00000002
2 locks held by ip/6539:
#0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81419cc3>] rtnl_lock+0x17/0x19
#1: (&(&adapter->mbx_lock)->rlock){+.+...}, at: [<ffffffffa0030855>] ixgbevf_reset+0x30/0xc1 [ixgbevf]
Modules linked in: ixgbevf ixgbe mdio libfc scsi_transport_fc 8021q scsi_tgt garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 uinput igb coretemp hwmon crc32c_intel ioatdma i2c_i801 shpchp microcode lpc_ich mfd_core i2c_core joydev dca pcspkr serio_raw pata_acpi ata_generic usb_storage pata_jmicron
Pid: 6539, comm: ip Not tainted 3.6.0-rc3jk-net-next+ #104
Call Trace:
[<ffffffff81072202>] __schedule_bug+0x6a/0x79
[<ffffffff814bc7e0>] __schedule+0xa2/0x684
[<ffffffff8108f85f>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff814bd0c0>] schedule+0x64/0x66
[<ffffffff814bb5e2>] schedule_timeout+0xa6/0xca
[<ffffffff810536b9>] ? lock_timer_base+0x52/0x52
[<ffffffff812629e0>] ? __udelay+0x15/0x17
[<ffffffff814bb624>] schedule_timeout_uninterruptible+0x1e/0x20
[<ffffffff810541c0>] msleep+0x1b/0x22
[<ffffffffa002e723>] ixgbevf_reset_hw_vf+0x90/0xe5 [ixgbevf]
[<ffffffffa0030860>] ixgbevf_reset+0x3b/0xc1 [ixgbevf]
[<ffffffffa0032fba>] ixgbevf_open+0x43/0x43e [ixgbevf]
[<ffffffff81409610>] ? dev_set_rx_mode+0x2e/0x33
[<ffffffff8140b0f1>] __dev_open+0xa0/0xe5
[<ffffffff814097ed>] __dev_change_flags+0xbe/0x142
[<ffffffff8140b01c>] dev_change_flags+0x21/0x56
[<ffffffff8141a843>] do_setlink+0x2e2/0x7f4
[<ffffffff81016e36>] ? native_sched_clock+0x37/0x39
[<ffffffff8141b0ac>] rtnl_newlink+0x277/0x4bb
[<ffffffff8141aee9>] ? rtnl_newlink+0xb4/0x4bb
[<ffffffff812217d1>] ? selinux_capable+0x32/0x3a
[<ffffffff8104fb17>] ? ns_capable+0x4f/0x67
[<ffffffff81419cc3>] ? rtnl_lock+0x17/0x19
[<ffffffff81419f28>] rtnetlink_rcv_msg+0x236/0x253
[<ffffffff81419cf2>] ? rtnetlink_rcv+0x2d/0x2d
[<ffffffff8142fd42>] netlink_rcv_skb+0x43/0x94
[<ffffffff81419ceb>] rtnetlink_rcv+0x26/0x2d
[<ffffffff8142faf1>] netlink_unicast+0xee/0x174
[<ffffffff81430327>] netlink_sendmsg+0x26a/0x288
[<ffffffff813fb04f>] ? rcu_read_unlock+0x56/0x67
[<ffffffff813f5e6d>] __sock_sendmsg_nosec+0x58/0x61
[<ffffffff813f81b7>] __sock_sendmsg+0x3d/0x48
[<ffffffff813f8339>] sock_sendmsg+0x6e/0x87
[<ffffffff81107c9f>] ? might_fault+0xa5/0xac
[<ffffffff81402a72>] ? copy_from_user+0x2a/0x2c
[<ffffffff81402e62>] ? verify_iovec+0x54/0xaa
[<ffffffff813f9834>] __sys_sendmsg+0x206/0x288
[<ffffffff810694fa>] ? up_read+0x23/0x3d
[<ffffffff811307e5>] ? fcheck_files+0xac/0xea
[<ffffffff8113095e>] ? fget_light+0x3a/0xb9
[<ffffffff813f9a2e>] sys_sendmsg+0x42/0x60
[<ffffffff814c5ba9>] system_call_fastpath+0x16/0x1b

CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-By: Robert Garrett <robertx.e.garrett@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: Add support for VF API negotiation

This change makes it so that the VF can support the PF/VF API negotiation
protocol. Specifically in this case we are adding support for API 1.0
which will mean that the VF is capable of cleaning up buffers that span
multiple descriptors without triggering an error.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Support to enable EEE on all eee_supported devices

Current implementation enables EEE on only i350 device. This patch enables
EEE on all eee_supported devices. Also, configured LPI clock to keep
running before EEE is enabled on i210 and i211 devices.

Signed-off-by: Akeem G. Abodunrin <akeem.g.abodunrin@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Remove artificial restriction on RQDPC stat reading

For some reason the reading of the RQDPC register was being artificially
limited to 4K. Instead of limiting the value we should read the value and
add the full amount. Otherwise this can lead to a misleading number of
dropped packets when the actual value is in fact much higher.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

r8169: use unlimited DMA burst for TX

The r8169 driver currently limits the DMA burst for TX to 1024 bytes. I have
a box where this prevents the interface from using the gigabit line to its full
potential. This patch solves the problem by setting TX_DMA_BURST to unlimited.

The box has an ASRock B75M motherboard with on-board RTL8168evl/8111evl
(XID 0c900880). TSO is enabled.

I used netperf (TCP_STREAM test) to measure the dependency of TX throughput
on MTU. I did it for three different values of TX_DMA_BURST ('5'=512, '6'=1024,
'7'=unlimited). This chart shows the results:
http://michich.fedorapeople.org/r8169/r8169-effects-of-TX_DMA_BURST.png

Interesting points:
- With the current DMA burst limit (1024):
   - at the default MTU=1500 I get only 842 Mbit/s.
   - when going from small MTU, the performance rises monotonically with
     increasing MTU only up to a peak at MTU=1076 (908 MBit/s). Then there's
     a sudden drop to 762 MBit/s from which the throughput rises monotonically
     again with further MTU increases.
- With a smaller DMA burst limit (512):
   - there's a similar peak at MTU=1076 and another one at MTU=564.
- With unlimited DMA burst:
   - at the default MTU=1500 I get nice 940 Mbit/s.
   - the throughput rises monotonically with increasing MTU with no strange
     peaks.

Notice that the peaks occur at MTU sizes that are multiples of the DMA burst
limit plus 52. Why 52? Because:
  20 (IP header) + 20 (TCP header) + 12 (TCP options) = 52

The Realtek-provided r8168 driver (v8.032.00) uses unlimited TX DMA burst too,
except for CFG_METHOD_1 where the TX DMA burst is set to 512 bytes.
CFG_METHOD_1 appears to be the oldest MAC version of "RTL8168B/8111B",
i.e. RTL_GIGA_MAC_VER_11 in r8169. Not sure if this MAC version really needs
the smaller burst limit, or if any other versions have similar requirements.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>