Julia Lawall [Thu, 15 May 2014 03:43:19 +0000 (05:43 +0200)]
net/ariadne: delete unneeded call to netdev_priv
Netdev_priv is an accessor function, and has no purpose if its result is
not used.
A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Thu, 15 May 2014 03:43:18 +0000 (05:43 +0200)]
drivers/net/wan: delete unneeded call to netdev_priv
Netdev_priv is an accessor function, and has no purpose if its result is
not used.
A simplified version of the semantic match that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@ local idexpression x; @@
-x = netdev_priv(...);
... when != x
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 15 May 2014 02:32:14 +0000 (19:32 -0700)]
net: systemport: pad packets to a minimum of 68 bytes
Packets need to be at least 64 bytes to enter the switch port logic,
including the FCS, otherwise they will be discarded as RUNT packets.
With packets having Broadcom tags, the 4-bytes tag is first stripped
off the packet, and the packet length is then checked, so we need to
make sure that the packet length with FCS is at least 64 bytes.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Thu, 15 May 2014 02:32:13 +0000 (19:32 -0700)]
net: systemport: only update UMAC_CMD if something changed
The link adjustment callback can be called as frequently as desired by
the PHY library, as such, let's avoid doing a Read/Modify/Write sequence
if nothing changed, which is more than likely since we are interfaced
with a switch device.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 14 May 2014 19:15:13 +0000 (12:15 -0700)]
ti: Remove trailing semicolon from do {...} while (0) macro
These should not have trailing semicolons so remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 May 2014 20:32:20 +0000 (16:32 -0400)]
Merge branch 'filter-next'
Alexei Starovoitov says:
====================
internal BPF jit for x64 and JITed seccomp
Internal BPF JIT compiler for x86_64 replaces classic BPF JIT.
Use it in seccomp and in tracing filters (sent as separate patch)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Wed, 14 May 2014 02:50:47 +0000 (19:50 -0700)]
seccomp: JIT compile seccomp filter
Take advantage of internal BPF JIT
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i <
10000000; i++)
syscall(...);
$ sudo sysctl net.core.bpf_jit_enable=1
$ time ./bench
real 0m2.769s
user 0m1.136s
sys 0m1.624s
$ sudo sysctl net.core.bpf_jit_enable=0
$ time ./bench
real 0m5.825s
user 0m1.268s
sys 0m4.548s
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Wed, 14 May 2014 02:50:46 +0000 (19:50 -0700)]
net: filter: x86: internal BPF JIT
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.
Performance:
1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before
Example assembler code for BPF filter "tcpdump port 22"
original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq
On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.
The difference in generated assembler is due to the following:
Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.
New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.
Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.
2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on a filter.
The longer the filter the higher performance gain.
Synthetic benchmarks with many ancillary loads see 20x speedup
which seems to be the maximum gain from JIT
Notes:
. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler
. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF
seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Wed, 14 May 2014 02:50:45 +0000 (19:50 -0700)]
net: filter: x86: split bpf_jit_compile()
Split bpf_jit_compile() into two functions to improve readability
of for(pass++) loop. The change follows similar style of JIT compilers
for arm, powerpc, s390
The body of new do_jit() was not reformatted to reduce noise
in this patch, since the following patch replaces most of it.
Tested with BPF testsuite.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 May 2014 19:51:52 +0000 (15:51 -0400)]
Merge branch 'ieee802154-next'
Phoebe Buckheister says:
====================
802154: some cleanups and fixes
This series adds some definitions for 802.15.4 header fields that were missing,
changes 6lowpan fragmentation to be aware of security headers and fixes
802.15.4 datagram socket sendmsg(), which was entirely incompliant to date.
Also a few minor changes to mac_cb handling, mark a single-use function static,
and correctly check for EMSGSIZE conditions during wpan_header_create.
Changes since v1:
* rename mac_cb_alloc to mac_cb_init
* catch all error cases of sendmsg() instead of only !conn && msg_name
* redo 6lowpan fragmentation to not clone lower layer headers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:11 +0000 (17:43 +0200)]
mac802154: make mac802154_wpan_open static
This function is only used within the same translation unit, so mark it
static.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:10 +0000 (17:43 +0200)]
ieee802154: fix dgram socket sendmsg()
802.15.4 datagram sockets do not currently have a compliant sendmsg().
The destination address supplied is always ignored, and in unconnected
mode, packets are broadcast instead of dropped with -EDESTADDRREQ. This
patch fixes 802.15.4 dgram sockets to be compliant, i.e.
!conn && !msg_name => -EDESTADDRREQ
!conn && msg_name => send to msg_name
conn && !msg_name => send to connected
conn && msg_name => -EISCONN
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:09 +0000 (17:43 +0200)]
6lowpan: fix fragmentation
Currently, 6lowpan creates one 802.15.4 MAC header for the original
packet the device was given by upper layers and reuses this header for
all fragments, if fragmentation is required. This also reuses frame
sequence numbers, which must not happen. 6lowpan also has issues with
fragmentation in the presence of security headers, since those may imply
the presence of trailing fields that are not accounted for by the
fragmentation code right now.
Fix both of these issues by properly allocating fragment skbs with
headromm and tailroom as specified by the underlying device, create one
header for each skb instead of reusing the original header, let the
underlying device do the rest.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:08 +0000 (17:43 +0200)]
ieee802154: change _cb handling slightly
The current mac_cb handling of ieee802154 is rather awkward and limited.
Decompose the single flags field into multiple fields with the meanings
of each subfield of the flags field to make future extensions (for
example, link-layer security) easier. Also don't set the frame sequence
number in upper layers, since that's a thing the MAC is supposed to set
on frame transmit - we set it on header creation, but assuming that
upper layers do not blindly duplicate our headers, this is fine.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:07 +0000 (17:43 +0200)]
mac802154: account for all header parts during wpan header creationg
The current WPAN header creation code checks for EMSGSIZE conditions,
but does not account for the MIC field that link layer security may add
at the end of the frame. Now that we can accurately calculate the
maximum payload size of packets, use that to check for EMSGSIZE
conditions.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phoebe Buckheister [Wed, 14 May 2014 15:43:06 +0000 (17:43 +0200)]
ieee802154: add definitions for link-layer security and header functions
When dealing with 802.15.4, one often has to know the maximum payload
size for a given packet. This depends on many factors, one of which is
whether or not a security header is present in the frame. These
definitions and functions provide an easy way for any upper layer to
calculate the maximum payload size for a packet. The first obvious user
for this is 6lowpan, which duplicates this calculation and gets it
partially wrong because it ignores security headers.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Lottmann [Wed, 14 May 2014 12:02:04 +0000 (14:02 +0200)]
drivers: net: Register Micrel ksz884x network devices in PCI device tree.
This unifies the behaviour with other network device drivers and
allows for a matching of the PCI device path in UDEV rules.
Signed-off-by: Markus Lottmann <markus.lottmann1986@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 May 2014 17:43:14 +0000 (13:43 -0400)]
net: Fix CONFIG_SYSCTL ifdef test.
> include/net/ip.h:211:5: warning: "CONFIG_SYSCTL" is not defined [-Wundef]
> #if CONFIG_SYSCTL
> ^
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 May 2014 17:42:20 +0000 (13:42 -0400)]
Merge branch 'cpsw_cleanups'
George Cherian says:
====================
TI CPSW Cleanup
This series does some minimal cleanups.
-Conversion of pr_*() to dev_*()
-Convert kzalloc to devm_kzalloc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
George Cherian [Mon, 12 May 2014 04:51:21 +0000 (10:21 +0530)]
drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().
Convert kzalloc() to devm_kzalloc().
Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
George Cherian [Mon, 12 May 2014 04:51:20 +0000 (10:21 +0530)]
net: davinci_mdio: Convert pr_err() to dev_err() call
Convert the lone pr_err() to dev_err() call.
Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
George Cherian [Mon, 12 May 2014 04:51:19 +0000 (10:21 +0530)]
driver net: cpsw: Convert pr_*() to dev_*() calls
Convert all pr_*() calls to dev_*() calls.
No functional changes.
Signed-off-by: George Cherian <george.cherian@ti.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Darek Marcinkiewicz [Wed, 14 May 2014 06:04:32 +0000 (08:04 +0200)]
driver/net/ethernet/ec_bhf.c: fix sparse warnings
Sparse was reporting quite a few warnings for the driver.
Those get fixed by this patch.
Signed-off-by: Dariusz Marcinkiewicz <reksio@newterm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 14 May 2014 03:30:07 +0000 (20:30 -0700)]
net: Use a more standard macro for INET_ADDR_COOKIE
Missing a colon on definition use is a bit odd so
change the macro for the 32 bit case to declare an
__attribute__((unused)) and __deprecated variable.
The __deprecated attribute will cause gcc to emit
an error if the variable is actually used.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingoo Han [Wed, 14 May 2014 03:15:42 +0000 (12:15 +0900)]
net: systemport: Use devm_ioremap_resource()
Use devm_ioremap_resource() because devm_request_and_ioremap() is
obsoleted by devm_ioremap_resource().
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 14 May 2014 19:41:05 +0000 (15:41 -0400)]
Merge branch 'mlx4-next'
Or Gerlitz says:
====================
Mellanox driver update 2014-05-12
This patchset introduce some small bug fixes:
Eyal fixed some compilation and syntactic checkers warnings. Ido fixed a
coruption in user priority mapping when changing number of channels. Shani
fixed some other problems when modifying MAC address. Yuval fixed a problem
when changing IRQ affinity during high traffic - IRQ changes got effective
only after the first pause in traffic.
This patchset was tested and applied over commit
93dccc5: "mdio_bus: fix
devm_mdiobus_alloc_size export"
Changes from V1:
- applied feedback from Dave to use true/false and not 0/1 in patch 1/9
- removed the patch from Noa which adddressed a bug in flow steering table
when using a bond device, as the fix might need to be in the bonding driver,
this is now dicussed in the netdev thread "bonding directly changes
underlying device address"
Changes from V0:
- Patch 1/9 - net/mlx4_core: Enforce irq affinity changes immediatly
- Moved the new members to a hot cache line as Eric suggested
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Perry [Wed, 14 May 2014 09:15:17 +0000 (12:15 +0300)]
net/mlx4_core: Fix inaccurate return value of mlx4_flow_attach()
Adopt the "info: why not propagate 'ret' from parse_trans_rule()..."
suggestion made by the smatch semantic checker on:
drivers/net/ethernet/mellanox/mlx4/mcg.c:867 mlx4_flow_attach()
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Perry [Wed, 14 May 2014 09:15:16 +0000 (12:15 +0300)]
net/mlx4_en: Using positive error value for unsigned
Using a positive value for error: MLX4_NET_TRANS_RULE_NUM instead
of -EPROTONOSUPPORT, to remove compilation warning.
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shani Michaelli [Wed, 14 May 2014 09:15:15 +0000 (12:15 +0300)]
net/mlx4_en: Protect MAC address modification with the state_lock mutex
This Patches solves an issue that could raise when modifying the
device's MAC. It occurs due to a simultaneous access to priv->mac_hash
from two contexts. The buggy scenario described below:
Context 1: copy the new mac address to the dev->dev_addr field.
Context 2: mlx4_en_do_uc_filter removes prev_mac entry from the mac_hash
db since it is not in dev->uc and not equal to dev->dev_addr.
Context 1: mlx4_en_do_set_mac() calls mlx4_en_replace_mac() to replace
prev_mac with dev_addr but it fails to update the mac_hash db
since it no longer contains prev_mac, therefore it returns
with an error.
The fix is to prevent mlx4_en_do_uc_filter from being executed by both
of the context 1 calls described above, This is done by putting them
both under the mdev->state_lock lock, it will solve this issue since
mlx4_en_do_uc_filter is already protected by the mdev->state_lock.
Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Perry [Wed, 14 May 2014 09:15:14 +0000 (12:15 +0300)]
net/mlx4_core: Removed unnecessary bit operation condition
Fix the "warn: suspicious bitop condition" made by the smatch semantic
checker on:
drivers/net/ethernet/mellanox/mlx4/main.c:509 mlx4_slave_cap()
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Perry [Wed, 14 May 2014 09:15:13 +0000 (12:15 +0300)]
net/mlx4_core: Fix smatch error - possible access to a null variable
Fix the "error: we previously assumed 'out_param' could be null" found
by smatch semantic checker on:
drivers/net/ethernet/mellanox/mlx4/cmd.c:506 mlx4_cmd_poll()
drivers/net/ethernet/mellanox/mlx4/cmd.c:578 mlx4_cmd_wait()
Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shani Michaelli [Wed, 14 May 2014 09:15:12 +0000 (12:15 +0300)]
net/mlx4_en: Fix errors in MAC address changing when port is down
This patch fix an issue that happen when changing the MAC address when
the port is down, described as follows:
1. Set the port down.
2. Change the MAC address - mlx4_en_set_mac() will change dev->dev_addr.
3. Set the port up - will result in mlx4_en_do_uc_filter that will
remove the prev_mac entry from the mac_hash db.
4. Changing the MAC address again will eventually trigger the call to
mlx4_en_replace_mac() in order to replace prev_mac with dev_addr but
the prev_mac entry is already not exist in the mac_hash db therefore
the operation fails.
The fix is to set the prev_mac with the new MAC address so in step 3
above, after setting the port up mlx4_en_get_qp() is updating the
mac_hash with the entry of dev_addr which is equal to prev_mac.
Therefore in step 4, when calling mlx4_en_replace_mac, the entry related
to prev_mac exist in mac_hash and the replace operation succeed.
Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Shamay [Wed, 14 May 2014 09:15:11 +0000 (12:15 +0300)]
net/mlx4_en: User prio mapping gets corrupted when changing number of channels
When using ethtool set_channels, mlx4_en_setup_tc is always called, even
when it was not configured. Fixed code to call mlx4_en_setup_tc() only
if needed.
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Atias [Wed, 14 May 2014 09:15:10 +0000 (12:15 +0300)]
net/mlx4_core: Enforce irq affinity changes immediatly
During heavy traffic, napi is constatntly polling the complition queue
and no interrupt is fired. Because of that, changes to irq affinity are
ignored until traffic is stopped and resumed.
By registering to the irq notifier mechanism, and forcing interrupt when
affinity is changed, irq affinity changes will be immediatly enforced.
Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Tue, 13 May 2014 06:39:27 +0000 (14:39 +0800)]
macvlan: Propagate lowerdev MTU changes
When the physical MTU changes we should ensure that all existing MACVLAN
dev MTU do not exceed the new lowerdev MTU. This patch adds that
propagation.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Tue, 13 May 2014 02:57:58 +0000 (10:57 +0800)]
dccp: make the request_retries minimum is 1
In Documentation/networking/dccp.txt points that request_retries
should be greater than 0. So make the extra1 to be &one instead
of &zero.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Mon, 12 May 2014 23:52:02 +0000 (16:52 -0700)]
snmp: fix some left over of snmp stats
Fengguang reported the following sparse warning:
>> net/ipv6/proc.c:198:41: sparse: incorrect type in argument 1 (different address spaces)
net/ipv6/proc.c:198:41: expected void [noderef] <asn:3>*mib
net/ipv6/proc.c:198:41: got void [noderef] <asn:3>**pcpumib
Fixes: commit 698365fa1874aa7635d51667a3 (net: clean up snmp stats code)
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Mon, 12 May 2014 23:04:53 +0000 (16:04 -0700)]
ipv4: make ip_local_reserved_ports per netns
ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.
By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Laurent Pinchart [Mon, 12 May 2014 23:04:36 +0000 (01:04 +0200)]
irda: sh_irda: Enable driver compilation with COMPILE_TEST
This helps increasing build testing coverage.
Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 14 May 2014 19:20:19 +0000 (15:20 -0400)]
Merge branch 'tipc-next'
Jon Maloy says:
====================
tipc: bug fixes and improvements
Intensive and extensive testing has revealed some rather infrequent
problems related to flow control, buffer handling and link
establishment. Commits ##1 to 4 deal with these problems.
The remaining four commits are just code improvments, aiming at
making the code more comprehensible and maintainable. There are
no functional enhancements in this series.
v2: Fixed a typo in commit log #2. Otherwise no changes from v1.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:15 +0000 (05:39 -0400)]
tipc: merge port message reception into socket reception function
In order to reduce complexity and save a call level during message
reception at port/socket level, we remove the function tipc_port_rcv()
and merge its functionality into tipc_sk_rcv().
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:14 +0000 (05:39 -0400)]
tipc: clean up neigbor discovery message reception
The function tipc_disc_rcv(), which is handling received neighbor
discovery messages, is perceived as messy, and it is hard to verify
its correctness by code inspection. The fact that the task it is set
to resolve is fairly complex does not make the situation better.
In this commit we try to take a more systematic approach to the
problem. We define a decision machine which takes three state flags
as input, and produces three action flags as output. We then walk
through all permutations of the state flags, and for each of them we
describe verbally what is going on, plus that we set zero or more of
the action flags. The action flags indicate what should be done once
the decision machine has finished its job, while the last part of the
function deals with performing those actions.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:13 +0000 (05:39 -0400)]
tipc: improve and extend media address conversion functions
TIPC currently handles two media specific addresses: Ethernet MAC
addresses and InfiniBand addresses. Those are kept in three different
formats:
1) A "raw" format as obtained from the device. This format is known
only by the media specific adapter code in eth_media.c and
ib_media.c.
2) A "generic" internal format, in the form of struct tipc_media_addr,
which can be referenced and passed around by the generic media-
unaware code.
3) A serialized version of the latter, to be conveyed in neighbor
discovery messages.
Conversion between the three formats can only be done by the media
specific code, so we have function pointers for this purpose in
struct tipc_media. Here, the media adapters can install their own
conversion functions at startup.
We now introduce a new such function, 'raw2addr()', whose purpose
is to convert from format 1 to format 2 above. We also try to as far
as possible uniform commenting, variable names and usage of these
functions, with the purpose of making them more comprehensible.
We can now also remove the function tipc_l2_media_addr_set(), whose
job is done better by the new function.
Finally, we expand the field for serialized addresses (format 3)
in discovery messages from 20 to 32 bytes. This is permitted
according to the spec, and reduces the risk of problems when we
add new media in the future.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:12 +0000 (05:39 -0400)]
tipc: rename and move message reassembly function
The function tipc_link_frag_rcv() is in reality a re-entrant generic
message reassemby function that has nothing in particular to do with
the link, where it is defined now. This becomes obvious when we see
the need to call the function from other places in the code.
In this commit rename it to tipc_buf_append() and move it to the file
msg.c. We also simplify its signature by moving the tail pointer to
the control block of the head buffer, hence making the head buffer
self-contained.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:11 +0000 (05:39 -0400)]
tipc: mark head of reassembly buffer as non-linear
The message reassembly function does not update the 'len' and 'data_len'
fields of the head skbuff correctly when fragments are chained to it.
This may sometimes lead to obsure errors, such as fragment reordering
when we receive fragments which are cloned buffers.
This commit fixes this, by ensuring that the two fields are updated
correctly.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:10 +0000 (05:39 -0400)]
tipc: don't record link RESET or ACTIVATE messages as traffic
In the current code, all incoming LINK_PROTOCOL messages, irrespective
of type, nudge the "last message received" checkpoint, informing the
link state machine that a message was received from the peer since last
supervision timeout event. This inhibits the link from starting probing
the peer unnecessarily.
However, not only STATE messages are recorded as legitimate incoming
traffic this way, but even RESET and ACTIVATE messages, which in
reality are there to inform the link that the peer endpoint has been
reset. At the same time, some RESET messages may be dropped instead
of causing a link reset. This happens when the link endpoint thinks
it is fully up and working, and the session number of the RESET is
lower than or equal to the current link session. In such cases the
RESET is perceived as a delayed remnant from an earlier session, or
the current one, and dropped.
Now, if a TIPC module is removed and then immediately reinserted, e.g.
when using a script, RESET messages may arrive at the peer link endpoint
before this one has had time to discover the failure. The RESET may be
dropped because of the session number, but only after it has been
recorded as a legitimate traffic event. Hence, the receiving link will
not start probing, and not discover that the peer endpoint is down, at
the same time ignoring the periodic RESET messages coming from that
endpoint. We have ended up in a stale state where a failed link cannot
be re-established.
In this commit, we remedy this by nudging the checkpoint only for
received STATE messages, not for RESET or ACTIVATE messages.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:09 +0000 (05:39 -0400)]
tipc: compensate for double accounting in socket rcv buffer
The function net/core/sock.c::__release_sock() runs a tight loop
to move buffers from the socket backlog queue to the receive queue.
As a security measure, sk_backlog.len of the receiving socket
is not set to zero until after the loop is finished, i.e., until
the whole backlog queue has been transferred to the receive queue.
During this transfer, the data that has already been moved is counted
both in the backlog queue and the receive queue, hence giving an
incorrect picture of the available queue space for new arriving buffers.
This leads to unnecessary rejection of buffers by sk_add_backlog(),
which in TIPC leads to unnecessarily broken connections.
In this commit, we compensate for this double accounting by adding
a counter that keeps track of it. The function socket.c::backlog_rcv()
receives buffers one by one from __release_sock(), and adds them to the
socket receive queue. If the transfer is successful, it increases a new
atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
transferred buffer. If a new buffer arrives during this transfer and
finds the socket busy (owned), we attempt to add it to the backlog.
However, when sk_add_backlog() is called, we adjust the 'limit'
parameter with the value of the new counter, so that the risk of
inadvertent rejection is eliminated.
It should be noted that this change does not invalidate the original
purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
doesn't respect the send window) keeps pumping in buffers to
sk_add_backlog(), he will eventually reach an upper limit,
(2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
to the backlog, and the connection will be broken. Ordinary, well-
behaved senders will never reach this buffer limit at all.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 14 May 2014 09:39:08 +0000 (05:39 -0400)]
tipc: decrease connection flow control window
Memory overhead when allocating big buffers for data transfer may
be quite significant. E.g., truesize of a 64 KB buffer turns out
to be 132 KB, 2 x the requested size.
This invalidates the "worst case" calculation we have been
using to determine the default socket receive buffer limit,
which is based on the assumption that 1024x64KB = 67MB buffers
may be queued up on a socket.
Since TIPC connections cannot survive hitting the buffer limit,
we have to compensate for this overhead.
We do that in this commit by dividing the fix connection flow
control window from 1024 (2*512) messages to 512 (2*256). Since
older version nodes send out acks at 512 message intervals,
compatibility with such nodes is guaranteed, although performance
may be non-optimal in such cases.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Mon, 12 May 2014 07:08:43 +0000 (15:08 +0800)]
bonding: alloc the structure ad_info dynamically in per slave
The struct ad_slave_info is very huge, and only be used for 802.3ad mode,
so alloc the structure dynamically could save 356 Bits for every slave in
non 802.3ad mode.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Mon, 12 May 2014 22:30:14 +0000 (02:30 +0400)]
sh_eth: replace devm_kzalloc() with devm_kmalloc_array()
When I was converting the driver to the managed device API, only devm_kzalloc()
was available for memory allocation, so I had to use it, despite zeroing out the
PHY IRQ array right before initializing all its entries to PHY_POLL was quite
stupid. Now that devm_kmalloc_array() has become available, we can avoid the
needless zeroing out...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 22:39:01 +0000 (18:39 -0400)]
Merge branch 'tg3-next'
Michael Chan says:
====================
tg3: TSO related enhancements to prevent memory allocation failure
Michael Chan (3):
tg3: Don't modify ip header fields when doing GSO
tg3: Prevent page allocation failure during TSO workaround
tg3: Update copyright and version to 3.137
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Mon, 12 May 2014 03:22:55 +0000 (20:22 -0700)]
tg3: Update copyright and version to 3.137
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Mon, 12 May 2014 03:22:54 +0000 (20:22 -0700)]
tg3: Prevent page allocation failure during TSO workaround
If any TSO fragment hits hardware bug conditions (e.g. 4G boundary), the
driver will workaround by calling skb_copy() to copy to a linear SKB. Users
have reported page allocation failures as the TSO packet can be up to 64K.
Copying such a large packet is also very inefficient. We fix this by using
existing tg3_tso_bug() to transmit the packet using GSO.
Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Mon, 12 May 2014 03:22:53 +0000 (20:22 -0700)]
tg3: Don't modify ip header fields when doing GSO
tg3 uses GSO as workaround if the hardware cannot perform TSO on certain
packets. We should not modify the ip header fields if we do GSO on the
packet. It happens to work by accident because GSO recalculates the IP
checksum and IP total length.
Also fix the tg3_start_xmit comment to reflect that this is the only
xmit function for all devices.
Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 22:35:18 +0000 (18:35 -0400)]
Merge branch 'inet_fwmark_reflect'
Lorenzo Colitti says:
====================
Make mark-based routing work better with multiple separate networks.
Mark-based routing (ip rule fwmark 17 lookup 100) combined with
either iptables marking (iptables -j MARK --set-mark 17) or
application-based marking (the SO_MARK setsockopt) are a good
way to deal with connecting simultaneously to multiple networks.
Each network can be given a routing table, and ip rules can
be configured to make different fwmarks select different
networks. Applications can select networks them by setting
appropriate socket marks, and iptables rules can be used to
handle non-aware applications, enforce policy, etc.
This patch series improves functionality when mark-based routing
is used in this way. Current behaviour has the following
limitations:
1. Kernel-originated replies that are not associated with a
socket always use a mark of zero. This means that, for
example, when the kernel sends a ping reply or a TCP reset,
it does not send it on the network from which it received the
original packet.
2. Path MTU discovery, which is triggered by incoming packets,
does not always work correctly, because the routing lookups it
uses to clone routes do not take the fwmark into account and
thus can happen in the wrong routing table.
3. Application-based marking works well for outbound connections,
but does not work well for incoming connections. Marking a
listening socket causes that socket to only accept
connections from a given network, and sockets that are
returned by accept() are not marked (and are thus not routed
correctly).
sysctl. This causes route lookups for kernel-generated replies
and PMTUD to use the fwmark of the packet that caused them.
which causes TCP sockets returned by accept() to be marked with
the same mark that sent the intial SYN packet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Tue, 13 May 2014 17:17:35 +0000 (10:17 -0700)]
net: support marking accepting TCP sockets
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Tue, 13 May 2014 17:17:34 +0000 (10:17 -0700)]
net: Use fwmark reflection in PMTU discovery.
Currently, routing lookups used for Path PMTU Discovery in
absence of a socket or on unmarked sockets use a mark of 0.
This causes PMTUD not to work when using routing based on
netfilter fwmark mangling and fwmark ip rules, such as:
iptables -j MARK --set-mark 17
ip rule add fwmark 17 lookup 100
This patch causes these route lookups to use the fwmark from the
received ICMP error when the fwmark_reflect sysctl is enabled.
This allows the administrator to make PMTUD work by configuring
appropriate fwmark rules to mark the inbound ICMP packets.
Black-box tested using user-mode linux by pointing different
fwmarks at routing tables egressing on different interfaces, and
using iptables mangling to mark packets inbound on each interface
with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
work as expected when mark reflection is enabled and fail when
it is disabled.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Colitti [Tue, 13 May 2014 17:17:33 +0000 (10:17 -0700)]
net: add a sysctl to reflect the fwmark on replies
Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.
This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.
Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 22:02:49 +0000 (18:02 -0400)]
Merge branch 'arc_emac-next'
Beniamino Galvani says:
====================
arc_emac: promiscuous/multicast mode and netpoll support
These patches add support for promiscuous mode, multicast filtering
and netpoll to the ARC EMAC driver.
They were both tested on a Radxa Rock board which uses a ARC EMAC IP
core integrated in the Rockchip RK3188 SoC.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Beniamino Galvani [Sun, 11 May 2014 16:11:48 +0000 (18:11 +0200)]
arc_emac: add netpoll support
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Beniamino Galvani [Sun, 11 May 2014 16:11:47 +0000 (18:11 +0200)]
arc_emac: implement promiscuous mode and multicast filtering
This patch implements the set_rx_mode function to enable/disable
promiscuous or all-multicast modes and to update the multicast
filtering list of the device.
Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 21:53:46 +0000 (17:53 -0400)]
Merge branch 'tcp-fastopen-ipv6'
Yuchung Cheng says:
====================
tcp: IPv6 support for fastopen server
This patch series add IPv6 support for fastopen server. To minimize
code duplication in IPv4 and IPv6, the current v4 only code is
refactored and common code is moved into net/ipv4/tcp_fastopen.c.
Also the current code uses a different function from
tcp_v4_send_synack() to send the first SYN-ACK in fastopen.
The new code eliminates this separate function by refactoring the
child-socket and syn-ack creation code. After these refactoring
in the first four patches, we can easily add the fastopen code in
IPv6 by changing corresponding IPv6 functions.
Note Fast Open client already supports IPv6. This patch is for
the server-side (passive open) IPv6 support only.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Lee [Mon, 12 May 2014 03:22:13 +0000 (20:22 -0700)]
tcp: IPv6 support for fastopen server
After all the preparatory works, supporting IPv6 in Fast Open is now easy.
We pretty much just mirror v4 code. The only difference is how we
generate the Fast Open cookie for IPv6 sockets. Since Fast Open cookie
is 128 bits and we use AES 128, we use CBC-MAC to encrypt both the
source and destination IPv6 addresses since the cookie is a MAC tag.
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Mon, 12 May 2014 03:22:12 +0000 (20:22 -0700)]
tcp: improve fastopen icmp handling
If a fast open socket is already accepted by the user, it should
be treated like a connected socket to record the ICMP error in
sk_softerr, so the user can fetch it. Do that in both tcp_v4_err
and tcp_v6_err.
Also refactor the sequence window check to improve readability
(e.g., there were two local variables named 'req').
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Mon, 12 May 2014 03:22:11 +0000 (20:22 -0700)]
tcp: use tcp_v4_send_synack on first SYN-ACK
To avoid large code duplication in IPv6, we need to first simplify
the complicate SYN-ACK sending code in tcp_v4_conn_request().
To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to
initialize the mini socket's receive window before trying to
create the child socket and/or building the SYN-ACK packet. So we move
that initialization from tcp_make_synack() to tcp_v4_conn_request()
as a new function tcp_openreq_init_req_rwin().
After this refactoring the SYN-ACK sending code is simpler and easier
to implement Fast Open for IPv6.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Mon, 12 May 2014 03:22:10 +0000 (20:22 -0700)]
tcp: simplify fast open cookie processing
Consolidate various cookie checking and generation code to simplify
the fast open processing. The main goal is to reduce code duplication
in tcp_v4_conn_request() for IPv6 support.
Removes two experimental sysctl flags TFO_SERVER_ALWAYS and
TFO_SERVER_COOKIE_NOT_CHKD used primarily for developmental debugging
purposes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Mon, 12 May 2014 03:22:09 +0000 (20:22 -0700)]
tcp: move fastopen functions to tcp_fastopen.c
Move common TFO functions that will be used by both v4 and v6
to tcp_fastopen.c. Create a helper tcp_fastopen_queue_check().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Daniel Lee <longinus00@gmail.com>
Signed-off-by: Jerry Chu <hkchu@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 21:46:19 +0000 (17:46 -0400)]
Merge branch 'cdc_mbim-next'
Bjørn Mork says:
====================
cdc_mbim: cleanups and new features
This series depends on commit
6b5eeb7f874b ("net: cdc_mbim: handle
unaccelerated VLAN tagged frames"), which is currently in "net" but
not yet in "net-next".
Patch 4 might have a minor context collision with the "cdc_ncm: add
buffer tuning and stats using ethtool" series I just posted for
review. Please let me know if I should submit an adjusted version
in either direction. These two series' are otherwise completely
independent of each other.
The major new feature here is in patch 1, which I hope will solve
some problems with the original design without changing the existing
API, optionally allowing IP session 0 to be treated like any other
MBIM session.
The rest are some minor cleanups and finally some documentation of
the current driver APIs, after this series has been applied. I
started feeling a bit more mortal than usual lately, which probably
is healthy, and realized that I should put some of the stuff in my
head in a somewhat less volatile storage.
v2:
Fixed patch 1 so that it actually does what it claims to do. This time
it is even tested for functionality, and not just build tested...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Sun, 11 May 2014 08:47:15 +0000 (10:47 +0200)]
net: cdc_ncm/cdc_mbim: rework probing of NCM/MBIM functions
The NCM class match in the cdc_mbim driver is confusing and
cause unexpected behaviour. The USB core guarantees that a
USB interface is in altsetting 0 when probing starts. This
means that devices implementing a NCM 1.0 backwards
compatible MBIM function (a "NCM/MBIM function") always hit
the NCM entry in the cdc_mbim driver match table. Such
functions will never match any of the MBIM entries.
This causes unexpeced behaviour for cases where the NCM and
MBIM entries are differet, which is currently the case for
all except Ericsson devices.
Improve the probing of NCM/MBIM functions by looking up the
device again in the cdc_mbim match table after switching to
the MBIM identity.
The shared altsetting selection is updated to better
accommodate the new probing logic, returning the preferred
altsetting for the control interface instead of the data
interface. The control interface altsetting update is moved
to the cdc_mbim driver. It is never necessary to change the
control interface altsetting for NCM.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Reported by: Yu-an Shih <yshih@nvidia.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Sun, 11 May 2014 08:47:14 +0000 (10:47 +0200)]
net: cdc_mbim: add driver documentation
An initial attempt on describing some of the odd APIs
provided by this driver.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Sun, 11 May 2014 08:47:13 +0000 (10:47 +0200)]
net: cdc_mbim: reject IP packets on DSS VLANs
DSS VLANs are pseudo network interfaces representing arbitrary
data streams, and specifically not IP. Preventing spurious IP
packets can sometimes be a hassle. The kernel will for example
send an IPv6 Router Solicit when the interface is brought up
unless the user has been careful enough to disable IPv6 first.
Such packets forwared to a MBIM DSS session will look like
spurious noise to the device, and can cause it to log an error
or even malfunction.
Drop all IP packets on the designated DSS VLANs to prevent such
unwanted spurious transmissions.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Reported-by: Arnaud Desmier <adesmier@sequans.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Sun, 11 May 2014 08:47:12 +0000 (10:47 +0200)]
net: cdc_mbim: optionally use VLAN ID 4094 for IP session 0
The cdc_mbim driver maps 802.1q VLANs to MBIM IP and DSS
sessions. MBIM IP session 0 is handled as an exception and
is mapped to untagged frames.
This patch adds optional support for remapping MBIM IP
session 0 to 802.1q VLAN ID 4094 instead. The default
behaviour is not changed. The new behaviour is triggered
by adding a link for this previously unsupported VLAN.
The untagged mapping was chosen initially to support the
assumed most common use case: Most current MBIM devices only
support a single IP session (i.e. session 0 only), and using
untagged frames lets the users completely ignore the
additonal complexity of the multiplexing layer.
But when the multiplexing features of MBIM are used, then
this netdev gets a double meaning: It becomes the master
interface for all the VLAN subdevs the additional sessions
are mapped to, while still serving as the untagged IP
interface for session 0.
This can be problematic, especially when using Device Service
Streams (DSS), as have become apparent recently with the
availability of devices with real DSS support. Some use cases
need to e.g set a MTU which is higher than allowed for IP
Session 0. The dual role also leads to the situation where
the IP Session 0 interface cannot be taken down without
breaking unrelated IP or DSS sessions - a devastating side
effect which applications managing a simple IP session cannot
be expected to be aware of. A typical DHCP client will assume
that it should bring the interface down after releasing the
IP lease.
These problems can be avoided by tagging IP session 0 packets
too, making this session similar to all other multiplexed
sessions. This redefines the main netdev as an upper master
interface only.
Cc: Greg Suarez <gsuarez@smithmicro.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wilfried Klaebe [Sun, 11 May 2014 00:12:32 +0000 (00:12 +0000)]
net: get rid of SET_ETHTOOL_OPS
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathias Krause [Sat, 10 May 2014 20:23:28 +0000 (22:23 +0200)]
net: ptp: mark filter as __initdata
sk_unattached_filter_create() will copy the filter's instructions so we
don't need to have the master copy hanging around after initialization.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Tue, 13 May 2014 16:58:44 +0000 (09:58 -0700)]
net: fix test_bpf build to depend on NET
Fix build when CONFIG_NET is not enabled.
Fixes these build errors:
WARNING: "sk_unattached_filter_destroy" [lib/test_bpf.ko] undefined!
WARNING: "kfree_skb" [lib/test_bpf.ko] undefined!
WARNING: "sk_unattached_filter_create" [lib/test_bpf.ko] undefined!
WARNING: "sk_run_filter_int_skb" [lib/test_bpf.ko] undefined!
WARNING: "__alloc_skb" [lib/test_bpf.ko] undefined!
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 17:13:33 +0000 (13:13 -0400)]
net: filter: Fix redefinition warnings on x86-64.
Do not collide with the x86-64 PTRACE user API namespace.
net/core/filter.c:57:0: warning: "R8" redefined [enabled by default]
arch/x86/include/uapi/asm/ptrace-abi.h:38:0: note: this is the location of the previous definition
Fix by adding a BPF_ prefix to the register macros.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Sat, 10 May 2014 18:33:54 +0000 (11:33 -0700)]
bnx2x: fix build when BNX2X_SRIOV is not enabled
Fix build when BNX2X_SRIOV is not enabled.
Change one parameter struct from bnx2 to bnx2x and don't return a value
from a void function.
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h:576:48: warning: 'struct bnx2' declared inside parameter list [enabled by default]
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h:576:48: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h:576:60: warning: 'return' with a value, in function returning void [enabled by default]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 May 2014 04:11:07 +0000 (00:11 -0400)]
Merge branch 'cpsw'
Mugunthan V N says:
====================
Add DRA7xx and AM43xx platform support in cpsw-phy-sel driver
Adding DRA7xx and AM43xx platform support to cpsw-phy-sel driver to select
phy mode in control driver and fixing the uninitialized dev by initializing
to platform device structure pointer.
Changes from Initial version
* Added back the missing patch (1/3)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Fri, 9 May 2014 13:37:35 +0000 (19:07 +0530)]
drivers: net: cpsw-phy-sel: add am43xx platform support
AM43xx phy mode selection is similar to AM33xx platform, so adding only
the compatibility string to the driver
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Fri, 9 May 2014 13:37:34 +0000 (19:07 +0530)]
drivers: net: cpsw-phy-sel: add dra7xx support for phy sel
Add dra7xx support for selecting the phy mode which is present in control
module of dra7xx SoC
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Fri, 9 May 2014 13:37:33 +0000 (19:07 +0530)]
drivers: net: cpsw-phy-sel: initialize priv->dev
priv->dev is uninitialized, initializing with pdev->dev
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Fri, 9 May 2014 08:49:05 +0000 (16:49 +0800)]
sch_hhf: fix comparison of qlen and limit
When I use the following command, eth0 cannot send any packets.
#tc qdisc add dev eth0 root handle 1: hhf limit 1
Because qlen need be smaller than limit, all packets were dropped.
Fix this by qlen *<=* limit.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Fri, 9 May 2014 06:58:05 +0000 (14:58 +0800)]
vlan: rename __vlan_find_dev_deep() to __vlan_find_dev_deep_rcu()
The __vlan_find_dev_deep should always called in RCU, according
David's suggestion, rename to __vlan_find_dev_deep_rcu looks more
reasonable.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Sun, 4 May 2014 23:39:18 +0000 (16:39 -0700)]
net: rename local_df to ignore_df
As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 12 May 2014 17:19:14 +0000 (13:19 -0400)]
Merge git://git./linux/kernel/git/davem/net
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 12 May 2014 16:48:25 +0000 (12:48 -0400)]
Merge branch 'cxgb4'
Hariprasad Shenai says:
====================
Misc. fixes for cxgb4 and cxgb4vf driver
This series of patch provides fixes for cxgb4 and cxgb4vf driver related to
rx checksum counter and decodes module type a bit more for ethtool output.
The patches series is created against David Miller's 'net-next' tree.
We would like to request this patch series to get merged via David Miller's
'net-next' tree.
We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Wed, 7 May 2014 12:31:04 +0000 (18:01 +0530)]
cxgb4vf: Check if rx checksum offload is enabled, while reading hardware calculated checksum
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Wed, 7 May 2014 12:31:03 +0000 (18:01 +0530)]
cxgb4: Check if rx checksum offload is enabled, while reading hardware calculated checksum
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad Shenai [Wed, 7 May 2014 12:31:02 +0000 (18:01 +0530)]
cxgb4: Decode the firmware port and module type a bit more for ethtool
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Duan Jiong [Fri, 9 May 2014 05:31:43 +0000 (13:31 +0800)]
ipv6: remove parameter rt from fib6_prune_clones()
the parameter rt will be assigned to c.arg in function fib6_clean_tree(),
but function fib6_prune_clone() doesn't use c.arg, so we can remove it
safely.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Thu, 8 May 2014 22:14:10 +0000 (15:14 -0700)]
Add support for netvsc build without CONFIG_SYSFS flag
This change ensures the driver can be built successfully without the
CONFIG_SYSFS flag.
MS-TFS: 182270
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Thu, 8 May 2014 21:54:42 +0000 (14:54 -0700)]
ptp: fix kconfig dependency warnings
Fix kconfig warnings:
PTP_1588_CLOCK selects NET_PTP_CLASSIFY, which depends on NET,
so PTP_1588_CLOCK should also depend on NET.
PTP_1588_CLOCK_PCH selects PTP_1588_CLOCK so the former should
depend on NET.
warning: (IXP4XX_ETH && PTP_1588_CLOCK) selects NET_PTP_CLASSIFY which has unmet direct dependencies (NET)
warning: (SFC && TILE_NET && BFIN_MAC_USE_HWSTAMP && TIGON3 && FEC && E1000E && IGB && IXGBE && I40E && MLX4_EN && SXGBE_ETH && STMMAC_ETH && TI_CPTS && PTP_1588_CLOCK_GIANFAR && PTP_1588_CLOCK_IXP46X && DP83640_PHY && PTP_1588_CLOCK_PCH) selects PTP_1588_CLOCK which has unmet direct dependencies (NET)
[This warning is caused by the new 'depends on NET' in PTP_1588_CLOCK.]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 12 May 2014 04:25:51 +0000 (00:25 -0400)]
Merge branch 'filter-next'
Alexei Starovoitov says:
====================
BPF testsuite and cleanup
This patchset adds BPF testsuite and improves readability of classic
to internal BPF converter.
The testsuite helped to find 'negative offset bug' in x64 JIT that was
fixed by commit
fdfaf64e ("x86: bpf_jit: support negative offsets")
It can be very useful for classic and internal JIT compiler developers.
Also it serves as performance benchmark.
x86_64/i386 pass all tests with and without JIT. arm32 JIT is failing
negative offset tests which are unsupported.
Internal BPF tests are much larger than classic tests to cover different
combinations of registers. Negative tests check correctness of classic
BPF verifier which must reject them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 8 May 2014 21:10:53 +0000 (14:10 -0700)]
net: filter: additional BPF tests
All tests should pass with and without JIT.
Example output:
test_bpf: #0 TAX 35 16 16 PASS
test_bpf: #1 TXA 7 7 7 PASS
test_bpf: #2 ADD_SUB_MUL_K 10 PASS
test_bpf: #3 DIV_KX 33 PASS
test_bpf: #4 AND_OR_LSH_K 10 10 PASS
test_bpf: #5 LD_IND 8 8 8 PASS
test_bpf: #6 LD_ABS 8 8 8 PASS
test_bpf: #7 LD_ABS_LL 13 14 PASS
test_bpf: #8 LD_IND_LL 12 12 12 PASS
test_bpf: #9 LD_ABS_NET 10 12 PASS
test_bpf: #10 LD_IND_NET 11 12 12 PASS
...
Numbers are times in nsec per filter for given input data.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 8 May 2014 21:10:52 +0000 (14:10 -0700)]
net: filter: BPF testsuite
The testsuite covers classic and internal BPF instructions.
It is particularly useful for JIT compiler developers.
Adds to "net" selftest target.
The testsuite can be used as a set of micro-benchmarks.
It measures execution time of each BPF program in nsec.
This patch adds core framework.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Thu, 8 May 2014 21:10:51 +0000 (14:10 -0700)]
net: filter: make BPF conversion more readable
Introduce BPF helper macros to define instructions
(similar to old BPF_STMT/BPF_JUMP macros)
Use them while converting classic BPF to internal
and in BPF testsuite later.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 9 May 2014 20:46:53 +0000 (16:46 -0400)]
Merge branch 'for-davem' of git://git./linux/kernel/git/linville/wireless
John W. Linville says:
====================
pull request: wireless 2014-05-08
This one is all from Johannes:
"Here are a few small fixes for the current cycle: radiotap TX flags were
wrong (fix by Bob), Chun-Yeow fixes an SMPS issue with mesh interfaces,
Eliad fixes a locking bug and a cfg80211 state problem and finally
Henning sent me a fix for IBSS rate information."
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 9 May 2014 20:41:16 +0000 (16:41 -0400)]
Merge branch 'sctp_sysctl'
Wang Weidong says:
====================
sctp: fix kfree static array pointer in sctp_sysctl_net_unregister
patch #1 revert the
efb842c45("sctp: optimize the sctp_sysctl_net_register")
patch #2 add a checking for sctp_sysctl_net_register
====================
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Thu, 8 May 2014 12:55:02 +0000 (20:55 +0800)]
sctp: add a checking for sctp_sysctl_net_register
When register_net_sysctl failed, we should free the
sysctl_table.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Thu, 8 May 2014 12:55:01 +0000 (20:55 +0800)]
Revert "sctp: optimize the sctp_sysctl_net_register"
This revert commit
efb842c45("sctp: optimize the sctp_sysctl_net_register"),
Since it doesn't kmemdup a sysctl_table for init_net, so the
init_net->sctp.sysctl_header->ctl_table_arg points to sctp_net_table
which is a static array pointer. So when doing sctp_sysctl_net_unregister,
it will free sctp_net_table, then we will get a NULL pointer dereference
like that:
[ 262.948220] BUG: unable to handle kernel NULL pointer dereference at
000000000000006c
[ 262.948232] IP: [<
ffffffff81144b70>] kfree+0x80/0x420
[ 262.948260] PGD
db80a067 PUD
dae12067 PMD 0
[ 262.948268] Oops: 0000 [#1] SMP
[ 262.948273] Modules linked in: sctp(-) crc32c_generic libcrc32c
...
[ 262.948338] task:
ffff8800db830190 ti:
ffff8800dad00000 task.ti:
ffff8800dad00000
[ 262.948344] RIP: 0010:[<
ffffffff81144b70>] [<
ffffffff81144b70>] kfree+0x80/0x420
[ 262.948353] RSP: 0018:
ffff8800dad01d88 EFLAGS:
00010046
[ 262.948358] RAX:
0100000000000000 RBX:
ffffffffa0227940 RCX:
ffffea0000707888
[ 262.948363] RDX:
ffffea0000707888 RSI:
0000000000000001 RDI:
ffffffffa0227940
[ 262.948369] RBP:
ffff8800dad01de8 R08:
0000000000000000 R09:
ffff8800d9e983a9
[ 262.948374] R10:
0000000000000000 R11:
0000000000000000 R12:
ffffffffa0227940
[ 262.948380] R13:
ffffffff8187cfc0 R14:
0000000000000000 R15:
ffffffff8187da10
[ 262.948386] FS:
00007fa2a2658700(0000) GS:
ffff880112800000(0000) knlGS:
0000000000000000
[ 262.948394] CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
[ 262.948400] CR2:
000000000000006c CR3:
00000000cddc0000 CR4:
00000000000006e0
[ 262.948410] Stack:
[ 262.948413]
ffff8800dad01da8 0000000000000286 0000000020227940 ffffffffa0227940
[ 262.948422]
ffff8800dad01dd8 ffffffff811b7fa1 ffffffffa0227940 ffffffffa0227940
[ 262.948431]
ffffffff8187d960 ffffffff8187cfc0 ffffffff8187d960 ffffffff8187da10
[ 262.948440] Call Trace:
[ 262.948457] [<
ffffffff811b7fa1>] ? unregister_sysctl_table+0x51/0xa0
[ 262.948476] [<
ffffffffa020d1a1>] sctp_sysctl_net_unregister+0x21/0x30 [sctp]
[ 262.948490] [<
ffffffffa020ef6d>] sctp_net_exit+0x12d/0x150 [sctp]
[ 262.948512] [<
ffffffff81394f49>] ops_exit_list+0x39/0x60
[ 262.948522] [<
ffffffff813951ed>] unregister_pernet_operations+0x3d/0x70
[ 262.948530] [<
ffffffff81395292>] unregister_pernet_subsys+0x22/0x40
[ 262.948544] [<
ffffffffa020efcc>] sctp_exit+0x3c/0x12d [sctp]
[ 262.948562] [<
ffffffff810c5e04>] SyS_delete_module+0x194/0x210
[ 262.948577] [<
ffffffff81240fde>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 262.948587] [<
ffffffff815217a2>] system_call_fastpath+0x16/0x1b
With this revert, it won't occur the Oops.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>