David S. Miller [Mon, 15 Jun 2015 22:53:50 +0000 (15:53 -0700)]
Merge branch 'bpf-share-helpers'
Alexei Starovoitov says:
====================
v1->v2: switched to init_user_ns from current_user_ns as suggested by Andy
Introduce new helpers to access 'struct task_struct'->pid, tgid, uid, gid, comm
fields in tracing and networking.
Share bpf_trace_printk() and bpf_get_smp_processor_id() helpers between
tracing and networking.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Sat, 13 Jun 2015 02:39:14 +0000 (19:39 -0700)]
bpf: let kprobe programs use bpf_get_smp_processor_id() helper
It's useful to do per-cpu histograms.
Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Sat, 13 Jun 2015 02:39:13 +0000 (19:39 -0700)]
bpf: allow networking programs to use bpf_trace_printk() for debugging
bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's a DEBUG ONLY helper. If it's used in the program,
the kernel will print a warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Sat, 13 Jun 2015 02:39:12 +0000 (19:39 -0700)]
bpf: introduce current->pid, tgid, uid, gid, comm accessors
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:
u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid
u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid
bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf
They can be used from the programs attached to TC as well to classify packets
based on current task fields.
Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
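The packing used by bpf_get_current_pid_tgid() can be illustrated with a small userspace sketch. The function names below are hypothetical and are not part of the kernel API; only the layout (tgid in the upper 32 bits, pid in the lower 32 bits) is taken from the description above.

```c
#include <stdint.h>

/* Hypothetical helper: combine tgid and pid into one 64-bit value,
 * tgid in the upper 32 bits and pid in the lower 32 bits. */
uint64_t sketch_pack_pid_tgid(uint32_t tgid, uint32_t pid)
{
    return ((uint64_t)tgid << 32) | pid;
}

/* Hypothetical helper: recover the tgid from the packed value. */
uint32_t sketch_unpack_tgid(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Hypothetical helper: recover the pid from the packed value. */
uint32_t sketch_unpack_pid(uint64_t v)
{
    return (uint32_t)v;
}
```

A program that only wants to filter on the process would compare the upper half, while a per-thread filter would compare the lower half.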
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 Jun 2015 21:30:32 +0000 (14:30 -0700)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
This is a rather large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:
1) Warn the user when there is a better protocol conntracker available, from
Marcelo Ricardo Leitner.
2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
Thaler. This comes with several patches to prepare the change in the first place.
3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
is not needed anymore since now we use the largest fragment size to
refragment, from Florian Westphal.
4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
follow up patch to refine it from Eric Dumazet.
6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
7) Get rid of parens in Netfilter Kconfig files.
8) Attach the net_device to the basechain as opposed to the initial per table
approach in the nf_tables netdev family.
9) Subscribe to netdev events to detect the removal and registration of a
device that is referenced by a basechain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso [Mon, 15 Jun 2015 10:12:01 +0000 (12:12 +0200)]
netfilter: nf_tables_netdev: unregister hooks on net_device removal
In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.
This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 15 Jun 2015 00:42:31 +0000 (02:42 +0200)]
netfilter: nf_tables: add nft_register_basechain() and nft_unregister_basechain()
These wrapper functions take care of hook registration for basechains.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Fri, 12 Jun 2015 11:55:41 +0000 (13:55 +0200)]
netfilter: nf_tables: attach net_device to basechain
The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.
This patch reworks
ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to
net_device").
Note that this adds a dev_name field to the nft_base_chain structure, which is
required by the netdev notification subscription that follows in a later patch to
handle net_devices that are gone.
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Eric Dumazet [Mon, 15 Jun 2015 16:57:30 +0000 (09:57 -0700)]
netfilter: x_tables: remove XT_TABLE_INFO_SZ and a dereference.
After Florian's patches, there is no need for XT_TABLE_INFO_SZ anymore:
Only one copy of the table is kept, instead of one copy per cpu.
We can also avoid a dereference if we put the table data right after
xt_table_info. It reduces register pressure and helps the compiler.
Then, we attempt a kmalloc() if total size is under order-3 allocation,
to reduce TLB pressure, as in many cases, rules fit in 32 KB.
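The size threshold mentioned above can be sketched as follows. This is a userspace illustration only: the 4 KB page size, the constant names and the exact comparison ("at most" an order-3 block) are assumptions, not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE    4096u /* assumption: 4 KB pages */
#define SKETCH_COSTLY_ORDER 3u    /* order-3 = 8 contiguous pages */

/* Size in bytes of an allocation of the given page order. */
size_t sketch_order_bytes(unsigned int order)
{
    return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Hypothetical policy: try the physically contiguous allocator first
 * when the whole table fits in an order-3 block (32 KB with 4 KB
 * pages); larger tables would fall back to another allocator. */
bool sketch_prefer_kmalloc(size_t total)
{
    return total <= sketch_order_bytes(SKETCH_COSTLY_ORDER);
}
```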
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Mon, 15 Jun 2015 16:31:22 +0000 (18:31 +0200)]
Merge branch 'master' of git://blackhole.kfki.hu/nf-next
Jozsef Kadlecsik says:
====================
ipset patches for nf-next
Please consider applying the next bunch of patches for ipset. First
come the small changes, then the bugfixes and at the end the RCU
related patches.
* Use MSEC_PER_SEC consistently instead of the number.
* Use SET_WITH_*() helpers to test set extensions from Sergey Popovich.
* Check extensions attributes before getting extensions from Sergey Popovich.
* Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich.
* Make sure we always return line number on batch in the case of error
from Sergey Popovich.
* Check CIDR value only when attribute is given from Sergey Popovich.
* Fix cidr handling for hash:*net* types, reported by Jonathan Johnson.
* Fix parallel resizing and listing of the same set so that the original
set is kept for the whole dumping.
* Make sure listing doesn't grab a set which is just being destroyed.
* Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU.
* Replace rwlock_t with spinlock_t in "struct ip_set", change the locking
in the core and simplifications in the timeout routines.
* Introduce RCU locking in bitmap:* types with a slight modification in the
logic on how an element is added.
* Introduce RCU locking in hash:* types. This is the most complex part of
the changes.
* Introduce RCU locking in list type where standard rculist is used.
* Fix coding styles reported by checkpatch.pl.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Fri, 12 Jun 2015 11:58:52 +0000 (13:58 +0200)]
netfilter: Kconfig: get rid of parens around depends on
According to the reporter, they are not needed.
Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Kenneth Klette Jonassen [Fri, 12 Jun 2015 15:24:03 +0000 (17:24 +0200)]
tcp: cdg: use div_u64()
Fixes cross-compile to mips.
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jozsef Kadlecsik [Sat, 13 Jun 2015 17:45:33 +0000 (19:45 +0200)]
netfilter: ipset: Fix coding styles reported by checkpatch.pl
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 14:56:02 +0000 (16:56 +0200)]
netfilter: ipset: Introduce RCU locking in list type
Standard rculist is used.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 15:29:56 +0000 (17:29 +0200)]
netfilter: ipset: Introduce RCU locking in hash:* types
Three types of data need to be protected in the case of the hash types:
a. The hash buckets: standard rcu pointer operations are used.
b. The element blobs in the hash buckets are stored in an array and
a bitmap is used for book-keeping to tell which elements in the array
are used or free.
c. Networks per cidr values and the cidr values themselves are stored
in fixed-size arrays and need no protection. The values are modified
in such an order that in the worst case an element testing is repeated
once with the same cidr value.
The ipset hash approach uses arrays instead of lists and therefore is
incompatible with rhashtable.
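Point (b) above, an element array with a bitmap for book-keeping, can be sketched in plain C. This is a single-threaded illustration with hypothetical names and a hypothetical bucket capacity; it contains no RCU primitives or memory barriers, which the real code needs, and it only shows the ordering idea of filling a slot before marking it as used.

```c
#include <stdint.h>

#define SKETCH_SLOTS 4 /* hypothetical bucket capacity */

struct sketch_bucket {
    unsigned long used;           /* bit i set => value[i] is valid */
    uint32_t value[SKETCH_SLOTS]; /* element blobs stored in an array */
};

/* Store v in the first free slot; return the slot index or -1 if full. */
int sketch_bucket_add(struct sketch_bucket *b, uint32_t v)
{
    for (int i = 0; i < SKETCH_SLOTS; i++) {
        if (!(b->used & (1UL << i))) {
            b->value[i] = v;     /* write the element first */
            b->used |= 1UL << i; /* then mark the slot as used */
            return i;
        }
    }
    return -1;
}

/* Return 1 if v is stored in a used slot, 0 otherwise. */
int sketch_bucket_test(const struct sketch_bucket *b, uint32_t v)
{
    for (int i = 0; i < SKETCH_SLOTS; i++)
        if ((b->used & (1UL << i)) && b->value[i] == v)
            return 1;
    return 0;
}

/* Free slot i by clearing its bit; the stale value is simply ignored. */
void sketch_bucket_del(struct sketch_bucket *b, int i)
{
    b->used &= ~(1UL << i);
}
```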
Performance is tested by Jesper Dangaard Brouer:
Simple drop in FORWARD
~~~~~~~~~~~~~~~~~~~~~~
Dropping via simple iptables net-mask match::
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -s 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple
Drop performance in "raw": 11.3Mpps
Generator: sending 12.2Mpps (tx: 12264083 pps)
Drop via original ipset in RAW table
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Create a set with lots of elements::
sudo ./ipset destroy test
echo "create test hash:ip hashsize 65536" > test.set
for x in `seq 0 255`; do
for y in `seq 0 255`; do
echo "add test 198.18.$x.$y" >> test.set
done
done
sudo ./ipset restore < test.set
Dropping via ipset::
iptables -t raw -F
iptables -t raw -N net198 || iptables -t raw -F net198
iptables -t raw -I net198 -m set --match-set test src -j DROP
iptables -t raw -I PREROUTING -j net198
Drop performance in "raw" with ipset: 8Mpps
Perf report numbers ipset drop in "raw"::
+ 24.65% ksoftirqd/1 [ip_set] [k] ip_set_test
- 21.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_lock_bh
- _raw_read_lock_bh
+ 99.88% ip_set_test
- 19.42% ksoftirqd/1 [kernel.kallsyms] [k] _raw_read_unlock_bh
- _raw_read_unlock_bh
+ 99.72% ip_set_test
+ 4.31% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_kadt
+ 2.27% ksoftirqd/1 [ixgbe] [k] ixgbe_fetch_rx_buffer
+ 2.18% ksoftirqd/1 [ip_tables] [k] ipt_do_table
+ 1.81% ksoftirqd/1 [ip_set_hash_ip] [k] hash_ip4_test
+ 1.61% ksoftirqd/1 [kernel.kallsyms] [k] __netif_receive_skb_core
+ 1.44% ksoftirqd/1 [kernel.kallsyms] [k] build_skb
+ 1.42% ksoftirqd/1 [kernel.kallsyms] [k] ip_rcv
+ 1.36% ksoftirqd/1 [kernel.kallsyms] [k] __local_bh_enable_ip
+ 1.16% ksoftirqd/1 [kernel.kallsyms] [k] dev_gro_receive
+ 1.09% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_unlock
+ 0.96% ksoftirqd/1 [ixgbe] [k] ixgbe_clean_rx_irq
+ 0.95% ksoftirqd/1 [kernel.kallsyms] [k] __netdev_alloc_frag
+ 0.88% ksoftirqd/1 [kernel.kallsyms] [k] kmem_cache_alloc
+ 0.87% ksoftirqd/1 [xt_set] [k] set_match_v3
+ 0.85% ksoftirqd/1 [kernel.kallsyms] [k] inet_gro_receive
+ 0.83% ksoftirqd/1 [kernel.kallsyms] [k] nf_iterate
+ 0.76% ksoftirqd/1 [kernel.kallsyms] [k] put_compound_page
+ 0.75% ksoftirqd/1 [kernel.kallsyms] [k] __rcu_read_lock
Drop via ipset in RAW table with RCU-locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With RCU locking, the RW-lock is gone.
Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 12:39:59 +0000 (14:39 +0200)]
netfilter: ipset: Introduce RCU locking in bitmap:* types
There's not much required because the bitmap types use atomic
bit operations. However, the logic of adding elements changed slightly:
first the MAC address is updated (which is not atomic), then the element
is activated (added). The extensions may call kfree_rcu(), therefore we
call rcu_barrier() at module removal.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 12:22:25 +0000 (14:22 +0200)]
netfilter: ipset: Prepare the ipset core to use RCU at set level
Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking
accordingly. Convert the comment extension into an rcu-aware object. Also,
simplify the timeout routines.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 12:02:51 +0000 (14:02 +0200)]
netfilter:ipset Remove rbtree from hash:net,iface
Remove rbtree in order to introduce RCU instead of rwlock in ipset
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 11:39:38 +0000 (13:39 +0200)]
netfilter: ipset: Make sure listing doesn't grab a set which is just being destroyed.
There was a small window while all sets were being destroyed in which a
concurrent listing of all sets could grab a set that was just being destroyed.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Sat, 13 Jun 2015 09:59:45 +0000 (11:59 +0200)]
netfilter: ipset: Fix parallel resizing and listing of the same set
When elements are added to a hash:* type of set and resizing is triggered,
a parallel listing could start to list the original set (before resizing)
and "continue" with listing the new set. Fix it by taking references and
using the original hash table for listing. Therefore the original hash
table may be destroyed from either the resizing or the listing function.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Fri, 12 Jun 2015 20:11:00 +0000 (22:11 +0200)]
netfilter: ipset: Fix cidr handling for hash:*net* types
Commit "Simplify cidr handling for hash:*net* types" broke the cidr
handling for the hash:*net* types when the sets were used by the SET
target: entries with invalid cidr values were added to the sets.
Reported by Jonathan Johnson.
Testsuite entry is added to verify the fix.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Sergey Popovich [Fri, 12 Jun 2015 19:30:57 +0000 (21:30 +0200)]
netfilter: ipset: Check CIDR value only when attribute is given
There is no reason to check the CIDR value unless the attribute
specifying CIDR is given.
Initialize the cidr array in the element structure at its
declaration to give the compiler more freedom to optimize the
initialization right before the element structure is used.
Remove local variables cidr and cidr2 for netnet and netportnet
hashes as we do not use packed cidr value for such set types and
can store value directly in e.cidr[].
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Sergey Popovich [Fri, 12 Jun 2015 19:26:43 +0000 (21:26 +0200)]
netfilter: ipset: Make sure we always return line number on batch
Even if we return with the generic IPSET_ERR_PROTOCOL, it is a good idea
to return the line number if we are called in batch mode.
Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For
example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED
or IPSET_ERR_INVALID_CIDR.
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Sergey Popovich [Fri, 12 Jun 2015 19:23:31 +0000 (21:23 +0200)]
netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6
Permit userspace to supply CIDR length equal to the host address CIDR
length in netlink message. Prohibit any other CIDR length for IPv6
variant of the set.
Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic
-IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when
IPSET_ATTR_IP_TO attribute is given.
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Sergey Popovich [Fri, 12 Jun 2015 19:14:09 +0000 (21:14 +0200)]
netfilter: ipset: Check extensions attributes before getting extensions.
Perform all extension attribute checks within ip_set_get_extensions()
and reduce the amount of duplicated code.
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Sergey Popovich [Fri, 12 Jun 2015 19:11:54 +0000 (21:11 +0200)]
netfilter: ipset: Use SET_WITH_*() helpers to test set extensions
Signed-off-by: Sergey Popovich <popovich_sergei@mail.ua>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Jozsef Kadlecsik [Fri, 12 Jun 2015 19:07:54 +0000 (21:07 +0200)]
netfilter: ipset: Use MSEC_PER_SEC consistently
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
David S. Miller [Sun, 14 Jun 2015 06:56:52 +0000 (23:56 -0700)]
Merge git://git./linux/kernel/git/davem/net
Linus Torvalds [Sat, 13 Jun 2015 06:54:16 +0000 (20:54 -1000)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix uninitialized struct station_info in cfg80211_wireless_stats(),
from Johannes Berg.
2) Revert the commit that attempted to fix ipv6 protocol resubmission; it adds
regressions.
3) Endless loops can be created in bridge port lists, fix from Nikolay
Aleksandrov.
4) Don't WARN_ON() if sk->sk_forward_alloc is non-zero in
sk_clear_memalloc, it is a legal situation during swap deactivation.
Fix from Mel Gorman.
5) Fix order of disabling interrupts and unlocking NAPI in enic driver
to avoid a race. From Govindarajulu Varadarajan.
6) High and low register writes are swapped when programming the start
of periodic output in igb driver. From Richard Cochran.
7) Fix device rename handling in mpls stack, from Robert Shearman.
8) Do not trigger compaction synchronously when optimistically trying
to allocate an order 3 page in alloc_skb_with_frags() and
skb_page_frag_refill(). From Shaohua Li.
9) Authentication with COOKIE_ECHO is not handled properly in SCTP, fix
from Marcelo Ricardo Leitner.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
net: don't wait for order-3 page allocation
mpls: handle device renames for per-device sysctls
net: igb: fix the start time for periodic output signals
enic: fix memory leak in rq_clean
enic: check return value for stat dump
enic: unlock napi busy poll before unmasking intr
net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap
bridge: fix multicast router rlist endless loop
tipc: disconnect socket directly after probe failure
Revert "ipv6: Fix protocol resubmission"
cfg80211: wext: clear sinfo struct before calling driver
Eric Dumazet [Sat, 13 Jun 2015 02:34:03 +0000 (19:34 -0700)]
tcp: tcp_v6_connect() cleanup
Remove dead code from tcp_v6_connect()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 13 Jun 2015 02:31:32 +0000 (19:31 -0700)]
flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs
__skb_header_pointer() returns a pointer that must be checked.
Fixes infinite loop reported by Alexei, and add __must_check to
catch these errors earlier.
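The pattern being enforced here can be sketched in userspace C. The names below are hypothetical stand-ins, not the kernel functions; the attribute shown is the GCC/Clang warn_unused_result attribute, which makes the compiler warn when a caller discards the returned pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang attribute; callers that ignore the result get a warning. */
#define SKETCH_MUST_CHECK __attribute__((warn_unused_result))

/* Hypothetical stand-in: return a pointer to len bytes at offset,
 * or NULL when the packet is too short to contain them. */
SKETCH_MUST_CHECK
const uint8_t *sketch_header_pointer(const uint8_t *pkt, size_t pkt_len,
                                     size_t offset, size_t len)
{
    if (offset > pkt_len || len > pkt_len - offset)
        return NULL;
    return pkt + offset;
}

/* A caller that checks the result before dereferencing it. Returns the
 * first byte of the 2-byte header at offset, or -1 if it is truncated. */
int sketch_read_nexthdr(const uint8_t *pkt, size_t pkt_len, size_t offset)
{
    const uint8_t *hdr = sketch_header_pointer(pkt, pkt_len, offset, 2);

    if (!hdr)
        return -1;
    return hdr[0];
}
```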
Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs")
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Sat, 13 Jun 2015 01:11:50 +0000 (18:11 -0700)]
Fix Cavium Liquidio build related errors and warnings
1) Fixed following sparse warnings:
lio_main.c:213:6: warning: symbol 'octeon_droq_bh' was not
declared. Should it be static?
lio_main.c:233:5: warning: symbol 'lio_wait_for_oq_pkts' was
not declared. Should it be static?
lio_main.c:3083:5: warning: symbol 'lio_nic_info' was not
declared. Should it be static?
lio_main.c:2618:16: warning: cast from restricted __be16
octeon_device.c:466:6: warning: symbol 'oct_set_config_info'
was not declared. Should it be static?
octeon_device.c:573:25: warning: cast to restricted __be32
octeon_device.c:582:29: warning: cast to restricted __be32
octeon_device.c:584:39: warning: cast to restricted __be32
octeon_device.c:594:13: warning: cast to restricted __be32
octeon_device.c:596:25: warning: cast to restricted __be32
octeon_device.c:613:25: warning: cast to restricted __be32
octeon_device.c:614:29: warning: cast to restricted __be64
octeon_device.c:615:29: warning: cast to restricted __be32
octeon_device.c:619:37: warning: cast to restricted __be32
octeon_device.c:623:33: warning: cast to restricted __be32
cn66xx_device.c:540:6: warning: symbol
'lio_cn6xxx_get_pcie_qlmport' was not declared. Should it be s
octeon_mem_ops.c:181:16: warning: cast to restricted __be64
octeon_mem_ops.c:190:16: warning: cast to restricted __be32
octeon_mem_ops.c:196:17: warning: incorrect type in initializer
2) Fix build errors corresponding to vmalloc on linux-next 4.1.
3) Liquidio now supports 64 bit only, modified Kconfig accordingly.
4) Fix some code alignment issues based on kernel build warnings.
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 12 Jun 2015 21:24:28 +0000 (14:24 -0700)]
Merge branch 'flow_dissector-next'
Tom Herbert says:
====================
flow_dissector: Fix MPLS parsing and add ext hdr support
Need to shift label. Added parsing of dst, hop-by-hop, and routing
extension headers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 12 Jun 2015 16:01:06 +0000 (09:01 -0700)]
flow_dissector: add support for dst, hop-by-hop and routing ext hdrs
If dst, hop-by-hop or routing extension headers are present determine
length of the options and skip over them in flow dissection.
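Skipping these headers can be sketched as below. This is a userspace illustration with hypothetical names, not the dissector code. It relies on the IPv6 extension header layout: byte 0 is the next header value and byte 1 is the length in 8-octet units, not counting the first 8 octets. It also checks bounds on every step so that a truncated packet ends the walk.

```c
#include <stddef.h>
#include <stdint.h>

/* IPv6 next-header values for the three extension headers handled. */
enum {
    SKETCH_NEXTHDR_HOP     = 0,
    SKETCH_NEXTHDR_ROUTING = 43,
    SKETCH_NEXTHDR_DEST    = 60,
};

/* Starting at offset off, where *proto names the header found there,
 * skip hop-by-hop, routing and destination option headers. Returns the
 * offset of the first other header and stores its protocol in *proto,
 * or returns -1 (leaving *proto untouched) if the packet is truncated. */
long sketch_skip_ext_hdrs(const uint8_t *pkt, size_t len, size_t off,
                          uint8_t *proto)
{
    uint8_t nh = *proto;

    while (nh == SKETCH_NEXTHDR_HOP || nh == SKETCH_NEXTHDR_ROUTING ||
           nh == SKETCH_NEXTHDR_DEST) {
        size_t optlen;

        if (off > len || len - off < 2)
            return -1;
        optlen = ((size_t)pkt[off + 1] + 1) * 8;
        if (len - off < optlen)
            return -1;
        nh = pkt[off];
        off += optlen;
    }
    *proto = nh;
    return (long)off;
}
```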
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 12 Jun 2015 16:01:05 +0000 (09:01 -0700)]
flow_dissector: Fix MPLS entropy label handling in flow dissector
Need to shift after masking to get label value for comparison.
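Why the shift matters can be shown with a short sketch. It assumes the standard MPLS label stack entry layout (20-bit label in the top bits, followed by 3 traffic-class bits, 1 bottom-of-stack bit and 8 TTL bits) and the entropy label indicator value 7; the macro and function names are hypothetical, and the entry is taken to be in host byte order already.

```c
#include <stdint.h>

#define SKETCH_LS_LABEL_MASK  0xFFFFF000u /* upper 20 bits hold the label */
#define SKETCH_LS_LABEL_SHIFT 12
#define SKETCH_LABEL_ENTROPY  7u          /* entropy label indicator */

/* Extract the label from a label stack entry: mask, then shift down. */
uint32_t sketch_mpls_label(uint32_t entry)
{
    return (entry & SKETCH_LS_LABEL_MASK) >> SKETCH_LS_LABEL_SHIFT;
}

/* Compare the extracted label against the entropy label indicator. */
int sketch_is_entropy_indicator(uint32_t entry)
{
    return sketch_mpls_label(entry) == SKETCH_LABEL_ENTROPY;
}
```

Comparing the masked but unshifted value against 7 can never match, because the masked value always has its low 12 bits clear.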
Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Masanari Iida [Fri, 12 Jun 2015 15:23:21 +0000 (00:23 +0900)]
Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt
This patch fixes the URL (http to https) for wiki.wireshark.org.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Fri, 12 Jun 2015 10:12:22 +0000 (12:12 +0200)]
net: ipv4: un-inline ip_finish_output2
text data bss dec hex filename
old: 16527 44 0 16571 40bb net/ipv4/ip_output.o
new: 14935 44 0 14979 3a83 net/ipv4/ip_output.o
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Thu, 11 Jun 2015 17:49:46 +0000 (14:49 -0300)]
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
Currently, we can ask to authenticate DATA chunks and we can send DATA
chunks on the same packet as COOKIE_ECHO, but if you try to combine
both, the DATA chunk will be sent unauthenticated and peer won't accept
it, leading to a communication failure.
This happens because even though the data was queued after it was
requested to authenticate DATA chunks, it was also queued before we
could know that remote peer can handle authenticating, so
sctp_auth_send_cid() returns false.
The fix is, whenever we set up an active key, to re-check the send queue for
chunks that should now be authenticated. As a result, such a packet will
now contain COOKIE_ECHO + AUTH + DATA chunks, in that order.
Reported-by: Liu Wei <weliu@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 12 Jun 2015 18:35:19 +0000 (11:35 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"Remember about a week ago when I sent the last pull request for 4.1?
Well, I lied. Now, I don't want to shift the blame, but Dan, Ming,
and Richard made a liar out of me.
Here are three small patches that should go into 4.1. More
specifically, this pull request contains:
- A Kconfig dependency for the pmem block driver, so it can't be
selected if HAS_IOMEM isn't available. From Richard Weinberger.
- A fix for genhd, making the ext_devt_lock softirq safe. This makes
lockdep happier, since we also end up grabbing this lock on release
off the softirq path. From Dan Williams.
- A blk-mq software queue release fix from Ming Lei.
Last two are headed to stable, first fixes an issue introduced in this
cycle"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: pmem: Add dependency on HAS_IOMEM
block: fix ext_dev_lock lockdep report
blk-mq: free hctx->ctxs in queue's release handler
Linus Torvalds [Fri, 12 Jun 2015 18:33:03 +0000 (11:33 -0700)]
Merge tag 'md/4.1-rc7-fixes' of git://neil.brown.name/md
Pull three more md fixes from Neil Brown:
"Hasn't been a good cycle for md has it :-(
The main issue fixed here is a rare race which can result in two
reshape threads running at once, which doesn't end well.
Also a minor issue with a write to a sysfs file returning the wrong
value. Backports to 4.0-stable are indicated"
* tag 'md/4.1-rc7-fixes' of git://neil.brown.name/md:
md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
md: Close race when setting 'action' to 'idle'.
md: don't return 0 from array_state_store
Linus Torvalds [Fri, 12 Jun 2015 18:28:57 +0000 (11:28 -0700)]
Merge git://git.infradead.org/intel-iommu
Pull VT-d hardware workarounds from David Woodhouse:
"This contains a workaround for hardware issues which I *thought* were
never going to be seen on production hardware. I'm glad I checked
that before the 4.1 release...
Firstly, PASID support is so broken on existing chips that we're just
going to declare the old capability bit 28 as 'reserved' and change
the VT-d spec to move PASID support to another bit. So any existing
hardware doesn't support SVM; it only sets that (now) meaningless bit
28.
That patch *wasn't* imperative for 4.1 because we don't have PASID
support yet. But *even* the extended context tables are broken — if
you just enable the wider tables and use none of the new bits in them,
which is precisely what 4.1 does, you find that translations don't
work. It's this problem which I thought was caught in time to be
fixed before production, but wasn't.
To avoid triggering this issue, we now *only* enable the extended
context tables on hardware which also advertises "we have PASID
support and we actually tested it this time" with the new PASID
feature bit.
In addition, I've added an 'intel_iommu=ecs_off' command line
parameter to allow us to disable it manually if we need to"
* git://git.infradead.org/intel-iommu:
iommu/vt-d: Only enable extended context tables if PASID is supported
iommu/vt-d: Change PASID support to bit 40 of Extended Capability Register
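The decision described in this pull request can be sketched as follows. The names are hypothetical, and the extended-context capability is passed in as a plain flag because its register bit is not stated above; only the PASID bit positions (old bit 28, new bit 40) come from the text.

```c
#include <stdint.h>

#define SKETCH_ECAP_PASID_OLD_BIT 28 /* now treated as reserved */
#define SKETCH_ECAP_PASID_BIT     40 /* new PASID feature bit */

/* Return 1 if the extended capability value advertises PASID support
 * through the new bit; the old bit 28 is deliberately ignored. */
int sketch_ecap_has_pasid(uint64_t ecap)
{
    return (int)((ecap >> SKETCH_ECAP_PASID_BIT) & 1);
}

/* Hypothetical policy: use extended context tables only when the
 * hardware supports them, advertises PASID via the new bit, and the
 * user has not disabled them (as with intel_iommu=ecs_off). */
int sketch_use_extended_context(uint64_t ecap, int ecs_capable, int ecs_off)
{
    return ecs_capable && !ecs_off && sketch_ecap_has_pasid(ecap);
}
```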
Florian Westphal [Wed, 10 Jun 2015 23:34:55 +0000 (01:34 +0200)]
netfilter: xtables: avoid percpu ruleset duplication
We store the rule blob per (possible) cpu. Unfortunately this means we can
waste a lot of memory on big smp machines. The ipt_entry structure ('rule head')
is 112 bytes, so e.g. with maxcpu=64 one single rule eats
close to 8k RAM.
Since the previous patch made the counters percpu, it appears there is nothing
left in the rule blob that needs to be percpu.
On my test system (144 possible cpus, 400k dummy rules) this
change saves close to 9 Gigabyte of RAM.
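The arithmetic behind these figures can be sketched as below. The function names are hypothetical. With only the 112-byte rule head, 64 copies come to 7168 bytes per rule; real rules also carry matches and targets, so the per-rule size passed in is an input rather than a constant.

```c
#include <stdint.h>

/* Bytes used when every possible cpu holds its own copy of each rule. */
uint64_t sketch_percpu_bytes(uint64_t rule_bytes, uint64_t ncpus,
                             uint64_t nrules)
{
    return rule_bytes * ncpus * nrules;
}

/* Bytes saved by keeping one shared copy instead of one per cpu. */
uint64_t sketch_saved_bytes(uint64_t rule_bytes, uint64_t ncpus,
                            uint64_t nrules)
{
    if (ncpus == 0)
        return 0;
    return rule_bytes * (ncpus - 1) * nrules;
}
```

For example, sketch_saved_bytes(112, 144, 400000) gives 6406400000 bytes for the rule heads alone, which is consistent in magnitude with the reported saving once matches and targets are included.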
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Wed, 10 Jun 2015 23:34:54 +0000 (01:34 +0200)]
netfilter: xtables: use percpu rule counters
The binary arp/ip/ip6tables ruleset is stored per cpu.
The only reason left as to why we need percpu duplication is the rule
counters embedded into ipt_entry et al -- since each cpu has its own copy
of the rules, all counters can be lockless.
The downside is that the more cpus are supported, the more memory is
required. Rules are not just duplicated per online cpu but for each
possible cpu, i.e. if maxcpu is 144, then each rule is duplicated 144 times,
not for the e.g. 64 cores present.
To save some memory and also improve utilization of shared caches it
would be preferable to only store the rule blob once.
So we first need to separate counters and the rule blob.
Instead of using entry->counters, allocate this percpu and store the
percpu address in entry->counters.pcnt on CONFIG_SMP.
This change makes no sense as-is; it is merely an intermediate step to
remove the percpu duplication of the rule set in a followup patch.
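The separation of counters from the rule blob can be sketched in userspace C. This is an illustration under stated assumptions: the names and the fixed cpu count are hypothetical, a plain array stands in for the kernel's percpu allocation, and the sketch is not thread-safe in the way real percpu data is.

```c
#include <stdint.h>

#define SKETCH_NCPUS 4 /* hypothetical number of possible cpus */

/* Packet and byte counters, one pair per cpu. */
struct sketch_counter {
    uint64_t pcnt; /* packets */
    uint64_t bcnt; /* bytes */
};

/* One shared rule; only the counters are kept per cpu. */
struct sketch_rule {
    struct sketch_counter percpu[SKETCH_NCPUS];
};

/* Account one packet on the given cpu without touching other cpus. */
void sketch_rule_count(struct sketch_rule *r, unsigned int cpu,
                       uint64_t bytes)
{
    r->percpu[cpu].pcnt += 1;
    r->percpu[cpu].bcnt += bytes;
}

/* Reading the counters sums over all possible cpus. */
struct sketch_counter sketch_rule_total(const struct sketch_rule *r)
{
    struct sketch_counter sum = {0, 0};

    for (unsigned int i = 0; i < SKETCH_NCPUS; i++) {
        sum.pcnt += r->percpu[i].pcnt;
        sum.bcnt += r->percpu[i].bcnt;
    }
    return sum;
}
```

The hot path stays cheap because each cpu only writes its own slot; the cost moves to the rare read that has to walk every slot.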
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 5 Jun 2015 11:27:13 +0000 (13:27 +0200)]
netfilter: bridge: restore vlan tag when refragmenting
If bridge netfilter is used with both
bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled
then ip fragments in VLAN frames are sent without the vlan header.
This has never worked reliably. Turns out this relied on pre-3.5
behaviour where skb frag_list was used to store ip fragments;
ip_fragment() then re-used these skbs.
But since commit
3cc4949269e01f39443d0fcfffb5bc6b47878d45
("ipv4: use skb coalescing in defragmentation") this is no longer
the case. ip_do_fragment now needs to allocate new skbs, but these
don't contain the vlan tag information anymore.
Fix it by storing vlan information of the reassembled skb in the
br netfilter percpu frag area, and restore them for each of the
fragments.
Fixes: 3cc4949269e01f3 ("ipv4: use skb coalescing in defragmentation")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 5 Jun 2015 11:28:38 +0000 (13:28 +0200)]
net: ip_fragment: remove BRIDGE_NETFILTER mtu special handling
Since commit
d6b915e29f4adea9
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.
Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:30:16 +0000 (15:30 +0200)]
netfilter: bridge: forward IPv6 fragmented packets
IPv6 fragmented packets are not forwarded on an ethernet bridge
with netfilter ip6_tables loaded. Steps to reproduce, for example:
1) create a simple bridge like this
modprobe br_netfilter
brctl addbr br0
brctl addif br0 eth0
brctl addif br0 eth2
ifconfig eth0 up
ifconfig eth2 up
ifconfig br0 up
2) place a host with an IPv6 address on each side of the bridge
set IPv6 address on host A:
ip -6 addr add fd01:2345:6789:1::1/64 dev eth0
set IPv6 address on host B:
ip -6 addr add fd01:2345:6789:1::2/64 dev eth0
3) run a simple ping command on host A with packets > MTU
ping6 -s 4000 fd01:2345:6789:1::2
4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge
IPv6 fragmented packets traverse the bridge cleanly until somebody runs
"ip6tables -t nat -nvL". As soon as it is run (and netfilter modules are
loaded) IPv6 fragmented packets do not traverse the bridge any more (you
see no more responses in ping's output).
After applying this patch IPv6 fragmented packets traverse the bridge
cleanly in above scenario.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
[pablo@netfilter.org: small changes to br_nf_dev_queue_xmit]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:29:38 +0000 (15:29 +0200)]
netfilter: bridge: re-order check_hbh_len()
Prepare check_hbh_len() to be called from newly introduced
br_validate_ipv6() in next commit.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:29:02 +0000 (15:29 +0200)]
netfilter: bridge: rename br_parse_ip_options
br_parse_ip_options() does not parse any IP options, it validates IP
packets as a whole and the function name is misleading.
Rename br_parse_ip_options() to br_validate_ipv4() and remove unneeded
comments.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:28:28 +0000 (15:28 +0200)]
netfilter: bridge: refactor frag_max_size
Currently frag_max_size is member of br_input_skb_cb and copied back and
forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed or
used.
Attach frag_max_size to nf_bridge_info and set value in pre_routing and
forward functions. Use its value in forward and xmit functions.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:27:40 +0000 (15:27 +0200)]
netfilter: bridge: detect NAT66 correctly and change MAC address
IPv4 iptables allows to REDIRECT/DNAT/SNAT any traffic over a bridge.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-iptables=1
$ iptables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This does not work with ip6tables on a bridge in NAT66 scenario
because the REDIRECT/DNAT/SNAT is not correctly detected.
The bridge pre-routing (finish) netfilter hook has to check for a possible
redirect and then fix the destination mac address. This allows the use of
ip6tables rules for local REDIRECT/DNAT/SNAT similar to the IPv4
iptables version.
e.g. REDIRECT
$ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
$ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 \
-j REDIRECT --to-ports 81
This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
on a bridge with two interfaces using SNAT/DNAT NAT66 rules.
Reported-by: Artie Hamilton <artiemhamilton@yahoo.com>
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
[bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
[bernhard.thaler@wvnet.at: rebased, split into separate patches]
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:26:57 +0000 (15:26 +0200)]
netfilter: bridge: re-order br_nf_pre_routing_finish_ipv6()
Put br_nf_pre_routing_finish_ipv6() after daddr_was_changed() and
br_nf_pre_routing_finish_bridge() to prepare calling these functions
from there.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:26:13 +0000 (15:26 +0200)]
netfilter: bridge: refactor clearing BRNF_NF_BRIDGE_PREROUTING
use binary AND on complement of BRNF_NF_BRIDGE_PREROUTING to unset
bit in nf_bridge->mask.
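For illustration only, a minimal user-space sketch of the idiom described above. The flag values and the demo_clear_flag() name are hypothetical stand-ins, not the kernel's definitions. AND-ing with the complement is idempotent: it leaves the bit clear if it was already clear, whereas a toggle would set it again.

```c
#include <stdint.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define DEMO_FLAG_PREROUTING 0x01u
#define DEMO_FLAG_OTHER      0x08u

/* Clear one flag by AND-ing with its complement; all other bits are kept. */
static inline uint32_t demo_clear_flag(uint32_t mask, uint32_t flag)
{
	return mask & ~flag;
}
```

Calling demo_clear_flag() twice with the same flag gives the same result as calling it once.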
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Marcelo Ricardo Leitner [Thu, 21 May 2015 13:57:12 +0000 (10:57 -0300)]
netfilter: conntrack: warn the user if there is a better helper to use
After db29a9508a92 ("netfilter: conntrack: disable generic tracking for
known protocols"), if the specific helper is built but not loaded
(a standard for most distributions) systems with a restrictive firewall
but weak configuration regarding netfilter modules to load, will
silently stop working.
This patch adds a warning message so the sysadmin knows where to
start looking. It's a pr_warn_once regardless of the protocol itself,
but it should be enough to give a hint on where to look.
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David Woodhouse [Fri, 12 Jun 2015 09:15:49 +0000 (10:15 +0100)]
iommu/vt-d: Only enable extended context tables if PASID is supported
Although the extended tables are theoretically a completely orthogonal
feature to PASID and anything else that *uses* the newly-available bits,
some of the early hardware has problems even when all we do is enable
them and use only the same bits that were in the old context tables.
For now, there's no motivation to support extended tables unless we're
going to use PASID support to do SVM. So just don't use them unless
PASID support is advertised too. Also add a command-line bailout just in
case later chips also have issues.
The equivalent problem for PASID support has already been fixed with the
upcoming VT-d spec update and commit bd00c606a ("iommu/vt-d: Change
PASID support to bit 40 of Extended Capability Register"), because the
problematic platforms use the old definition of the PASID-capable bit,
which is now marked as reserved and meaningless.
So with this change, we'll magically start using ECS again only when we
see the new hardware advertising "hey, we have PASID support and we
actually tested it this time" on bit 40.
The VT-d hardware architect has promised that we are not going to have
any reason to support ECS *without* PASID any time soon, and he'll make
sure he checks with us before changing that.
In the future, if hypothetical new features also use new bits in the
context tables and can be seen on implementations *without* PASID support,
we might need to add their feature bits to the ecs_enabled() macro.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
NeilBrown [Fri, 12 Jun 2015 10:05:04 +0000 (20:05 +1000)]
md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished. However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared. This
can lead to multiple reshapes running at the same time, which isn't
good.
So make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Fri, 12 Jun 2015 09:51:27 +0000 (19:51 +1000)]
md: Close race when setting 'action' to 'idle'.
Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.
While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts). When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared. This allows two threads to start
which leads to confusion.
So don't bother if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
NeilBrown [Fri, 12 Jun 2015 09:46:44 +0000 (19:46 +1000)]
md: don't return 0 from array_state_store
Returning zero from a 'store' function is bad.
The return value should be either the length of the string
or an error.
So use 'len' if 'err' is zero.
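A minimal user-space sketch of that rule, assuming 'err' is zero or a negative errno value. demo_store_result() is a hypothetical helper name used only for illustration; it is not a function from the md code.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pick the return value of a sysfs-style 'store'
 * function. err is 0 on success or a negative errno; len is the number
 * of bytes the caller handed in.
 */
static inline ssize_t demo_store_result(int err, size_t len)
{
	return err ? (ssize_t)err : (ssize_t)len;
}
```

Returning the full length on success tells the writer that all input was consumed, so it has no reason to retry the write.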
Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org (v4.0+)
Linus Torvalds [Fri, 12 Jun 2015 00:35:14 +0000 (17:35 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"i915 and radeon fixes:
i915:
fix for connector oops regression
DDC probing fix
radeon:
two radeon reverts, along with a freeze workaround and a fix"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: Make sure radeon_vm_bo_set_addr always unreserves the BO
Revert "drm/radeon: adjust pll when audio is not enabled"
Revert "drm/radeon: don't share plls if monitors differ in audio support"
drm/radeon: fix freeze for laptop with Turks/Thames GPU.
drm/i915: Fix DDC probe for passive adapters
drm/i915: Properly initialize SDVO analog connectors
Shaohua Li [Thu, 11 Jun 2015 23:50:48 +0000 (16:50 -0700)]
net: don't wait for order-3 page allocation
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and adds latency. Commit 5640f7685831e0
introduced the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but is there to improve performance. But direct
memory compaction has high overhead. The benefit of order-3 allocation can't
compensate for the overhead of direct memory compaction.
This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the allocation will still succeed, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, and skb_page_frag_refill will
fall back to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is woken up and does
compaction, so chances are the allocation could succeed next time.
alloc_skb_with_frags is the same.
The mellanox driver does a similar thing; if this is accepted, we must fix
the driver too.
V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer
Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Fri, 12 Jun 2015 00:11:50 +0000 (10:11 +1000)]
Merge tag 'drm-intel-fixes-2015-06-11' of git://anongit.freedesktop.org/drm-intel into drm-fixes
Fix for the regression Linus called out, and another for probing
dongles.
* tag 'drm-intel-fixes-2015-06-11' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Fix DDC probe for passive adapters
drm/i915: Properly initialize SDVO analog connectors
Dave Airlie [Fri, 12 Jun 2015 00:11:14 +0000 (10:11 +1000)]
Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Two regression reverts, and two fixes, one for a dpm boot freeze.
* 'drm-fixes-4.1' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: Make sure radeon_vm_bo_set_addr always unreserves the BO
Revert "drm/radeon: adjust pll when audio is not enabled"
Revert "drm/radeon: don't share plls if monitors differ in audio support"
drm/radeon: fix freeze for laptop with Turks/Thames GPU.
Robert Shearman [Thu, 11 Jun 2015 18:58:26 +0000 (19:58 +0100)]
mpls: handle device renames for per-device sysctls
If a device is renamed and the original name is subsequently reused
for a new device, the following warning is generated:
sysctl duplicate entry: /net/mpls/conf/veth0//input
CPU: 3 PID: 1379 Comm: ip Not tainted 4.1.0-rc4+ #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 0000000000000000 ffffffff81566aaf 0000000000000000
ffffffff81236279 ffff88002f7d7f00 0000000000000000 ffff88000db336d8
ffff88000db33698 0000000000000005 ffff88002e046000 ffff8800168c9280
Call Trace:
[<ffffffff81566aaf>] ? dump_stack+0x40/0x50
[<ffffffff81236279>] ? __register_sysctl_table+0x289/0x5a0
[<ffffffffa051a24f>] ? mpls_dev_notify+0x1ff/0x300 [mpls_router]
[<ffffffff8108db7f>] ? notifier_call_chain+0x4f/0x70
[<ffffffff81470e72>] ? register_netdevice+0x2b2/0x480
[<ffffffffa0524748>] ? veth_newlink+0x178/0x2d3 [veth]
[<ffffffff8147f84c>] ? rtnl_newlink+0x73c/0x8e0
[<ffffffff8147f27a>] ? rtnl_newlink+0x16a/0x8e0
[<ffffffff81459ff2>] ? __kmalloc_reserve.isra.30+0x32/0x90
[<ffffffff8147ccfd>] ? rtnetlink_rcv_msg+0x8d/0x250
[<ffffffff8145b027>] ? __alloc_skb+0x47/0x1f0
[<ffffffff8149badb>] ? __netlink_lookup+0xab/0xe0
[<ffffffff8147cc70>] ? rtnetlink_rcv+0x30/0x30
[<ffffffff8149e7a0>] ? netlink_rcv_skb+0xb0/0xd0
[<ffffffff8147cc64>] ? rtnetlink_rcv+0x24/0x30
[<ffffffff8149df17>] ? netlink_unicast+0x107/0x1a0
[<ffffffff8149e4be>] ? netlink_sendmsg+0x50e/0x630
[<ffffffff8145209c>] ? sock_sendmsg+0x3c/0x50
[<ffffffff81452beb>] ? ___sys_sendmsg+0x27b/0x290
[<ffffffff811bd258>] ? mem_cgroup_try_charge+0x88/0x110
[<ffffffff811bd5b6>] ? mem_cgroup_commit_charge+0x56/0xa0
[<ffffffff811d7700>] ? do_filp_open+0x30/0xa0
[<ffffffff8145336e>] ? __sys_sendmsg+0x3e/0x80
[<ffffffff8156c3f2>] ? system_call_fastpath+0x16/0x75
Fix this by unregistering the previous sysctl table (registered for
the path containing the original device name) and re-registering the
table for the path containing the new device name.
Fixes: 37bde79979c3 ("mpls: Per-device enabling of packet input")
Reported-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Jun 2015 23:33:11 +0000 (16:33 -0700)]
Merge branch 'tcp-gso-settings-defer'
Eric Dumazet says:
====================
tcp: defer shinfo->gso_size|type settings
We put shinfo->gso_segs in TCP_SKB_CB(skb) a while back for performance
reasons.
This was in commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location")
This patch series completes the job for gso_size and gso_type, so that
we do not bring 2 extra cache lines in tcp write xmit fast path,
and makes tcp_init_tso_segs() simpler and faster.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jun 2015 16:15:19 +0000 (09:15 -0700)]
tcp: remove obsolete check in tcp_set_skb_tso_segs()
We had various issues in the past when TCP stack was modifying
gso_size/gso_segs while clones were in flight.
Commit c52e2421f73 ("tcp: must unclone packets before mangling them")
fixed these bugs and added a WARN_ON_ONCE(skb_cloned(skb)); in
tcp_set_skb_tso_segs()
These bugs are now fixed, and because TCP stack now only sets
shinfo->gso_size|segs on the clone itself, the check can be removed.
As a result of this change, compiler inlines tcp_set_skb_tso_segs() in
tcp_init_tso_segs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jun 2015 16:15:18 +0000 (09:15 -0700)]
tcp: fill shinfo->gso_size at last moment
In commit cd7d8498c9a5 ("tcp: change tcp_skb_pcount() location") we stored
gso_segs in a temporary cache hot location.
This patch does the same for gso_size.
This allows to save 2 cache line misses in tcp xmit path for
the last packet that is considered but not sent because of
various conditions (cwnd, tso defer, receiver window, TSQ...)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jun 2015 16:15:17 +0000 (09:15 -0700)]
tcp: tcp_set_skb_tso_segs() no longer need struct sock parameter
tcp_set_skb_tso_segs() & tcp_init_tso_segs() no longer
use the sock pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jun 2015 16:15:16 +0000 (09:15 -0700)]
tcp: fill shinfo->gso_type at last moment
Our goal is to touch skb_shinfo(skb) only when absolutely needed,
to avoid two cache line misses in TCP output path for last skb
that is considered but not sent because of various conditions
(cwnd, tso defer, receiver window, TSQ...)
A packet is GSO only when skb_shinfo(skb)->gso_size is not zero.
We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for
non GSO packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Jun 2015 16:15:15 +0000 (09:15 -0700)]
tcp: reserve tcp_skb_mss() to tcp stack
tcp_gso_segment() and tcp_gro_receive() are not strictly
part of TCP stack. They should not assume tcp_skb_mss(skb)
is in fact skb_shinfo(skb)->gso_size.
This will allow us to change tcp_skb_mss() in following patches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Thu, 11 Jun 2015 15:19:01 +0000 (08:19 -0700)]
switchdev: fix BUG when port driver doesn't support set attr op
Fix a BUG_ON() where CONFIG_NET_SWITCHDEV is set but the driver for a
bridged port does not support switchdev_port_attr_set op. Don't BUG_ON()
if -EOPNOTSUPP is returned.
Also change BUG_ON() to netdev_err since this is a normal error path and
does not warrant the use of BUG_ON(), which is reserved for unrecoverable
errs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Thu, 11 Jun 2015 12:51:30 +0000 (14:51 +0200)]
net: igb: fix the start time for periodic output signals
When programming the start of a periodic output, the code wrongly places
the seconds value into the "low" register and the nanoseconds into the
"high" register. Even though this is backwards, it slipped through my
testing, because the re-arming code in the interrupt service routine is
correct, and the signal does appear starting with the second edge.
This patch fixes the issue by programming the registers correctly.
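The intended split can be sketched in user-space C as follows. The struct, its field names and demo_start_time() are hypothetical and only model a low/high register pair; they are not the igb register layout or driver code.

```c
#include <stdint.h>

/* Hypothetical register pair for illustration; not the igb register layout. */
struct demo_target_time {
	uint32_t low;	/* nanoseconds part of the start time */
	uint32_t high;	/* seconds part of the start time */
};

/* Build the start time with each part in its proper register. */
static inline struct demo_target_time demo_start_time(uint32_t sec, uint32_t nsec)
{
	struct demo_target_time t;

	t.low = nsec;	/* the bug described above stored sec here */
	t.high = sec;	/* ... and nsec here */
	return t;
}
```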
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Jun 2015 22:57:18 +0000 (15:57 -0700)]
Merge branch 'bna-next'
Ivan Vecera says:
====================
bna: clean-up
The patches clean the bna driver.
v2: changes & comments requested by Joe
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:31 +0000 (15:52 +0200)]
bna: use netdev_* and dev_* instead of printk and pr_*
...and remove some of them. It is not necessary to log when .probe() and
.remove() are called or when TxQ is started or stopped. Also the log level
of some of them was changed to a more appropriate one (link up/down,
firmware loading failure).
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:30 +0000 (15:52 +0200)]
bna: fix timeout API argument type
Timeout functions are defined with 'void *' ptr argument. They should
be defined directly with 'struct bfa_ioc *' type to avoid type conversions.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:29 +0000 (15:52 +0200)]
bna: use list_for_each_entry where appropriate
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:28 +0000 (15:52 +0200)]
bna: get rid of private macros for manipulation with lists
Remove macros for manipulation with struct list_head and replace them
with standard ones.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:27 +0000 (15:52 +0200)]
bna: remove useless pointer assignment
Pointer cmpl used to iterate through completion entries is updated at
the beginning of while loop as well as at the end. The update at the end
of the loop is useless.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:26 +0000 (15:52 +0200)]
bna: use memdup_user to copy userspace buffers
Patch converts kzalloc->copy_from_user sequence to memdup_user. There
is also one useless assignment of NULL to bnad->regdata as it is followed
by assignment of kzalloc output.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:25 +0000 (15:52 +0200)]
bna: correct comparisons/assignments to bool
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:24 +0000 (15:52 +0200)]
bna: remove TX_E_PRIO_CHANGE event and BNA_TX_F_PRIO_CHANGED flag
TX_E_PRIO_CHANGE event is never sent for bna_tx so it doesn't need to be
handled. After this change bna_tx->flags cannot contain
BNA_TX_F_PRIO_CHANGED flag and it can be also eliminated.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:23 +0000 (15:52 +0200)]
bna: remove paused from bna_rx_config and flags from bna_rxf
The bna_rx_config struct member paused can be removed as it is never
written. As it cannot have a non-zero value, the bna_rxf struct member
flags also cannot have the BNA_RXF_F_PAUSED value and is always zero.
So the flags member can be removed, as well as the bna_rxf_flags enum and
the code paths that need non-zero bna_rxf->flags.
This clean-up makes the bna_rxf_sm_paused state unused, so it can also be removed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:22 +0000 (15:52 +0200)]
bna: remove RXF_E_PAUSE and RXF_E_RESUME events
RXF_E_PAUSE & RXF_E_RESUME events are never sent for bna_rxf object so
they needn't be handled. The bna_rxf's state bna_rxf_sm_fltr_clr_wait
and function bna_rxf_fltr_clear are unused after this so remove them also.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:21 +0000 (15:52 +0200)]
bna: remove prio_change_cbfn oper_state_cbfn from struct bna_tx
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:20 +0000 (15:52 +0200)]
bna: remove oper_state_cbfn from struct bna_rxf
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:19 +0000 (15:52 +0200)]
bna: remove pause_cbfn from struct bna_enet
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:18 +0000 (15:52 +0200)]
bna: remove unused cbfn parameter
removed:
bna_rx_ucast_add
bna_rx_ucast_del
simplified:
bna_enet_pause_config
bna_rx_mcast_delall
bna_rx_mcast_listset
bna_rx_mode_set
bna_rx_ucast_listset
bna_rx_ucast_set
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:17 +0000 (15:52 +0200)]
bna: use BIT(x) instead of (1 << x)
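A minimal user-space sketch of the idiom; DEMO_BIT is a stand-in name defined here for illustration. One practical difference from a plain (1 << x) is that the shifted constant is an unsigned long, so bit 31 can be set without shifting into the sign bit of an int.

```c
/* Stand-in for a BIT()-style helper: an unsigned long one, shifted left. */
#define DEMO_BIT(nr) (1UL << (nr))
```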
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:16 +0000 (15:52 +0200)]
bna: get rid of duplicate and unused macros
replaced macros:
BNA_MAC_IS_EQUAL -> ether_addr_equal
BNA_POWER_OF_2 -> is_power_of_2
BNA_TO_POWER_OF_2_HIGH -> roundup_pow_of_two
removed unused macros:
bfa_fsm_get_state
bfa_ioc_clr_stats
bfa_ioc_fetch_stats
bfa_ioc_get_alt_ioc_fwstate
bfa_ioc_isr_mode_set
bfa_ioc_maxfrsize
bfa_ioc_mbox_cmd_pending
bfa_ioc_ownership_reset
bfa_ioc_rx_bbcredit
bfa_ioc_state_disabled
bfa_sm_cmp_state
bfa_sm_get_state
bfa_sm_send_event
bfa_sm_set_state
bfa_sm_state_decl
BFA_STRING_32
BFI_ADAPTER_IS_{PROTO,TTV,UNSUPP}
BFI_IOC_ENDIAN_SIG
BNA_{C,RX,TX}Q_PAGE_INDEX_MAX
BNA_{C,RX,TX}Q_PAGE_INDEX_MAX_SHIFT
BNA_{C,RX,TX}Q_QPGE_PTR_GET
BNA_IOC_TIMER_FREQ
BNA_MESSAGE_SIZE
BNA_QE_INDX_2_PTR
BNA_QE_INDX_RANGE
BNA_Q_GET_{C,P}I
BNA_Q_{C,P}I_ADD
BNA_Q_FREE_COUNT
BNA_Q_IN_USE_COUNT
BNA_TO_POWER_OF_2
containing_rec
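For readers unfamiliar with the generic helpers named in the "replaced macros" list, here is a rough user-space sketch of what the two power-of-two helpers compute. The demo_* functions are illustrative re-implementations written for this note, not the kernel's code, and the loop-based round-up is an assumption chosen for clarity rather than speed.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if exactly one bit is set; zero is not a power of two. */
static inline bool demo_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Smallest power of two that is >= n. This sketch returns 1 for n == 0
 * and refuses inputs whose result would not fit in an unsigned long.
 */
static inline unsigned long demo_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	assert(n <= (ULONG_MAX >> 1) + 1);
	while (p < n)
		p <<= 1;
	return p;
}
```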
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:15 +0000 (15:52 +0200)]
bna: replace pragma(pack) with attribute __packed
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:14 +0000 (15:52 +0200)]
bna: get rid of mac_t
The patch converts mac_t type to widely used 'u8 [ETH_ALEN]'.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Jun 2015 13:52:13 +0000 (15:52 +0200)]
bna: use ether_addr_copy instead of memcpy
Parameters of all ether_addr_copy instances were checked for proper
alignment. Alignment of bnad_bcast_addr is forced to 2 as the implicit
alignment is 1.
I have also renamed address parameter of bnad_set_mac_address() to addr.
The name mac_addr was a little bit confusing as the real parameter is
struct sockaddr *.
v2: added __aligned directive to bnad_bcast_addr, renamed parameter of
bnad_set_mac_address() (thx joe@perches.com)
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Jun 2015 22:55:26 +0000 (15:55 -0700)]
Merge branch 'mlx5-next'
Or Gerlitz says:
====================
mlx5 Ethernet driver update - Jun 11 2015
This series from Saeed, Achiad and Gal contains few fixes
to the recently introduced mlx5 Ethernet functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Thu, 11 Jun 2015 11:47:33 +0000 (14:47 +0300)]
net/mlx5e: Add transport domain to the ethernet TIRs/TISs
Allocate and use transport domain by the Ethernet driver code.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Achiad Shochat [Thu, 11 Jun 2015 11:47:32 +0000 (14:47 +0300)]
net/mlx5_core: Add transport domain alloc/dealloc support
Each transport object, namely TIR and TIS, must have a transport domain
number (TDN) identifier.
The driver wrongly assumed that it is OK to use TDN=0 without explicit
TDN allocation from the device.
The TDN will also be used for isolating different processes once user
mode Ethernet will be supported.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 11 Jun 2015 11:47:31 +0000 (14:47 +0300)]
net/mlx5e: Support NETIF_F_SG
When NETIF_F_SG is set, each send WQE may have a different size since
each skb can have different number of fragments as of LSO header etc.
This implies that a given WQE may wrap around the send queue, i.e. begin
at its end and continue at its start. While it is legal by the device spec,
we preferred a solution that avoids it - when building of the current WQE is
done, if the next WQE may wrap around the send queue, fill the send queue
with NOP WQEs till its end, so that the next WQE will begin at the send queue
start.
A NOP WQE itself cannot wrap around the send queue since it is of
minimal size - 64 bytes, and all send WQEs are a multiple of that size.
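The padding scheme can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the driver's code: the queue size, the maximum WQE size and every demo_* name are hypothetical, sizes are counted in 64-byte blocks, and the consumer side (queue-full handling) is left out.

```c
#include <assert.h>

#define DEMO_SQ_BLOCKS      16u	/* ring size in 64-byte blocks, power of two */
#define DEMO_MAX_WQE_BLOCKS  4u	/* largest WQE this sketch will post */

struct demo_sq {
	unsigned int pc;	/* free-running producer counter, in blocks */
	unsigned int nops;	/* NOP blocks posted so far */
};

/* Blocks left between the producer index and the end of the ring. */
static inline unsigned int demo_room_to_end(const struct demo_sq *sq)
{
	return DEMO_SQ_BLOCKS - (sq->pc & (DEMO_SQ_BLOCKS - 1));
}

/*
 * Post a WQE of 'blocks' blocks and return the ring index it starts at.
 * Once it is built, pad to the end of the ring with one-block NOPs if a
 * maximal next WQE might not fit, so the next WQE starts at index 0.
 */
static inline unsigned int demo_post_wqe(struct demo_sq *sq, unsigned int blocks)
{
	unsigned int start = sq->pc & (DEMO_SQ_BLOCKS - 1);
	unsigned int room;

	assert(blocks >= 1 && blocks <= DEMO_MAX_WQE_BLOCKS);
	sq->pc += blocks;

	room = demo_room_to_end(sq);
	if (room < DEMO_MAX_WQE_BLOCKS) {
		sq->nops += room;
		sq->pc += room;
	}
	return start;
}
```

Because the check runs after every post, each call starts with at least DEMO_MAX_WQE_BLOCKS of contiguous room, so no WQE in this model ever straddles the end of the ring.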
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Thu, 11 Jun 2015 11:47:30 +0000 (14:47 +0300)]
net/mlx5e: Enforce max flow-tables level >= 3
The Ethernet driver requires at least 3 flow table levels to
operate, enforce that.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 11 Jun 2015 11:47:29 +0000 (14:47 +0300)]
net/mlx5e: Disable client vlan TX acceleration
We need to resolve a HW configuration issue for enabling HW CVLAN
insertion. Meanwhile, no need to implement the VLAN insertion in
the driver, rather use the generic kernel VLAN insertion method.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 11 Jun 2015 11:47:28 +0000 (14:47 +0300)]
net/mlx5e: Add HW cacheline start padding
Enable HW cacheline start padding and align RX WQE size to cacheline
while considering HW start padding. Also, fix dma_unmap call to use
the correct SKB data buffer size.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed [Thu, 11 Jun 2015 11:47:27 +0000 (14:47 +0300)]
net/mlx5e: Fix HW MTU settings
Previously we configured HW MTU to be netdev->mtu, but actually we
need to configure netdev->mtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN).
Also, query MTU can not fail, hence make the relevant helper a
void function, and add mlx5e_set_dev_port_mtu, a helper function to
handle MTU setting.
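As a worked example of the arithmetic, here is a small user-space sketch. The DEMO_* constants mirror the usual values of ETH_HLEN (14), VLAN_HLEN (4) and ETH_FCS_LEN (4); the demo_* helper names are hypothetical and are not the driver's functions.

```c
/* Usual Ethernet overheads in bytes, mirrored here for a user-space sketch. */
#define DEMO_ETH_HLEN     14u	/* Ethernet header */
#define DEMO_VLAN_HLEN     4u	/* one VLAN tag */
#define DEMO_ETH_FCS_LEN   4u	/* frame check sequence */

#define DEMO_MTU_OVERHEAD (DEMO_ETH_HLEN + DEMO_VLAN_HLEN + DEMO_ETH_FCS_LEN)

/* MTU to program into the hardware for a given netdev MTU. */
static inline unsigned int demo_sw2hw_mtu(unsigned int sw_mtu)
{
	return sw_mtu + DEMO_MTU_OVERHEAD;
}

/* Inverse conversion, for a value read back from the hardware. */
static inline unsigned int demo_hw2sw_mtu(unsigned int hw_mtu)
{
	return hw_mtu - DEMO_MTU_OVERHEAD;
}
```

With a netdev MTU of 1500 the hardware value comes out as 1500 + 22 = 1522.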
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Thu, 11 Jun 2015 08:50:01 +0000 (11:50 +0300)]
net/mlx5_core: fix an error code
We return success if mlx5e_alloc_sq_db() fails but we should return an
error code.
Fixes: f62b8bb8f2d3 ('net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabian Frederick [Wed, 10 Jun 2015 16:33:26 +0000 (18:33 +0200)]
vxge: use swap() in vxge_hw_channel_dtr_alloc()
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
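For illustration, a user-space stand-in for a swap()-style macro. demo_swap is a hypothetical name, and the sketch assumes a compiler that provides __typeof__ (GCC and Clang do).

```c
/*
 * Exchange two lvalues of the same type through a temporary.
 * Assumes the __typeof__ extension (GCC/Clang); arguments are evaluated
 * more than once, so they should be free of side effects.
 */
#define demo_swap(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)
```

Compared with an open-coded three-line exchange, the macro removes the explicit temporary variable at the call site.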
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David S. Miller <davem@davemloft.net>