Petr Machata [Sun, 29 Dec 2019 11:48:27 +0000 (13:48 +0200)]
mlxsw: reg: Add QoS Port DSCP to Priority Mapping Register
Add QPDP. This register controls the port default Switch Priority and
Color. The default Switch Priority and Color are used for frames where the
trust state uses default values. Currently there are two cases where this
applies: a port is in trust-PCP state, but a packet arrives untagged; and a
port is in trust-DSCP state, but a non-IP packet arrives.
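The two cases listed above can be sketched as a small predicate. This is only an illustration of the decision described in the message, not mlxsw code; `trust_sketch`, `frame_sketch` and `uses_port_default()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trust states of a port. */
enum trust_sketch { TRUST_PCP, TRUST_DSCP };

/* Hypothetical frame summary: only the two properties the text mentions. */
struct frame_sketch {
	bool vlan_tagged;	/* frame carries a VLAN tag, hence a PCP field */
	bool is_ip;		/* frame is IP, hence carries a DSCP field */
};

/* Returns true when the port default Switch Priority and Color apply. */
static bool uses_port_default(enum trust_sketch trust,
			      const struct frame_sketch *f)
{
	assert(f != NULL);

	switch (trust) {
	case TRUST_PCP:
		return !f->vlan_tagged;	/* untagged frame: no PCP to trust */
	case TRUST_DSCP:
		return !f->is_ip;	/* non-IP frame: no DSCP to trust */
	}
	return false;
}
```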
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 2 Jan 2020 23:37:53 +0000 (15:37 -0800)]
Merge branch 'page_pool-NUMA-node-handling-fixes'
Jesper Dangaard Brouer says:
====================
page_pool: NUMA node handling fixes
The recently added NUMA changes to page_pool (merged for v5.5) both
contain a bug in handling the NUMA_NO_NODE condition and add code to
the fast-path.
This patchset fixes the bug and moves code out of fast-path. The first
patch contains a fix that should be considered for 5.5. The second
patch reduces code size and overhead in case CONFIG_NUMA is disabled.
Currently the NUMA_NO_NODE setting bug only affects driver 'ti_cpsw'
(drivers/net/ethernet/ti/), but after this patchset, we plan to move
other drivers (netsec and mvneta) to use NUMA_NO_NODE setting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Fri, 27 Dec 2019 17:13:24 +0000 (18:13 +0100)]
page_pool: help compiler remove code in case CONFIG_NUMA=n
When the kernel is compiled without NUMA support, the page_pool NUMA
config setting (pool->p.nid) doesn't make any practical sense, but the
compiler cannot see that it can remove the code paths.
This patch avoids reading the pool->p.nid setting in case of !CONFIG_NUMA,
in the allocation and NUMA check code, which helps the compiler see the
optimisation potential. It leaves the update code intact to keep the API the
same.
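A minimal userspace sketch of the idea, assuming a build-time switch in place of CONFIG_NUMA. The names `SKETCH_CONFIG_NUMA`, `pool_sketch`, `current_node_sketch()` and `preferred_nid()` are hypothetical and only mirror the shape of the change, not the actual page_pool code.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the pool; only the nid setting matters here. */
struct pool_sketch {
	int nid;
};

/* Hypothetical stand-in for numa_mem_id(); a single-node system has node 0. */
static int current_node_sketch(void)
{
	return 0;
}

static int preferred_nid(const struct pool_sketch *pool)
{
	assert(pool != NULL);
#ifdef SKETCH_CONFIG_NUMA
	/* With NUMA, honour the configured nid unless it is NUMA_NO_NODE. */
	return pool->nid == NUMA_NO_NODE ? current_node_sketch() : pool->nid;
#else
	/* Without NUMA, pool->nid is never read, so code depending on it can go. */
	(void)pool;
	return current_node_sketch();
#endif
}
```

Built without `SKETCH_CONFIG_NUMA`, the function no longer depends on the pool contents at all, which is the kind of dead-code opportunity the bloat-o-meter numbers reflect.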
$ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
net/core/page_pool.o-numa-disabled
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
Function old new delta
page_pool_create 401 398 -3
__page_pool_alloc_pages_slow 439 426 -13
page_pool_refill_alloc_cache 425 328 -97
Total: Before=3611, After=3498, chg -3.13%
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Fri, 27 Dec 2019 17:13:18 +0000 (18:13 +0100)]
page_pool: handle page recycle for NUMA_NO_NODE condition
The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing the RX-packet during softirq/NAPI, as
illustrated by the performance measurements.
This patch moves the NUMA checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.
First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will look up the current CPU nid (NUMA ID) via numa_mem_id(), which is used
as the preferred nid. It is only in rare situations, e.g. where a
NUMA zone runs dry, that the page doesn't get allocated from the
preferred nid. The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.
This patch moves the NUMA check to when alloc cache is refilled, via
dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
pages from remote NUMA into the ptr_ring, as the dequeue/consume step
will check the NUMA node. All current drivers using page_pool will
alloc/refill RX-ring from same CPU running softirq/NAPI process.
Drivers that control the nid explicitly also use page_pool_update_nid
when changing the nid at runtime. To speed up the transition to the new nid,
the alloc cache is now flushed on nid changes. This forces pages to come from
the ptr_ring, which does the appropriate nid check.
For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid changes at runtime via consulting
numa_mem_id() based on the CPU processing RX-packets.
Notice, to avoid stressing the page buddy allocator and avoid doing too
much work under softirq with preempt disabled, the NUMA check at
ptr_ring dequeue will break the refill cycle, when detecting a NUMA
mismatch. This will cause a slower transition, but it's done on purpose.
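The refill behaviour described above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the page_pool implementation: the ring is modelled as a plain array, and `page_sketch`, `CACHE_REFILL_SKETCH` and `refill_cache()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_REFILL_SKETCH 4	/* hypothetical refill budget */

/* Hypothetical page: only its NUMA node is modelled. */
struct page_sketch {
	int nid;
};

/*
 * Move pages from ring[] into cache[] while they sit on pref_nid.
 * The first mismatching page is consumed (it would be released to the
 * page allocator) and ends the refill cycle, which keeps the work done
 * per call small. Returns the number of cached pages; *consumed reports
 * how many ring entries were taken, including a released one.
 */
static size_t refill_cache(const struct page_sketch *ring, size_t ring_len,
			   int pref_nid, const struct page_sketch **cache,
			   size_t *consumed)
{
	size_t cached = 0;
	size_t i = 0;

	assert(ring != NULL && cache != NULL && consumed != NULL);

	while (i < ring_len && cached < CACHE_REFILL_SKETCH) {
		const struct page_sketch *page = &ring[i++];

		if (page->nid != pref_nid)
			break;	/* NUMA mismatch: stop refilling */
		cache[cached++] = page;
	}
	*consumed = i;
	return cached;
}
```

Because the loop stops at the first mismatch, pages from a remote node drain from the ring gradually over several refill cycles, which matches the deliberately slower transition the message mentions.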
Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 1 Jan 2020 05:43:31 +0000 (21:43 -0800)]
Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
1GbE Intel Wired LAN Driver Updates 2019-12-31
This series contains updates to e1000e, igb and igc only.
Robert Beckett provides an igb change to assist in keeping packets from
being dropped due to receive descriptor ring being full when receive
flow control is enabled. Create a separate function to setup SRRCTL to
ease in reuse and ensure that the drop enable bit is set only if
receive flow control is not enabled.
Sasha adds scatter gather support in igc. Improve the
direct memory address mapping flow by optimizing and simplifying it and
making it clearer. Update igc to use pci_release_mem_regions() instead of
pci_release_selected_regions(). Clean up function header comments to
align with the actual code. Adds support for 64 bit DMA access, to help
handle socket buffer fragments in high memory. Adds legacy power
management support in igc by implementing suspend, resume,
runtime_suspend/resume, and runtime_idle callbacks. Clean up references
to Serdes interface in igc since that interface is not supported for
i225 devices.
Alex replaces the pr_info calls with netdev_info in all cases related to
netdev link state, as suggested by Joe Perches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Neftin [Mon, 2 Dec 2019 07:56:26 +0000 (09:56 +0200)]
igc: Remove serdes comments from a description of methods
Serdes interface is not applicable for i225 devices.
Remove this from comments and make comments clearer.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Thu, 31 Oct 2019 16:58:51 +0000 (09:58 -0700)]
e1000e: Use netdev_info instead of pr_info for link messages
Replace the pr_info calls with netdev_info in all cases related to the
netdevice link state.
As a result of this patch the link messages will change as shown below.
Before:
e1000e: ens3 NIC Link is Down
e1000e: ens3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
After:
e1000e 0000:00:03.0 ens3: NIC Link is Down
e1000e 0000:00:03.0 ens3: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Thu, 14 Nov 2019 07:54:46 +0000 (09:54 +0200)]
igc: Add legacy power management support
Add suspend, resume, runtime_suspend, runtime_resume and
runtime_idle callbacks implementation.
Reported-by: kbuild test robot <lpk@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Tue, 31 Dec 2019 21:37:13 +0000 (13:37 -0800)]
Merge git://git./linux/kernel/git/netdev/net
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Neftin [Wed, 13 Nov 2019 09:27:29 +0000 (11:27 +0200)]
igc: Add 64 bit DMA access support
On relevant platforms ndo_start_xmit can handle socket buffer
fragments in high memory
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Tue, 12 Nov 2019 15:13:32 +0000 (17:13 +0200)]
igc: Fix parameter descriptions for a several functions
The igc_watchdog, igc_set_interrupt_capability, igc_init_interrupt_scheme,
__igc_open and __igc_close parameter descriptions did not reflect
the functions' meaning. Add meaningful descriptions.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Mon, 11 Nov 2019 07:04:12 +0000 (09:04 +0200)]
igc: Fix the parameter description for igc_alloc_rx_buffers
The function description for igc_alloc_rx_buffers did not reflect
the function's meaning. Add a meaningful description.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Sun, 10 Nov 2019 15:57:44 +0000 (17:57 +0200)]
igc: Remove excess parameter description from igc_is_non_eop
The function description for igc_is_non_eop includes an extra @skb
parameter description. This parameter doesn't exist on the function, so
remove it.
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Sun, 10 Nov 2019 14:05:20 +0000 (16:05 +0200)]
igc: Prefer to use the pci_release_mem_regions method
Use the pci_release_mem_regions method instead of the
pci_release_selected_regions method
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Tue, 5 Nov 2019 09:44:13 +0000 (11:44 +0200)]
igc: Improve the DMA mapping flow
Improve the probe flow and set both the DMA mask and the coherent
mask to the same thing. Make the flow optimized and clearer.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Thu, 31 Oct 2019 06:06:07 +0000 (08:06 +0200)]
igc: Add scatter gather support
Scatter gather is used to do DMA data transfers of data that is written to
noncontiguous areas of memory.
This patch enables scatter gather support.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Robert Beckett [Tue, 22 Oct 2019 15:31:41 +0000 (16:31 +0100)]
igb: dont drop packets if rx flow control is enabled
If Rx flow control has been enabled (via autoneg or forced), packets
should not be dropped due to Rx descriptor ring exhaustion. Instead
pause frames should be used to apply back pressure. This only applies
if VFs are not in use.
Move SRRCTL setup to its own function for easy reuse and only set drop
enable bit if Rx flow control is not enabled.
Since v1: always enable dropping of packets if VFs in use.
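The decision described above can be sketched as a small predicate. It covers only the two conditions named in this message (VFs in use, Rx flow control enabled); the real driver may consider more. `rx_cfg_sketch` and `want_drop_enable()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the adapter state relevant to the drop enable bit. */
struct rx_cfg_sketch {
	unsigned int vfs_in_use;	/* number of VFs allocated */
	bool rx_flow_control;		/* Rx pause frames enabled */
};

/* Returns true when the drop enable bit should be set for an Rx queue. */
static bool want_drop_enable(const struct rx_cfg_sketch *cfg)
{
	assert(cfg != NULL);

	if (cfg->vfs_in_use)
		return true;		/* with VFs, always allow dropping */
	return !cfg->rx_flow_control;	/* otherwise rely on pause frames */
}
```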
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Tue, 31 Dec 2019 19:14:58 +0000 (11:14 -0800)]
Merge git://git./linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
1) Fix big endian overflow in nf_flow_table, from Arnd Bergmann.
2) Fix port selection on big endian in nft_tproxy, from Phil Sutter.
3) Fix precision tracking for unbound scalars in bpf verifier, from
Daniel Borkmann.
4) Fix integer overflow in socket rcvbuf check in UDP, from Antonio
Messina.
5) Do not perform a neigh confirmation during a pmtu update over a
tunnel, from Hangbin Liu.
6) Fix DMA mapping leak in dpaa_eth driver, from Madalin Bucur.
7) Various PTP fixes for sja1105 dsa driver, from Vladimir Oltean.
8) Add missing inline to dummy definition of of_mdiobus_child_is_phy(), from
Geert Uytterhoeven
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
net/sched: add delete_empty() to filters and use it in cls_flower
tcp: Fix highest_sack and highest_sack_seq
ptp: fix the race between the release of ptp_clock and cdev
net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
net: dsa: sja1105: Really make the PTP command read-write
net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
cxgb4/cxgb4vf: fix flow control display for auto negotiation
mlxsw: spectrum: Use dedicated policer for VRRP packets
mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
net_sched: sch_fq: properly set sk->sk_pacing_status
bnx2x: Fix accounting of vlan resources among the PFs
bnx2x: Use appropriate define for vlan credit
of: mdio: Add missing inline to of_mdiobus_child_is_phy() dummy
net: phy: aquantia: add suspend / resume ops for AQR105
dpaa_eth: fix DMA mapping leak
...
Linus Torvalds [Tue, 31 Dec 2019 18:51:27 +0000 (10:51 -0800)]
Merge tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1
Pull tomoyo fixes from Tetsuo Handa:
"Two bug fixes:
- Suppress RCU warning at list_for_each_entry_rcu()
- Don't use fancy names on sockets"
* tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
tomoyo: Suppress RCU warning at list_for_each_entry_rcu().
tomoyo: Don't use nifty names on sockets.
Taehee Yoo [Sat, 28 Dec 2019 16:28:09 +0000 (16:28 +0000)]
hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
hsr slave interfaces don't have debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
is changed.
Test commands:
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
ip link set dummy0 name ap
Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666] dump_stack+0x96/0xdb
[21071.936687][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666] print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666] __kasan_report+0x12a/0x16f
[21071.941758][T22666] ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666] kasan_report+0xe/0x20
[21071.943325][T22666] hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666] hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666] ? __module_text_address+0x13/0x140
[21071.945897][T22666] notifier_call_chain+0x90/0x160
[21071.946743][T22666] dev_change_name+0x419/0x840
[21071.947496][T22666] ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666] ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666] ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666] ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666] ? do_setlink+0x811/0x2ef0
[21071.951991][T22666] do_setlink+0x811/0x2ef0
[21071.952613][T22666] ? is_bpf_text_address+0x81/0xe0
[ ... ]
Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dcd3 ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Sat, 28 Dec 2019 15:36:58 +0000 (16:36 +0100)]
net/sched: add delete_empty() to filters and use it in cls_flower
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:
6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.
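A minimal sketch of the callback arrangement described above, assuming an optional per-classifier hook with a cheap default. The types and functions here (`proto_sketch`, `proto_ops_sketch`, `check_delete()`, `flower_like_delete_empty()`) are hypothetical and do not reproduce the kernel's structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct proto_sketch;

/* Hypothetical ops table: delete_empty is optional. */
struct proto_ops_sketch {
	bool (*delete_empty)(struct proto_sketch *tp);
};

struct proto_sketch {
	const struct proto_ops_sketch *ops;
	bool deleting;
	unsigned int nr_filters;	/* hypothetical emptiness indicator */
};

/* Decide whether tp may be deleted. */
static bool check_delete(struct proto_sketch *tp)
{
	assert(tp != NULL && tp->ops != NULL);

	/* Classifiers with unlocked insertion supply their own check. */
	if (tp->ops->delete_empty)
		return tp->ops->delete_empty(tp);

	/* Others cannot race with insertion, so no walk is needed. */
	tp->deleting = true;
	return true;
}

/* Example hook: only agree to deletion when nothing is installed. */
static bool flower_like_delete_empty(struct proto_sketch *tp)
{
	tp->deleting = (tp->nr_filters == 0);
	return tp->deleting;
}
```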
This reverts commit 275c44aa194b7159d1191817b20e076f55f0e620.
Changes since v1:
- document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
is used, thanks to Vlad Buslov
- implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
- squash revert and new fix in a single patch, to be nice with bisect
tests that run tdc on u32 filter, thanks to Dave Miller
Fixes: 275c44aa194b ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vijay Khemka [Fri, 27 Dec 2019 22:43:49 +0000 (14:43 -0800)]
net/ncsi: Fix gma flag setting after response
gma_flag was set at the time of GMA command request but it should
only be set after getting a successful response. Move this flag
setting to the GMA response handler.
This flag is used mainly to avoid repeating the GMA command once
the MAC address has been received.
Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Kou [Fri, 27 Dec 2019 13:11:16 +0000 (13:11 +0000)]
sctp: add enabled check for path tracepoint loop.
sctp_outq_sack is the main function that handles SACK, and it is called very
frequently. The commit "move trace_sctp_probe_path into sctp_outq_sack"
added the code below to this function. The sctp tracepoint is disabled most
of the time, but the loop over the transport list is always executed even
when the tracepoint is disabled, which is unnecessary.
+ /* SCTP path tracepoint for congestion control debugging. */
+ list_for_each_entry(transport, transport_list, transports) {
+ trace_sctp_probe_path(transport, asoc);
+ }
This patch adds a tracepoint enabled check outside of the loop over the
transport list, and avoids traversing the list when the trace is disabled.
It is a small optimization.
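The pattern can be sketched in userspace like this. The flag `probe_enabled` stands in for the tracepoint's enabled state and `visits` exists only to make the effect observable; both, along with `transport_sketch`, `probe_path()` and `trace_paths()`, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool probe_enabled;	/* hypothetical: is the tracepoint enabled? */
static unsigned int visits;	/* list nodes visited, for illustration only */

/* Hypothetical singly linked transport list. */
struct transport_sketch {
	struct transport_sketch *next;
};

/* Stand-in for emitting one trace event. */
static void probe_path(const struct transport_sketch *t)
{
	assert(t != NULL);
	visits++;
}

static void trace_paths(const struct transport_sketch *head)
{
	/* Test once, outside the loop, so a disabled probe costs no traversal. */
	if (!probe_enabled)
		return;

	for (; head; head = head->next)
		probe_path(head);
}
```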
Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 31 Dec 2019 04:31:41 +0000 (20:31 -0800)]
Merge branch 'Improvements-to-SJA1105-DSA-RX-timestamping'
Vladimir Oltean says:
====================
Improvements to SJA1105 DSA RX timestamping
This series makes the sja1105 DSA driver use a dedicated kernel thread
for RX timestamping, a process which is time-sensitive and otherwise a
bit fragile. This allows users to customize their system (probably an
embedded PTP switch) fully and allocate the CPU bandwidth for the driver
to expedite the RX timestamps as quickly as possible.
While doing this conversion, add a function to the PTP core for
cancelling this kernel thread (function which I found rather strange to
be missing).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 13:02:30 +0000 (15:02 +0200)]
net: dsa: sja1105: Empty the RX timestamping queue on PTP settings change
When disabling PTP timestamping, don't reset the switch with the new
static config until all existing PTP frames have been timestamped on the
RX path or dropped. There's nothing we can do with these afterwards.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 13:02:29 +0000 (15:02 +0200)]
net: dsa: sja1105: Use PTP core's dedicated kernel thread for RX timestamping
And move the queue of skb's waiting for RX timestamps into the ptp_data
structure, since it isn't needed if PTP is not compiled.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 13:02:28 +0000 (15:02 +0200)]
ptp: introduce ptp_cancel_worker_sync
In order to effectively use the PTP kernel thread for tasks such as
timestamping packets, allow the user control over stopping it, which is
needed e.g. when the timestamping queues must be drained.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cambda Zhu [Fri, 27 Dec 2019 08:52:37 +0000 (16:52 +0800)]
tcp: Fix highest_sack and highest_sack_seq
From commit 50895b9de1d3 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when a new skb
is inserted into the rtx queue, because NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data is sent,
the next ACK with a sack option will increase tp->reordering unexpectedly.
This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.
Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladis Dronov [Fri, 27 Dec 2019 02:26:27 +0000 (03:26 +0100)]
ptp: fix the race between the release of ptp_clock and cdev
In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
device is removed, closing this file leads to a race. This reproduces
easily in a kvm virtual machine:
ts# cat openptp0.c
int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
ts# uname -r
5.5.0-rc3-46cf053e
ts# cat /proc/cmdline
... slub_debug=FZP
ts# modprobe ptp_kvm
ts# ./openptp0 &
[1] 670
opened /dev/ptp0, sleeping 10s...
ts# rmmod ptp_kvm
ts# ls /dev/ptp*
ls: cannot access '/dev/ptp*': No such file or directory
ts# ...woken up
[ 48.010809] general protection fault: 0000 [#1] SMP
[ 48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
[ 48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[ 48.016270] RIP: 0010:module_put.part.0+0x7/0x80
[ 48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
[ 48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
[ 48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
[ 48.019470] ... ^^^ a slub poison
[ 48.023854] Call Trace:
[ 48.024050] __fput+0x21f/0x240
[ 48.024288] task_work_run+0x79/0x90
[ 48.024555] do_exit+0x2af/0xab0
[ 48.024799] ? vfs_write+0x16a/0x190
[ 48.025082] do_group_exit+0x35/0x90
[ 48.025387] __x64_sys_exit_group+0xf/0x10
[ 48.025737] do_syscall_64+0x3d/0x130
[ 48.026056] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 48.026479] RIP: 0033:0x7f53b12082f6
[ 48.026792] ...
[ 48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
[ 48.045001] Fixing recursive fault but reboot is needed!
This happens in:
static void __fput(struct file *file)
{ ...
if (file->f_op->release)
file->f_op->release(inode, file); <<< cdev is kfree'd here
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(mode & FMODE_PATH))) {
cdev_put(inode->i_cdev); <<< cdev fields are accessed here
Namely:
__fput()
posix_clock_release()
kref_put(&clk->kref, delete_clock) <<< the last reference
delete_clock()
delete_ptp_clock()
kfree(ptp) <<< cdev is embedded in ptp
cdev_put
module_put(p->owner) <<< *p is kfree'd, bang!
Here cdev is embedded in posix_clock which is embedded in ptp_clock.
The race happens because ptp_clock's lifetime is controlled by two
refcounts: kref and cdev.kobj in posix_clock. This is wrong.
Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
created especially for such cases. This way the parent device with its
ptp_clock is not released until all references to the cdev are released.
This adds a requirement that an initialized but not exposed struct
device should be provided to posix_clock_register() by a caller instead
of a simple dev_t.
This approach was adopted from the commit 72139dfa2464 ("watchdog: Fix
the race between the release of watchdog_core_data and cdev"). See
details of the implementation in the commit 233ed09d7fda ("chardev: add
helper function to register char devs with a struct device").
Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 01:11:13 +0000 (03:11 +0200)]
net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
For first-generation switches (SJA1105E and SJA1105T):
- TPID means C-Tag (typically 0x8100)
- TPID2 means S-Tag (typically 0x88A8)
While for the second generation switches (SJA1105P, SJA1105Q, SJA1105R,
SJA1105S) it is the other way around:
- TPID means S-Tag (typically 0x88A8)
- TPID2 means C-Tag (typically 0x8100)
In other words, E/T tags untagged traffic with TPID, and P/Q/R/S with
TPID2.
So the patch mentioned below fixed VLAN filtering for P/Q/R/S, but broke
it for E/T.
We strive for a common code path for all switches in the family, so just
lie in the static config packing functions, for P/Q/R/S, by treating TPID and
TPID2 as if their bit offsets were swapped. This will make
both switches understand TPID to be ETH_P_8021Q and TPID2 to be
ETH_P_8021AD. The meaning from the original E/T was chosen over P/Q/R/S
because E/T is actually the one with public documentation available
(UM10944.pdf).
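The packing trick can be illustrated with a toy layout. The bit positions below are invented for the example (one 32-bit word with the hardware's "TPID" in the high half and "TPID2" in the low half) and do not reflect the real static config tables; `general_params_sketch`, `pack_et()` and `pack_pqrs()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Driver-side view, identical for both switch generations. */
struct general_params_sketch {
	uint16_t tpid;	/* C-Tag EtherType, typically 0x8100 */
	uint16_t tpid2;	/* S-Tag EtherType, typically 0x88A8 */
};

/* E/T: the hardware field names agree with the driver-side meaning. */
static uint32_t pack_et(const struct general_params_sketch *e)
{
	assert(e != NULL);
	return ((uint32_t)e->tpid << 16) | e->tpid2;
}

/* P/Q/R/S: deliberately swap the slots so callers see the same meaning. */
static uint32_t pack_pqrs(const struct general_params_sketch *e)
{
	assert(e != NULL);
	return ((uint32_t)e->tpid2 << 16) | e->tpid;
}
```

With this arrangement the common code always writes the C-Tag value to `tpid` and the S-Tag value to `tpid2`, and only the per-generation packing function knows where each lands in hardware.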
Fixes: f9a1a7646c0d ("net: dsa: sja1105: Reverse TPID and TPID2")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 01:08:07 +0000 (03:08 +0200)]
Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
Since commit 86db36a347b4 ("net: dsa: sja1105: Implement state machine
for TAS with PTP clock source"), this paragraph is no longer true. So
remove it.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 01:03:54 +0000 (03:03 +0200)]
net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
The check originates from the initial implementation which was not based
on PTP time but on a standalone clock source. In the meantime we can now
program the PTPSCHTM register at runtime with the dynamic base time
(actually with a value that is 200 ns smaller, to avoid writing DELTA=0
in the Schedule Entry Points Parameters Table). And we also have logic
for moving the actual base time in the future of the PHC's current time
base, so the check for zero serves no purpose, since even if the user
will specify zero, that's not what will end up in the static config
table where the limitation is.
Fixes: 86db36a347b4 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 01:01:50 +0000 (03:01 +0200)]
net: dsa: sja1105: Really make the PTP command read-write
When activating tc-taprio offload on the switch ports, the TAS state
machine will try to check whether it is running or not, but will find
both the STARTED and STOPPED bits as false in the
sja1105_tas_check_running function. So the function will return -EINVAL
(an abnormal situation) and the kernel will keep printing this from the
TAS FSM workqueue:
[ 37.691971] sja1105 spi0.1: An operation returned -22
The reason is that the underlying function that gets called,
sja1105_ptp_commit, does not actually do a SPI_READ, but a SPI_WRITE. So
the command buffer remains initialized with zeroes instead of retrieving
the hardware state. Fix that.
Fixes: 41603d78b362 ("net: dsa: sja1105: Make the PTP command read-write")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 27 Dec 2019 00:59:54 +0000 (02:59 +0200)]
net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
The PTP egress timestamp N must be captured from register PTPEGR_TS[n],
where n = 2 * PORT + TSREG. There are 10 PTPEGR_TS registers, 2 per
port. We are only using TSREG=0.
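The index formula above, written out as a helper. The port count of 5 follows from "10 PTPEGR_TS registers, 2 per port"; `NUM_PORTS_SKETCH` and `egress_ts_reg()` are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define NUM_PORTS_SKETCH 5	/* 10 timestamp registers, 2 per port */

/* Index of the egress timestamp register for a port and TSREG value. */
static int egress_ts_reg(int port, int tsreg)
{
	assert(port >= 0 && port < NUM_PORTS_SKETCH);
	assert(tsreg == 0 || tsreg == 1);

	return 2 * port + tsreg;
}
```

The index depends on the port alone (plus TSREG), so it stays correct regardless of which management slot carried the frame.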
This is unlike the management slots, of which there are 4
(SJA1105_NUM_PORTS, minus the CPU port). Any management frame (which
includes PTP frames) can be sent to any non-CPU port through any
management slot. When the CPU port is not the last port (#4), there will
be a mismatch between the slot and the port number.
Luckily, the only mainline occurrence with this switch
(arch/arm/boot/dts/ls1021a-tsn.dts) does have the CPU port as #4, so the
issue did not manifest itself thus far.
Fixes: 47ed985e97f5 ("net: dsa: sja1105: Add logic for TX timestamping")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe JAILLET [Thu, 26 Dec 2019 15:02:24 +0000 (16:02 +0100)]
sfc: avoid duplicate error handling code in 'efx_ef10_sriov_set_vf_mac()'
'eth_zero_addr()' is already called in the error handling path. This is
harmless, but there is no point in calling it twice, so remove one.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 30 Dec 2019 14:06:19 +0000 (06:06 -0800)]
tcp_cubic: refactor code to perform a divide only when needed
Neal Cardwell suggested to not change ca->delay_min
and apply the ack delay cushion only when Hystart ACK train
is still under consideration. This should avoid a 64bit
divide unless needed.
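A sketch of the restructuring described above: the cushion (and its 64-bit divide) is computed only while ACK train detection is still active, and delay_min itself is never modified. The cushion formula, the 1000 us cap and all names (`flow_sketch`, `ack_delay_us()`, `ack_train_exceeded()`, the `divides` counter) are assumptions made for illustration, not the tcp_cubic code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static unsigned int divides;	/* counts 64-bit divisions, illustration only */

struct flow_sketch {
	bool train_active;		/* ACK train still under consideration */
	uint32_t delay_min_us;		/* left unmodified by the check */
	uint64_t burst_bytes;		/* hypothetical burst size */
	uint64_t rate_bytes_per_sec;	/* hypothetical pacing rate */
};

/* Hypothetical cushion: time to send one burst at the pacing rate, capped. */
static uint32_t ack_delay_us(const struct flow_sketch *f)
{
	uint64_t us;

	if (!f->rate_bytes_per_sec)
		return 0;
	divides++;
	us = f->burst_bytes * 1000000ULL / f->rate_bytes_per_sec;
	return us < 1000 ? (uint32_t)us : 1000;
}

static bool ack_train_exceeded(const struct flow_sketch *f, uint32_t elapsed_us)
{
	assert(f != NULL);

	if (!f->train_active)
		return false;	/* early exit: no division performed */

	/* The threshold is built on the fly instead of altering delay_min. */
	return elapsed_us > f->delay_min_us + ack_delay_us(f);
}
```

The same threshold shape is visible in the debug lines of this message, for example "(116 > 93) delay_min 24 (+ ack_delay 69)", where 93 = 24 + 69.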
Tested:
40Gbit(mlx4) testbed (with sch_fq as packet scheduler)
$ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
14815
16280
15293
15563
11574
15145
14789
18548
16972
12520
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 1396 0.0
$ dmesg | tail -10
[ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
[ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
[ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
[ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
[ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
[ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
[ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
[ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
[ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
[ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152
10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)
$ echo -n 'file tcp_cubic.c +p' >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
7050
7065
7100
6900
7202
7263
7189
6869
7463
7034
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 3199 0.0
$ dmesg | tail -10
[ 176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
[ 179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
[ 181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
[ 183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
[ 185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
[ 187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
[ 190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
[ 192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
[ 194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
[ 196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399
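In every debug line above, the second number in the parentheses equals
delay_min plus ack_delay, which is consistent with the cushion being added
to the comparison threshold rather than folded into ca->delay_min. A small
Python sketch that checks this on a log line follows; the field labels
"elapsed" and "threshold" are names chosen here for the two parenthesised
numbers, not identifiers taken from the patch.

```python
import re

LINE_RE = re.compile(
    r"hystart_ack_train \((\d+) > (\d+)\) "
    r"delay_min (\d+) \(\+ ack_delay (\d+)\) cwnd (\d+)"
)


def parse_hystart_line(line: str) -> dict:
    """Extract the numeric fields of one hystart_ack_train debug line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a hystart_ack_train line")
    elapsed, threshold, delay_min, ack_delay, cwnd = map(int, m.groups())
    return {
        "elapsed": elapsed,
        "threshold": threshold,
        "delay_min": delay_min,
        "ack_delay": ack_delay,
        "cwnd": cwnd,
    }


def threshold_is_consistent(rec: dict) -> bool:
    """True if threshold == delay_min + ack_delay and it was exceeded."""
    return (rec["threshold"] == rec["delay_min"] + rec["ack_delay"]
            and rec["elapsed"] > rec["threshold"])


sample = ("[ 4873.951350] hystart_ack_train (116 > 93) "
          "delay_min 24 (+ ack_delay 69) cwnd 80")
record = parse_hystart_line(sample)
```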
Fixes: 42f3a8aaae66 ("tcp_cubic: tweak Hystart detection for short RTT flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Link: https://www.spinics.net/lists/netdev/msg621886.html
Link: https://www.spinics.net/lists/netdev/msg621797.html
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Mon, 30 Dec 2019 12:44:08 +0000 (18:14 +0530)]
cxgb4/cxgb4vf: fix flow control display for auto negotiation
As per 802.3-2005, Section Two, Annex 28B, Table 28B-2 [1], when
_only_ Rx pause is enabled, both symmetric and asymmetric pause
towards local device must be enabled. Also, firmware returns the local
device's flow control pause params as part of advertised capabilities
and negotiated params as part of current link attributes. So, fix up
ethtool's flow control pause params fetch logic to read from acaps,
instead of linkattr.
[1] https://standards.ieee.org/standard/802_3-2005.html
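The pause encoding referred to above can be sketched as a small truth
table. This is a Python illustration of the usual reading of Table 28B-2
(the same mapping that the kernel's mii_advertise_flowctrl() helper
applies), not the cxgb4 code; the function name is hypothetical.

```python
from typing import Tuple


def advertised_pause_bits(rx: bool, tx: bool) -> Tuple[bool, bool]:
    """Return (symmetric_pause, asymmetric_pause) to advertise.

    rx/tx are the pause directions wanted on the local device.
    """
    symmetric = rx          # set whenever Rx pause is wanted
    asymmetric = rx != tx   # set when exactly one direction is wanted
    return symmetric, asymmetric


# Rx-only pause advertises both bits, as the text above states.
rx_only = advertised_pause_bits(rx=True, tx=False)
```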
Fixes: c3168cabe1af ("cxgb4/cxgbvf: Handle 32-bit fw port capabilities")
Signed-off-by: Surendra Mobiya <surendra@chelsio.com>
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 30 Dec 2019 22:22:11 +0000 (14:22 -0800)]
Merge git://git./linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.
2) Document ingress hook in netdevice, also from Lukas.
3) Remove htons() in tunnel metadata port netlink attributes,
from Xin Long.
4) Missing erspan netlink attribute validation also from Xin Long.
5) Missing erspan version in tunnel, from Xin Long.
6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
Patch from Xin Long.
7) Missing nla_nest_cancel() in tunnel netlink dump path,
from Xin Long.
8) Remove two exported conntrack symbols with no clients,
from Florian Westphal.
9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.
10) Add nft_meta_pkttype helper for loopback, also from Florian.
11) Add nft_meta_socket uid helper, from Florian Westphal.
12) Add nft_meta_cgroup helper, from Florian.
13) Add nft_meta_ifkind helper, from Florian.
14) Group all interface related meta selector, from Florian.
15) Add nft_prandom_u32() helper, from Florian.
16) Add nft_meta_rtclassid helper, from Florian.
17) Add support for matching on the slave device index,
from Florian.
This batch, among other things, contains updates for the netfilter
tunnel netlink interface: This extension is still incomplete and lacking
proper userspace support which is actually my fault, I did not find the
time to go back and finish this. This update breaks the tunnel UAPI in
some aspects in order to fix it, but better to do that sooner than never.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 29 Dec 2019 23:29:16 +0000 (15:29 -0800)]
Linux 5.5-rc4
David S. Miller [Sun, 29 Dec 2019 20:29:14 +0000 (12:29 -0800)]
Merge branch 'mlxsw-fixes'
Ido Schimmel says:
====================
mlxsw: Couple of fixes
This patch set contains two fixes for mlxsw. Please consider both for
stable.
Patch #1 from Amit fixes a wrong check during MAC validation when
creating router interfaces (RIFs). Given a particular order of
configuration this can result in the driver refusing to create new RIFs.
Patch #2 fixes a wrong trap configuration in which VRRP packets and
routing exceptions were policed by the same policer towards the CPU. In
certain situations this can prevent VRRP packets from reaching the CPU.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 29 Dec 2019 11:40:23 +0000 (13:40 +0200)]
mlxsw: spectrum: Use dedicated policer for VRRP packets
Currently, VRRP packets and packets that hit exceptions during routing
(e.g., MTU error) are policed using the same policer towards the CPU.
This means, for example, that misconfiguration of the MTU on a routed
interface can prevent VRRP packets from reaching the CPU, which in turn
can cause the VRRP daemon to assume it is the Master router.
Fix this by using a dedicated policer for VRRP packets.
Fixes: 11566d34f895 ("mlxsw: spectrum: Add VRRP traps")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alex Veber <alexve@mellanox.com>
Tested-by: Alex Veber <alexve@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amit Cohen [Sun, 29 Dec 2019 11:40:22 +0000 (13:40 +0200)]
mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
When a router interface (RIF) is created the MAC address of the backing
netdev is verified to have the same MSBs as existing RIFs. This is
required in order to avoid changing existing RIF MAC addresses that all
share the same MSBs.
Loopback RIFs are special in this regard as they do not have a MAC
address, given they are only used to loop packets from the overlay to
the underlay.
Without this change, an error is returned when trying to create a RIF
after the creation of a GRE tunnel that is represented by a loopback
RIF. 'rif->dev->dev_addr' points to the GRE device's local IP, which
does not share the same MSBs as physical interfaces. Adding an IP
address to any physical interface results in:
Error: mlxsw_spectrum: All router interface MAC addresses must have the
same prefix.
Fix this by skipping loopback RIFs during MAC validation.
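The validation rule can be modelled as follows. This is a Python sketch
under stated assumptions, not the driver's code: the Rif class, the helper
names and the prefix_bits parameter are hypothetical, and the number of
MSBs that must match is deliberately left to the caller because it is not
stated here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Rif:
    mac: bytes               # 6 bytes taken from the backing netdev
    is_loopback: bool = False


def mac_prefix(mac: bytes, prefix_bits: int) -> int:
    """Return the top prefix_bits bits of a 48-bit MAC address."""
    return int.from_bytes(mac, "big") >> (48 - prefix_bits)


def new_mac_is_allowed(new_mac: bytes, rifs: Iterable[Rif],
                       prefix_bits: int) -> bool:
    """Check new_mac against existing RIFs, skipping loopback RIFs."""
    want = mac_prefix(new_mac, prefix_bits)
    for rif in rifs:
        if rif.is_loopback:
            continue  # no real MAC here; dev_addr holds an IP address
        if mac_prefix(rif.mac, prefix_bits) != want:
            return False
    return True


physical = Rif(bytes.fromhex("001122334455"))
# For a GRE device the "address" bytes are really a local IP (192.168.0.1).
gre = Rif(bytes.fromhex("c0a800010000"), is_loopback=True)
ok = new_mac_is_allowed(bytes.fromhex("001122334466"), [physical, gre], 40)
```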
Fixes: 74bc99397438 ("mlxsw: spectrum_router: Veto unsupported RIF MAC addresses")
Signed-off-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 29 Dec 2019 18:01:27 +0000 (10:01 -0800)]
Merge tag 'riscv/for-v5.5-rc4' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"One important fix for RISC-V:
- Redirect any incoming syscall with an ID less than -1 to
sys_ni_syscall, rather than allowing them to fall through into the
syscall handler.
and two minor build fixes:
- Export __asm_copy_{from,to}_user() from where they are defined.
This fixes a build error triggered by some randconfigs.
- Export flush_icache_all(). I'd resisted this before, since
historically we didn't want modules to be able to flush the I$
directly; but apparently everyone else is doing it now"
* tag 'riscv/for-v5.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: export flush_icache_all to modules
riscv: reject invalid syscalls below -1
riscv: fix compile failure with EXPORT_SYMBOL() & !MMU
Linus Torvalds [Sun, 29 Dec 2019 17:50:57 +0000 (09:50 -0800)]
Merge tag 'locks-v5.5-1' of git://git./linux/kernel/git/jlayton/linux
Pull /proc/locks formatting fix from Jeff Layton:
"This is a trivial fix for a _very_ long standing bug in /proc/locks
formatting. Ordinarily, I'd wait for the merge window for something
like this, but it is making it difficult to validate some overlayfs
fixes.
I've also gone ahead and marked this for stable"
* tag 'locks-v5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
locks: print unsigned ino in /proc/locks
Linus Torvalds [Sun, 29 Dec 2019 17:48:47 +0000 (09:48 -0800)]
Merge tag '5.5-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"One performance fix for large directory searches, and one minor style
cleanup noticed by Clang"
* tag '5.5-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Optimize readdir on reparse points
cifs: Adjust indentation in smb2_open_file
Amir Goldstein [Sun, 22 Dec 2019 18:45:28 +0000 (20:45 +0200)]
locks: print unsigned ino in /proc/locks
An ino is unsigned, so display it as such in /proc/locks.
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
David S. Miller [Sat, 28 Dec 2019 19:43:41 +0000 (11:43 -0800)]
Merge branch 'DSA-TX-tstamp'
Vladimir Oltean says:
====================
The DSA TX timestamping situation
This series is the moral v2 of "[PATCH net] net: dsa: sja1105: Fix
double delivery of TX timestamps to socket error queue" [0] which did
not manage to convince public opinion (actually it didn't convince me
either).
This fixes PTP timestamping on one particular board, where the DSA
switch is sja1105 and the master is gianfar. Unfortunately there is no
way to make the fix more general without committing logical
inaccuracies: the SKBTX_IN_PROGRESS flag does serve a purpose, even if
the sja1105 driver is not using it now: it prevents delivering a SW
timestamp to the app socket when the HW timestamp will be provided. So
not setting this flag (the approach from v1) might create avoidable
complications in the future (not to mention that there isn't any
satisfactory explanation on why that would be the correct solution).
So the goal of this change set is to create a more strict framework for
DSA master devices when attached to PTP switches, and to fix the first
master driver that is overstepping its duties and is delivering
unsolicited TX timestamps.
[0]: https://www.spinics.net/lists/netdev/msg619699.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 28 Dec 2019 13:30:46 +0000 (15:30 +0200)]
net: dsa: Deny PTP on master if switch supports it
It is possible to kill PTP on a DSA switch completely and absolutely,
until a reboot, with a simple command:
tcpdump -i eth2 -j adapter_unsynced
where eth2 is the switch's DSA master.
Why? Well, in short, the PTP API in place today is a bit rudimentary and
relies on applications to retrieve the TX timestamps by polling the
error queue and looking at the cmsg structure. But there is no timestamp
identification of any sort (except whether it's HW or SW): you don't
know how many more timestamps there are to come, which one this one is,
from whom it is, etc. In other words, the SO_TIMESTAMPING API is
fundamentally limited in that you can get a single HW timestamp from the
stack.
And the "-j adapter_unsynced" flag of tcpdump enables hardware
timestamping.
So let's imagine what happens when the DSA master decides it wants to
deliver TX timestamps to the skb's socket too:
- The timestamp that the user space sees is taken by the DSA master.
Whereas the RX timestamp will eventually be overwritten by the DSA
switch. So the RX and TX timestamps will be in different time bases
(aka garbage).
- The user space applications have no way to deal with the second (real)
TX timestamp finally delivered by the DSA switch, or even to know to
wait for it.
Take ptp4l from the linuxptp project, for example. This is its behavior
after running tcpdump, before the patch:
ptp4l[172]: [6469.594] Unexpected data on socket err queue:
ptp4l[172]: [6469.693] rms 8 max 16 freq -21257 +/- 11 delay 748 +/- 0
ptp4l[172]: [6469.711] Unexpected data on socket err queue:
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 05 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.721] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b1 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.838] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 06 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.848] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 13 02
ptp4l[172]: 0010 00 36 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 04 1a 45 05 7f
ptp4l[172]: 0030 00 00 5e 05 41 32 27 c2 1a 68 00 04 9f ff fe 05
ptp4l[172]: 0040 de 06 00 01
ptp4l[172]: [6469.855] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b2 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.974] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 07 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
The ptp4l program itself is heavily patched to show this (more details
here [0]). Otherwise, by default it just hangs.
On the other hand, with the DSA patch to disallow HW timestamping
applied:
tcpdump -i eth2 -j adapter_unsynced
tcpdump: SIOCSHWTSTAMP failed: Device or resource busy
So it is a fact of life that PTP timestamping on the DSA master is
incompatible with timestamping on the switch MAC, at least with the
current API. And if the switch supports PTP, taking the timestamps from
the switch MAC is highly preferable anyway, due to the fact that those
don't contain the queuing latencies of the switch. So just disallow PTP
on the DSA master if there is any PTP-capable switch attached.
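The resulting policy can be modelled in a few lines. This is a Python
sketch, not the DSA code: the classes and the function name are
hypothetical, and only the EBUSY result is taken from the tcpdump output
shown above.

```python
import errno
from dataclasses import dataclass, field
from typing import List


@dataclass
class SwitchPort:
    supports_hwtstamp: bool


@dataclass
class MasterDev:
    switch_ports: List[SwitchPort] = field(default_factory=list)


def master_hwtstamp_set(master: MasterDev) -> int:
    """Model of a SIOCSHWTSTAMP request arriving on the DSA master.

    Returns 0 on success or a negative errno, kernel style.
    """
    if any(port.supports_hwtstamp for port in master.switch_ports):
        return -errno.EBUSY  # timestamps must come from the switch
    return 0


with_ptp_switch = MasterDev([SwitchPort(False), SwitchPort(True)])
without_ptp_switch = MasterDev([SwitchPort(False)])
```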
[0]: https://sourceforge.net/p/linuxptp/mailman/message/36880648/
Fixes: 0336369d3a4d ("net: dsa: forward hardware timestamping ioctls to switch driver")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 28 Dec 2019 13:30:45 +0000 (15:30 +0200)]
gianfar: Fix TX timestamping with a stacked DSA driver
The driver wrongly assumes that it is the only entity that can set the
SKBTX_IN_PROGRESS bit of the current skb. Therefore, in the
gfar_clean_tx_ring function, where the TX timestamp is collected if
necessary, the aforementioned bit is used to discriminate whether or not
the TX timestamp should be delivered to the socket's error queue.
But a stacked driver such as a DSA switch can also set the
SKBTX_IN_PROGRESS bit, which is actually exactly what it should do in
order to denote that the hardware timestamping process is underway.
Therefore, gianfar would misinterpret the "in progress" bit as being its
own, and deliver a second skb clone in the socket's error queue,
completely throwing off a PTP process which is not expecting to receive
it, _even though_ TX timestamping is not enabled for gianfar.
There have been discussions [0] as to whether or not non-MAC drivers need
to set SKBTX_IN_PROGRESS at all (whose purpose is to avoid sending 2
timestamps, a sw and a hw one, to applications which only expect one).
But as of this patch, there are at least 2 PTP drivers that would break
in conjunction with gianfar: the sja1105 DSA switch and the felix
switch, by way of its ocelot core driver.
So regardless of that conclusion, fix the gianfar driver to not do stuff
based on flags set by others and not intended for it.
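The difference between the two behaviours can be sketched as below. This
is a Python model, not the gianfar patch: the function names and the
mac_tx_tstamp_enabled parameter are hypothetical, and gating on the MAC's
own enable state is just one way to avoid acting on a flag set by a
stacked driver.

```python
SKBTX_IN_PROGRESS = 1 << 2  # bit value used for illustration


def deliver_tstamp_buggy(tx_flags: int, mac_tx_tstamp_enabled: bool) -> bool:
    """Old behaviour: trust the flag regardless of who set it."""
    return bool(tx_flags & SKBTX_IN_PROGRESS)


def deliver_tstamp_fixed(tx_flags: int, mac_tx_tstamp_enabled: bool) -> bool:
    """Sketch of a fix: also require that this MAC asked for timestamps."""
    return mac_tx_tstamp_enabled and bool(tx_flags & SKBTX_IN_PROGRESS)


# The DSA switch set the flag; TX timestamping is disabled on the MAC.
flags_from_switch = SKBTX_IN_PROGRESS
```

In the buggy variant the socket's error queue receives a second clone
that the PTP application does not expect.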
[0]: https://www.spinics.net/lists/netdev/msg619699.html
Fixes: f0ee7acfcdd4 ("gianfar: Add hardware TX timestamping support")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chen Zhou [Sat, 28 Dec 2019 03:09:47 +0000 (11:09 +0800)]
net/wan/fsl_ucc_hdlc: remove set but not used variables 'ut_info' and 'ret'
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/wan/fsl_ucc_hdlc.c: In function ucc_hdlc_irq_handler:
drivers/net/wan/fsl_ucc_hdlc.c:643:23:
warning: variable ut_info set but not used [-Wunused-but-set-variable]
drivers/net/wan/fsl_ucc_hdlc.c: In function uhdlc_suspend:
drivers/net/wan/fsl_ucc_hdlc.c:880:23:
warning: variable ut_info set but not used [-Wunused-but-set-variable]
drivers/net/wan/fsl_ucc_hdlc.c: In function uhdlc_resume:
drivers/net/wan/fsl_ucc_hdlc.c:925:6:
warning: variable ret set but not used [-Wunused-but-set-variable]
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Olof Johansson [Tue, 17 Dec 2019 04:07:04 +0000 (20:07 -0800)]
riscv: export flush_icache_all to modules
This is needed by LKDTM (crash dump test module), which calls
flush_icache_range(), which on RISC-V turns into flush_icache_all(). On
other architectures, the actual implementation is exported, so follow
that precedent and export it here too.
Fixes build of CONFIG_LKDTM that fails with:
ERROR: "flush_icache_all" [drivers/misc/lkdtm/lkdtm.ko] undefined!
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
David Abdurachmanov [Wed, 18 Dec 2019 08:47:56 +0000 (10:47 +0200)]
riscv: reject invalid syscalls below -1
Running "stress-ng --enosys 4 -t 20 -v" showed a large number of kernel oopses
with the "Unable to handle kernel paging request at virtual address" message. This
happens when the enosys stressor starts testing random invalid syscalls.
I forgot to redirect any syscall below -1 to sys_ni_syscall.
With the patch, the kernel oops messages are gone while running the stress-ng
enosys stressor.
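The bounds check can be modelled as follows. This is a Python sketch of
the idea only, not the RISC-V entry code: the table, the dispatch()
helper, and the treatment of -1 as a separately handled value are
assumptions made for illustration.

```python
import errno


def sys_ni_syscall(*args):
    """Stand-in for the 'not implemented' handler."""
    return -errno.ENOSYS


def dispatch(table, nr, *args):
    """Look up syscall nr in table, rejecting out-of-range numbers."""
    if nr == -1:
        return None  # assumed to be dealt with by a separate path
    if nr < -1 or nr >= len(table):
        return sys_ni_syscall(*args)
    return table[nr](*args)


table = [lambda: 0, lambda: 7]
result_valid = dispatch(table, 1)
result_negative = dispatch(table, -2)
result_too_big = dispatch(table, 99)
```

Without the "nr < -1" test, Python would quietly index from the end of
the list; in the kernel the equivalent is an out-of-bounds table read.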
Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com>
Fixes: 5340627e3fe0 ("riscv: add support for SECCOMP and SECCOMP_FILTER")
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Luc Van Oostenryck [Sun, 22 Dec 2019 09:26:04 +0000 (10:26 +0100)]
riscv: fix compile failure with EXPORT_SYMBOL() & !MMU
When support for !MMU was added, the declarations of
__asm_copy_to_user() & __asm_copy_from_user() were #ifdefed
out, hence their EXPORT_SYMBOL() gives an error message like:
.../riscv_ksyms.c:13:15: error: '__asm_copy_to_user' undeclared here
.../riscv_ksyms.c:14:15: error: '__asm_copy_from_user' undeclared here
Since these symbols are not defined with !MMU it's wrong to export them.
Same for __clear_user() (even though this one is also declared in
include/asm-generic/uaccess.h and thus doesn't give an error message).
Fix this by doing the EXPORT_SYMBOL() directly where these symbols
are defined: inside lib/uaccess.S itself.
Fixes: 6bd33e1ece52 ("riscv: fix compile failure with EXPORT_SYMBOL() & !MMU")
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Linus Torvalds [Sat, 28 Dec 2019 01:28:41 +0000 (17:28 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four fixes and one spelling update, all in drivers: two in lpfc and
the rest in mpt3sas, cxgbi and target"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target/iblock: Fix protection error with blocks greater than 512B
scsi: libcxgbi: fix NULL pointer dereference in cxgbi_device_destroy()
scsi: lpfc: fix spelling mistakes of asynchronous
scsi: lpfc: fix build failure with DEBUGFS disabled
scsi: mpt3sas: Fix double free in attach error handling
David S. Miller [Sat, 28 Dec 2019 00:40:02 +0000 (16:40 -0800)]
Merge branch 'ethtool-netlink-part-one'
Michal Kubecek says:
====================
ethtool netlink interface, part 1
This is the first part of a netlink based alternative userspace interface for
ethtool. It aims to address some long known issues with the ioctl
interface, mainly lack of extensibility, raciness, limited error reporting
and absence of notifications. The goal is to allow userspace ethtool
utility to provide all features it currently does but without using the
ioctl interface. However, some features provided by ethtool ioctl API will
be available through other netlink interfaces (rtnetlink, devlink) if it's
more appropriate.
The interface uses generic netlink family "ethtool" and provides multicast
group "monitor" which is used for notifications. Documentation for the
interface is in Documentation/networking/ethtool-netlink.rst file. The
netlink interface is optional, it is built when CONFIG_ETHTOOL_NETLINK
(bool) option is enabled.
There are three types of request messages distinguished by suffix "_GET"
(query for information), "_SET" (modify parameters) and "_ACT" (perform an
action). Kernel reply messages have name with additional suffix "_REPLY"
(e.g. ETHTOOL_MSG_SETTINGS_GET_REPLY). Most "_SET" and "_ACT" message types
do not have matching reply type as only some of them need additional reply
data beyond numeric error code and extack. Kernel also broadcasts
notification messages ("_NTF" suffix) on changes.
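The naming scheme can be illustrated with a small classifier. This is a
Python sketch for illustration; classify_message() is a hypothetical
helper, and the message names passed to it are the ones mentioned in this
series.

```python
def classify_message(name: str) -> str:
    """Classify an ETHTOOL_MSG_* name by its suffix."""
    if not name.startswith("ETHTOOL_MSG_"):
        raise ValueError("not an ethtool netlink message name")
    # Check the kernel-originated suffixes first: a reply name such as
    # ..._GET_REPLY also contains the request suffix.
    if name.endswith("_NTF"):
        return "notification"
    if name.endswith("_REPLY"):
        return "reply"
    for suffix, kind in (("_GET", "get request"),
                         ("_SET", "set request"),
                         ("_ACT", "action request")):
        if name.endswith(suffix):
            return kind
    raise ValueError("unknown suffix")


kinds = [classify_message(n) for n in (
    "ETHTOOL_MSG_SETTINGS_GET_REPLY",
    "ETHTOOL_MSG_LINKINFO_SET",
    "ETHTOOL_MSG_LINKMODES_NTF",
)]
```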
Basic concepts:
- make extensions easier not only by allowing new attributes but also by
imposing as few artificial limits as possible, e.g. by using arbitrary
size bit sets for most bitmap attributes or by not using fixed size
strings
- use extack for error reporting and warnings
- send netlink notifications on changes (even if they were done using the
ioctl interface) and actions
- avoid the racy read/modify/write cycle between kernel and userspace by
sending only attributes which userspace wants to change; there is still
a read/modify/write cycle between generic kernel code and ethtool_ops
handler in NIC driver but it is only in kernel and under RTNL lock
- reduce the number of name lists that need to be kept in sync between
kernel and userspace (e.g. recognized link modes)
- where feasible, allow dump requests to query specific information for all
network devices
- as parsing and generating netlink messages is more complicated than
simply copying data structures between userspace API and ethtool_ops
handlers (which most ioctl commands do), split the code into multiple
files in net/ethtool directory; move net/core/ethtool.c also to this
directory and rename it to ioctl.c
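The point about avoiding the read/modify/write cycle can be shown with a
toy model. This is a Python sketch, not the kernel helpers: the function
name and the settings dictionary are hypothetical, and the dictionary
stands in for the state kept on the kernel side.

```python
def apply_set_request(current: dict, requested: dict) -> bool:
    """Apply only the attributes present in the request.

    Returns True if any value actually changed.
    """
    modified = False
    for key, value in requested.items():
        if key not in current:
            raise KeyError(key)
        if current[key] != value:
            current[key] = value
            modified = True
    return modified


settings = {"autoneg": True, "speed": 1000, "duplex": "full"}
# Client A only changes speed, client B only changes duplex. Neither
# sends back a stale copy of the other's attribute, so nothing is lost.
changed_a = apply_set_request(settings, {"speed": 100})
changed_b = apply_set_request(settings, {"duplex": "half"})
```

With a whole-structure interface, client B would first read all
settings, and writing them back could undo client A's change.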
Changes between v8 and v9:
- fix ethnl_update_u8()
- fix description of ETHTOOL_A_LINKSTATE_LINK in rst file
- add explanation of verbose vs. compact bitset usage to documentation
- link ethtool-netlink.rst into toctree
Main changes between v7 and v8:
- preliminary patches sent as a separate series (already in net-next)
- split notification related changes out of _SET patches
- drop request specific flags from common header
- use FLAG/flag rather than GFLAG/gflag for global flags (as there are
only global flags now)
- allow device names up to ALTIFNAMSIZ characters
- rename ETHTOOL_A_BITSET_LIST to ETHTOOL_A_BITSET_NOMASK
- rename ETHTOOL_A_BIT{,S}_* to ETHTOOL_A_BITSET_BIT{,S}_*
- use standard bitset helpers for link modes (rather than in-place
conversion)
- use "default" rather than "standard" for unified _GET handlers
- fixed 64-bit big endian bitset code
Main changes between v6 and v7:
- split complex messages into small single purpose ones (drop info and
request masks and one level of nesting)
- separate request information and reply data into two structures
- refactor bitset handling (no simultaneous u32/ulong handling but avoid
kmalloc() except for long bitmaps on 64-bit big endian architectures)
- use only fixed size strings internally (will be replaced by char *
eventually but that will require rewriting also existing ioctl code)
- rework ethnl_update_* helpers to return error code
- rename request flag constants (to ETHTOOL_[GR]FLAG_ prefix)
- convert documentation to rst
Main changes between v5 and v6:
- use ETHTOOL_MSG_ prefix for message types
- replace ETHA_ prefix for netlink attributes by ETHTOOL_A_
- replace ETH_x_IM_y for infomask bits by ETHTOOL_IM_x_y
- split GET reply types from SET requests and notifications
- split kernel and userspace message types into different enums
- remove INFO_GET requests from submitted part
- drop EVENT notifications (use rtnetlink and on-demand string set load)
- reorganize patches to reduce the number of intermittent warnings
- unify request/reply header and its processing
- another nest around strings in a string set for consistency
- more consistent identifier naming
- coding style cleanup
- get rid of some of the helpers
- set bad attribute in extack where applicable
- various bug fixes
- improve documentation and code comments, more kerneldoc comments
- more verbose commit messages
Changes between v4 and v5:
- do not panic on failed initialization, only WARN()
Main changes between RFC v3 and v4:
- use more kerneldoc style comments
- strict attribute policy checking
- use macros for tables of link mode names and parameters
- provide permanent hardware address in rtnetlink
- coding style cleanup
- split too long patches, reorder
- wrap more ETHA_SETTINGS_* attributes in nests
- add also some SET_* implementation into submitted part
Main changes between RFC v2 and RFC v3:
- do not allow building as a module (no netdev notifiers needed)
- drop some obsolete fields
- add permanent hw address, timestamping and private flags support
- rework bitset handling to get rid of variable length arrays
- notify monitor on device renames
- restructure GET_SETTINGS/SET_SETTINGS messages
- split too long patches and submit only first part of the series
Main changes between RFC v1 and RFC v2:
- support dumps for all "get" requests
- provide notifications for changes related to supported request types
- support getting string sets (both global and per device)
- support getting/setting device features
- get rid of family specific header, everything passed as attributes
- split netlink code into multiple files in net/ethtool/ directory
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:56:23 +0000 (15:56 +0100)]
ethtool: provide link state with LINKSTATE_GET request
Implement LINKSTATE_GET netlink request to get link state information.
At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command
is returned.
LINKSTATE_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:56:18 +0000 (15:56 +0100)]
ethtool: add LINKMODES_NTF notification
Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link
settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET
netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
The notification message has the same format as reply to LINKMODES_GET
request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the
notification if there is a change but the ioctl command handlers do not
check if there is an actual change and trigger the notification whenever
the commands are executed.
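The trigger rule described above amounts to the following. This is a
Python sketch with a hypothetical function name; "origin" simply labels
which interface made the request.

```python
def should_notify(origin: str, changed: bool) -> bool:
    """Decide whether a notification is sent after a set operation."""
    if origin == "netlink":
        return changed   # only when something was actually modified
    if origin == "ioctl":
        return True      # ioctl handlers do not check for a change
    raise ValueError("unknown origin")


decisions = {
    ("netlink", True): should_notify("netlink", True),
    ("netlink", False): should_notify("netlink", False),
    ("ioctl", False): should_notify("ioctl", False),
}
```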
As all work is done by ethnl_default_notify() handler and callback
functions introduced to handle LINKMODES_GET requests, all that remains is
adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and
ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
needed.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:56:13 +0000 (15:56 +0100)]
ethtool: set link modes related data with LINKMODES_SET request
Implement LINKMODES_SET netlink request to set advertised linkmodes and
related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do.
The request allows setting autonegotiation flag, speed, duplex and
advertised link modes.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:56:08 +0000 (15:56 +0100)]
ethtool: provide link mode information with LINKMODES_GET request
Implement LINKMODES_GET netlink request to get link modes related
information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl
commands.
This request provides supported, advertised and peer advertised link modes,
autonegotiation flag, speed and duplex.
LINKMODES_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:56:03 +0000 (15:56 +0100)]
ethtool: add LINKINFO_NTF notification
Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link
settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or
ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.
The notification message has the same format as reply to LINKINFO_GET
request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the
notification if there is a change but the ioctl command handlers do not
check if there is an actual change and trigger the notification whenever
the commands are executed.
As all work is done by ethnl_default_notify() handler and callback
functions introduced to handle LINKINFO_GET requests, all that remains is
adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and
ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
needed.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:58 +0000 (15:55 +0100)]
ethtool: add default notification handler
The ethtool netlink notifications have the same format as related GET
replies so that if generic GET handling framework is used to process GET
requests, its callbacks and instance of struct get_request_ops can be
also used to compose corresponding notification message.
Provide function ethnl_std_notify() to be used as notification handler in
ethnl_notify_handlers table.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:53 +0000 (15:55 +0100)]
ethtool: set link settings with LINKINFO_SET request
Implement LINKINFO_SET netlink request to set link settings queried by
LINKINFO_GET message.
Only physical port, phy MDIO address and MDI(-X) control can be set;
an attempt to modify MDI(-X) status or transceiver is rejected.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:48 +0000 (15:55 +0100)]
ethtool: provide link settings with LINKINFO_GET request
Implement LINKINFO_GET netlink request to get basic link settings provided
by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands.
This request provides settings not directly related to autonegotiation and
link mode selection: physical port, phy MDIO address, MDI(-X) status,
MDI(-X) control and transceiver.
LINKINFO_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:43 +0000 (15:55 +0100)]
ethtool: provide string sets with STRSET_GET request
Requests the contents of one or more string sets, i.e. indexed arrays of
strings; this information is provided by ETHTOOL_GSSET_INFO and
ETHTOOL_GSTRINGS commands of ioctl interface. Unlike ioctl interface, all
information can be retrieved with one request and multiple string sets can
be requested at once.
There are three types of requests:
- no NLM_F_DUMP, no device: get "global" stringsets
- no NLM_F_DUMP, with device: get string sets related to the device
- NLM_F_DUMP, no device: get device related string sets for all devices
Client can request either all string sets of given type (global or device
related) or only specific sets. With ETHTOOL_A_STRSET_COUNTS flag set, only
set sizes (numbers of strings) are returned.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:38 +0000 (15:55 +0100)]
ethtool: default handlers for GET requests
Significant part of GET request processing is common for most request
types but unfortunately it cannot be easily separated from type specific
code as we need to alternate between common actions (parsing common request
header, allocating message and filling netlink/genetlink headers etc.) and
specific actions (querying the device, composing the reply). The processing
also happens in three different situations: "do" request, "dump" request
and notification, each doing things in slightly different way.
The request specific code is implemented in four or five callbacks defined
in an instance of struct get_request_ops:
parse_request() - parse incoming message
prepare_data() - retrieve data from driver or NIC
reply_size() - estimate reply message size
fill_reply() - compose reply message
cleanup_data() - (optional) clean up additional data
Other members of struct get_request_ops describe the data structure holding
information from client request and data used to compose the message. The
default handlers ethnl_default_doit(), ethnl_default_dumpit(),
ethnl_default_start() and ethnl_default_done() can be then used in genl_ops
handler. Notification handler will be introduced in a later patch.
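The split between common code and type specific callbacks can be pictured with a small userspace sketch. Everything below is hypothetical (struct demo_get_ops, demo_default_doit() and the text based "messages" are illustration only) and it leaves out message allocation and netlink headers; it is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct demo_req { int dev; };
struct demo_data { int value; };

/* Hypothetical, simplified counterpart of struct get_request_ops. */
struct demo_get_ops {
	int (*parse_request)(struct demo_req *req, const char *msg);
	int (*prepare_data)(const struct demo_req *req, struct demo_data *data);
	int (*reply_size)(const struct demo_data *data);
	int (*fill_reply)(char *buf, size_t len, const struct demo_data *data);
	void (*cleanup_data)(struct demo_data *data);	/* optional */
};

/* Common "do" path: generic steps alternate with type specific callbacks. */
static int demo_default_doit(const struct demo_get_ops *ops, const char *msg,
			     char *buf, size_t len)
{
	struct demo_req req = { 0 };
	struct demo_data data = { 0 };
	int ret;

	assert(ops && msg && buf);

	ret = ops->parse_request(&req, msg);
	if (ret < 0)
		return ret;
	ret = ops->prepare_data(&req, &data);
	if (ret < 0)
		return ret;
	ret = ops->reply_size(&data);
	if (ret < 0 || (size_t)ret > len)
		ret = -1;	/* reply would not fit into the buffer */
	else
		ret = ops->fill_reply(buf, len, &data);
	if (ops->cleanup_data)
		ops->cleanup_data(&data);
	return ret;
}

/* One request type: only the four mandatory callbacks are provided. */
static int demo_parse(struct demo_req *req, const char *msg)
{
	return sscanf(msg, "dev=%d", &req->dev) == 1 ? 0 : -1;
}

static int demo_prepare(const struct demo_req *req, struct demo_data *data)
{
	data->value = req->dev * 10;	/* stand-in for querying the driver */
	return 0;
}

static int demo_size(const struct demo_data *data)
{
	return snprintf(NULL, 0, "value=%d", data->value) + 1;
}

static int demo_fill(char *buf, size_t len, const struct demo_data *data)
{
	return snprintf(buf, len, "value=%d", data->value);
}

static const struct demo_get_ops demo_ops = {
	.parse_request	= demo_parse,
	.prepare_data	= demo_prepare,
	.reply_size	= demo_size,
	.fill_reply	= demo_fill,
};
```

Adding another request type in this sketch means writing another set of callbacks and another ops instance; demo_default_doit() stays untouched.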
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:33 +0000 (15:55 +0100)]
ethtool: support for netlink notifications
Add infrastructure for ethtool netlink notifications. There is only one
multicast group "monitor" which is used to notify userspace about changes
and actions performed. Notification messages (types using suffix _NTF)
share the format with replies to GET requests.
Notifications are supposed to be broadcast on every configuration change,
whether it is done using the netlink interface or ioctl one. Netlink SET
requests only trigger a notification if some data is actually changed.
To trigger an ethtool notification, both ethtool netlink and external code
use ethtool_notify() helper. This helper requires RTNL to be held and may
sleep. Handlers sending messages for specific notification message types
are registered in ethnl_notify_handlers array. As notifications can be
triggered from other code, ethnl_ok flag is used to prevent an attempt to
send notification before genetlink family is registered.
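As a rough illustration of the dispatch described above, the following userspace sketch uses a handler table indexed by message type and a flag that suppresses notifications until registration is done. All names (demo_notify(), demo_ok, DEMO_MSG_LINKINFO_NTF) are made up for the example; locking and the actual multicast are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { DEMO_MSG_LINKINFO_NTF = 1, DEMO_MSG_MAX };

typedef void (*demo_notify_handler_t)(unsigned int cmd, void *data);

static bool demo_ok;			/* set once the family is registered */
static unsigned int demo_sent;		/* counts notifications for the demo */

static void demo_default_notify(unsigned int cmd, void *data)
{
	(void)cmd;
	(void)data;
	demo_sent++;
}

/* Lookup table indexed by notification message type. */
static const demo_notify_handler_t demo_notify_handlers[DEMO_MSG_MAX] = {
	[DEMO_MSG_LINKINFO_NTF] = demo_default_notify,
};

static void demo_notify(unsigned int cmd, void *data)
{
	if (!demo_ok)		/* too early: the family is not registered */
		return;
	if (cmd < DEMO_MSG_MAX && demo_notify_handlers[cmd])
		demo_notify_handlers[cmd](cmd, data);
}

/* Example: nothing is sent before registration, one message after it. */
static void demo_notify_example(void)
{
	demo_notify(DEMO_MSG_LINKINFO_NTF, NULL);
	assert(demo_sent == 0);
	demo_ok = true;
	demo_notify(DEMO_MSG_LINKINFO_NTF, NULL);
	assert(demo_sent == 1);
}
```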
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:28 +0000 (15:55 +0100)]
ethtool: netlink bitset handling
The ethtool netlink code uses common framework for passing arbitrary
length bit sets to allow future extensions. A bitset can be a list (only
one bitmap) or can consist of value and mask pair (used e.g. when a client
wants to modify only some bits). A bitset can use one of two formats:
verbose (bit by bit) or compact.
Verbose format consists of bitset size (number of bits), list flag and
an array of bit nests, telling which bits are part of the list or which
bits are in the mask and which of them are to be set. In requests, bits
can be identified by index (position) or by name. In replies, kernel
provides both index and name. Verbose format is suitable for "one shot"
applications like standard ethtool command as it avoids the need to
either keep bit names (e.g. link modes) in sync with kernel or having to
add an extra roundtrip for string set request (e.g. for private flags).
Compact format uses one (list) or two (value/mask) arrays of 32-bit
words to store the bitmap(s). It is more suitable for long running
applications (ethtool in monitor mode or network management daemons)
which can retrieve the names once and then pass only compact bitmaps to
save space.
Userspace requests can use either format; ETHTOOL_FLAG_COMPACT_BITSETS
flag in request header tells kernel which format to use in reply.
Notifications always use compact format.
As some code uses arrays of unsigned long for internal representation and
some arrays of u32 (or even a single u32), two sets of parse/compose
helpers are introduced. To avoid code duplication, helpers for unsigned
long arrays are implemented as wrappers around helpers for u32 arrays.
There are two reasons for this choice: (1) u32 arrays are more frequent in
ethtool code and (2) unsigned long array can be always interpreted as a
u32 array on little endian 64-bit and all 32-bit architectures while we
would need special handling for odd number of u32 words in the opposite
direction.
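To make the value/mask semantics of the compact format concrete, here is a minimal sketch that applies such a pair to a bitmap held in 32-bit words. The function name demo_bitset_apply() and its signature are assumptions for illustration; the real helpers also parse netlink attributes and validate sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Apply a compact value/mask bitset to a target bitmap stored as 32-bit
 * words: bits selected by mask take the corresponding bit from value, all
 * other bits are left untouched. Returns true if the target changed.
 */
static bool demo_bitset_apply(uint32_t *target, const uint32_t *value,
			      const uint32_t *mask, size_t nwords)
{
	bool mod = false;
	size_t i;

	assert(target && value && mask);

	for (i = 0; i < nwords; i++) {
		uint32_t updated = (target[i] & ~mask[i]) | (value[i] & mask[i]);

		if (updated != target[i]) {
			target[i] = updated;
			mod = true;
		}
	}
	return mod;
}
```

The return value is what lets a caller decide whether anything was actually modified, e.g. before sending a notification.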
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:23 +0000 (15:55 +0100)]
ethtool: helper functions for netlink interface
Add common request/reply header definition and helpers to parse request
header and fill reply header. Provide ethnl_update_* helpers to update
structure members from request attributes (to be used for *_SET requests).
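The general shape of such an update helper can be sketched as follows. This is a userspace analog with a hypothetical name (demo_update_u32()), where a plain pointer stands in for a netlink attribute and NULL means "attribute not present"; it only illustrates the "update and remember whether anything changed" pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical analog of an ethnl_update_* helper. *mod is only ever set,
 * never cleared, so it accumulates over several calls.
 */
static void demo_update_u32(uint32_t *dst, const uint32_t *attr, bool *mod)
{
	assert(dst && mod);

	if (!attr)		/* attribute absent: leave the member alone */
		return;
	if (*dst == *attr)	/* same value: nothing changed */
		return;

	*dst = *attr;
	*mod = true;
}
```

A SET handler built this way would call one helper per attribute and act on the accumulated flag once at the end.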
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Fri, 27 Dec 2019 14:55:18 +0000 (15:55 +0100)]
ethtool: introduce ethtool netlink interface
Basic genetlink and init infrastructure for the netlink interface, register
genetlink family "ethtool". Add CONFIG_ETHTOOL_NETLINK Kconfig option to
make the build optional. Add initial overall interface description into
Documentation/networking/ethtool-netlink.rst, further patches will add more
detailed information.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Blumenstingl [Thu, 26 Dec 2019 19:01:01 +0000 (20:01 +0100)]
net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
GXBB and newer SoCs use the fixed FCLK_DIV2 (1GHz) clock as input for
the m250_sel clock. Meson8b and Meson8m2 use MPLL2 instead, whose rate
can be adjusted at runtime.
So far we have been running MPLL2 with ~250MHz (and the internal
m250_div with value 1), which worked enough that we could transfer data
with a TX delay of 4ns. Unfortunately there is high packet loss with
an RGMII PHY when transferring data (receiving data works fine though).
Odroid-C1's u-boot is running with a TX delay of only 2ns as well as
the internal m250_div set to 2 - no lost (TX) packets can be observed
with that setting in u-boot.
Manual testing has shown that the TX packet loss goes away when using
the following settings in Linux (the vendor kernel uses the same
settings):
- MPLL2 clock set to ~500MHz
- m250_div set to 2
- TX delay set to 2ns on the MAC side
Update the m250_div divider settings to only accept dividers greater than
or equal to 2 to fix the TX delay generated by the MAC.
iperf3 results before the change:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 182 MBytes 153 Mbits/sec 514 sender
[ 5] 0.00-10.00 sec 182 MBytes 152 Mbits/sec receiver
iperf3 results after the change (including an updated TX delay of 2ns):
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-10.00 sec 927 MBytes 778 Mbits/sec 0 sender
[ 5] 0.00-10.01 sec 927 MBytes 777 Mbits/sec receiver
Fixes: 4f6a71b84e1afd ("net: stmmac: dwmac-meson8b: fix internal RGMII clock configuration")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shmulik Ladkani [Wed, 25 Dec 2019 08:51:01 +0000 (10:51 +0200)]
net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
There's no skb_pull performed when a mirred action is set at egress of a
mac device, with a target device/action that expects skb->data to point
at the network header.
As a result, either the target device is erroneously given an skb with
data pointing to the mac (egress case), or the net stack receives the
skb with data pointing to the mac (ingress case).
E.g:
# tc qdisc add dev eth9 root handle 1: prio
# tc filter add dev eth9 parent 1: prio 9 protocol ip handle 9 basic \
action mirred egress redirect dev tun0
(tun0 is a tun device. result: tun0 erroneously gets the eth header
instead of the iph)
Revise the push/pull logic of tcf_mirred_act() to not rely on the
skb_at_tc_ingress() vs tcf_mirred_act_wants_ingress() comparison, as it
does not cover all "pull" cases.
Instead, calculate whether the required action on the target device
requires the data to point at the network header, and compare this to
whether skb->data points to network header - and make the push/pull
adjustments as necessary.
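The revised decision boils down to comparing "where the data pointer is" with "where the target wants it". A toy model, assuming a hypothetical struct demo_pkt that tracks only an offset instead of a real skb:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy packet: data_off plays the role of skb->data inside the frame. */
struct demo_pkt {
	size_t mac_len;		/* length of the link layer header */
	size_t data_off;	/* 0: at mac header, mac_len: at network header */
};

/*
 * Move data_off to where the target expects it. want_nh is true when the
 * target needs the data to start at the network header.
 */
static void demo_adjust(struct demo_pkt *pkt, bool want_nh)
{
	bool at_nh = pkt->data_off == pkt->mac_len;

	assert(pkt->data_off == 0 || at_nh);

	if (at_nh == want_nh)
		return;				/* already in the right place */
	if (want_nh)
		pkt->data_off += pkt->mac_len;	/* "pull" the mac header */
	else
		pkt->data_off -= pkt->mac_len;	/* "push" it back */
}
```

Because the check looks at the packet itself rather than at the ingress/egress combination, the egress-to-tun case from the example above falls out naturally.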
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Shmulik Ladkani <sladkani@proofpoint.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Kou [Wed, 25 Dec 2019 08:27:25 +0000 (08:27 +0000)]
sctp: do trace_sctp_probe after SACK validation and check
The function sctp_sf_eat_sack_6_2 now performs the Verification
Tag validation, Chunk length validation, Bogus check, and also
the detection of out-of-order SACK based on the RFC2960
Section 6.2 at the beginning, and finally performs the further
processing of SACK. The trace_sctp_probe is currently triggered before
the above necessary validation and checks.
This patch moves trace_sctp_probe after the chunk sanity
tests, but keeps tracing if the received SACK is out of order,
since an out-of-order SACK is valuable for congestion control
debugging.
v1->v2:
- keep doing SCTP trace if the SACK is out of order, as per Marcelo's
suggestion.
v2->v3:
- regenerate the patch, as v2 was generated on top of v1, and add the
'net-next' tag to the new one, as per Marcelo's comments.
Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikita Yushchenko [Wed, 25 Dec 2019 05:22:38 +0000 (08:22 +0300)]
mv88e6xxx: Add serdes Rx statistics
If packet checker is enabled in the serdes, then Rx counter registers
start working, and no side effects have been detected.
This patch enables packet checker automatically when powering serdes on,
and exposes Rx counter registers via ethtool statistics interface.
Code partially based on an older attempt by Andrew Lunn.
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 24 Dec 2019 12:51:28 +0000 (20:51 +0800)]
net: ena: remove set but not used variable 'rx_ring'
drivers/net/ethernet/amazon/ena/ena_netdev.c: In function ena_xdp_xmit_buff:
drivers/net/ethernet/amazon/ena/ena_netdev.c:316:19: warning:
variable rx_ring set but not used [-Wunused-but-set-variable]
commit
548c4940b9f1 ("net: ena: Implement XDP_TX action")
left behind this unused variable.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mao Wenan [Tue, 24 Dec 2019 11:58:12 +0000 (19:58 +0800)]
net: dsa: qca: ar9331: drop pointless static qualifier in ar9331_sw_mbus_init
There is no need to make the variable 'mbus' static,
since a new value is always assigned before it is used.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xu Wang [Tue, 24 Dec 2019 09:37:04 +0000 (09:37 +0000)]
ppp: Remove redundant BUG_ON() check in ppp_pernet
Passing NULL to ppp_pernet causes a crash via BUG_ON.
Dereferencing net in net_generic() also has the same effect.
This patch removes the redundant BUG_ON check on the same parameter.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 28 Dec 2019 00:29:15 +0000 (16:29 -0800)]
Merge branch 'tcp_cubic-various-fixes'
Eric Dumazet says:
====================
tcp_cubic: various fixes
This patch series converts tcp_cubic to usec clock resolution
for Hystart logic.
This makes Hystart more relevant for data-center flows.
Prior to this series, Hystart was not kicking in, or was
kicking in without good reason, since the 1ms clock was too coarse.
Last patch also fixes an issue with Hystart vs TCP pacing.
v2: removed a last-minute debug chunk from last patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 23 Dec 2019 20:27:54 +0000 (12:27 -0800)]
tcp_cubic: make Hystart aware of pacing
For years we disabled Hystart ACK train detection at Google
because it was fooled by TCP pacing.
ACK train detection uses a simple heuristic, detecting if
we receive ACK past half the RTT, to exit slow start before
hitting the bottleneck and experiencing massive drops.
But pacing by design might delay packets up to RTT/2,
so we need to tweak the Hystart logic to be aware of this
extra delay.
Tested:
Added a 100 usec delay at receiver.
Before:
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
9117
7057
9553
8300
7030
6849
9533
10126
6876
8473
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 1230 0.0
After :
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
9845
10103
10866
11096
11936
11487
11773
12188
11066
11894
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 6462 0.0
Disabling Hystart ACK Train detection gives similar numbers
echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
11173
10954
12455
10627
11578
11583
11222
10880
10665
11366
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 23 Dec 2019 20:27:53 +0000 (12:27 -0800)]
tcp_cubic: tweak Hystart detection for short RTT flows
After switching ca->delay_min to usec resolution, we exit
slow start prematurely for very low RTT flows, setting
snd_ssthresh to 20.
The reason is that delay_min is fed with RTT of small packet
trains. Then as cwnd is increased, TCP sends bigger TSO packets.
LRO/GRO aggregation and/or interrupt mitigation strategies
on receiver tend to inflate RTT samples.
Fix this by adding to delay_min the expected delay of
two TSO packets, given current pacing rate.
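The size of this correction is plain arithmetic: the time needed to send two TSO packets at the current pacing rate. A back-of-the-envelope sketch, with a hypothetical helper name and without any clamping or scaling the actual patch may apply:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time in microseconds to transmit two TSO packets of tso_bytes each at
 * rate_bytes_per_sec. Integer division rounds down.
 */
static uint64_t demo_two_tso_delay_us(uint64_t tso_bytes,
				      uint64_t rate_bytes_per_sec)
{
	assert(rate_bytes_per_sec > 0);

	return 2 * tso_bytes * 1000000ULL / rate_bytes_per_sec;
}
```

For 64KB packets paced at 1.25 GB/s (10 Gbit/s) this gives about 104 usec, which is of the same order as the RTT of a data-center flow and explains why ignoring it matters.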
Tested:
Sender uses pfifo_fast qdisc
Before :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
11348
11707
11562
11428
11773
11534
9878
11693
10597
10968
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 200 0.0
After :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
14877
14517
15797
18466
17376
14833
17558
17933
16039
18059
TcpExtTCPHystartTrainDetect 10 0.0
TcpExtTCPHystartTrainCwnd 1670 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 23 Dec 2019 20:27:52 +0000 (12:27 -0800)]
tcp_cubic: switch bictcp_clock() to usec resolution
Current 1ms clock feeds ca->round_start, ca->delay_min,
ca->last_ack.
This is quite problematic for data-center flows, where delay_min
is way below 1 ms.
This means Hystart Train detection triggers every time jiffies value
is updated, since "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
expression becomes true.
This kind of random behavior can be solved by reusing the existing
usec timestamp that TCP keeps in tp->tcp_mstamp
Note that a followup patch will tweak things a bit, because
during slow start, GRO aggregation on receivers naturally
increases the RTT as TSO packets gradually come to ~64KB size.
To recap, right after this patch CUBIC Hystart train detection
is more aggressive, since short RTT flows might exit slow start at
cwnd = 20, instead of being possibly unbounded.
Following patch will address this problem.
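The effect of clock granularity on the quoted expression can be checked in isolation. The sketch below evaluates only that comparison; the function name is hypothetical, delay_min is assumed to be an unsigned 32-bit value, and the units and scaling of the inputs are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Same comparison as "(s32)(now - round_start) > delay_min >> 4". With an
 * unsigned 32-bit delay_min (an assumption here), C converts the left
 * side back to unsigned, so the elapsed time is compared as unsigned.
 */
static bool demo_train_detected(uint32_t now, uint32_t round_start,
				uint32_t delay_min)
{
	uint32_t elapsed = now - round_start;

	return elapsed > (delay_min >> 4);
}

static void demo_train_example(void)
{
	/* Coarse clock: a delay_min below 16 units gives a zero threshold. */
	assert((15u >> 4) == 0);
	assert(demo_train_detected(1001, 1000, 15));	/* first tick fires */

	/* Finer clock, same expression: 800 >> 4 leaves 50 units of slack. */
	assert(!demo_train_detected(1010, 1000, 800));
	assert(demo_train_detected(1051, 1000, 800));
}
```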
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 23 Dec 2019 20:27:51 +0000 (12:27 -0800)]
tcp_cubic: remove one conditional from hystart_update()
If we initialize ca->curr_rtt to ~0U, we do not need to test
for zero value in hystart_update()
We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
been processed, and thus ca->curr_rtt will have a sane value.
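The saved conditional is the usual "start a running minimum at the largest value" trick. A small sketch with hypothetical function names, comparing both variants on the same samples:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: 0 means "no sample yet", so every update needs an extra test. */
static uint32_t demo_min_rtt_with_zero_check(uint32_t curr_rtt, uint32_t sample)
{
	if (curr_rtt == 0 || curr_rtt > sample)
		curr_rtt = sample;
	return curr_rtt;
}

/* After: the caller starts from ~0U, so a plain minimum is enough. */
static uint32_t demo_min_rtt(uint32_t curr_rtt, uint32_t sample)
{
	if (curr_rtt > sample)
		curr_rtt = sample;
	return curr_rtt;
}

static void demo_min_rtt_example(void)
{
	static const uint32_t samples[] = { 300, 500, 200 };
	uint32_t old_way = 0, new_way = ~0U;
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		old_way = demo_min_rtt_with_zero_check(old_way, samples[i]);
		new_way = demo_min_rtt(new_way, samples[i]);
		assert(old_way == new_way);	/* both track the same minimum */
	}
	assert(new_way == 200);
}
```

The only difference is the value seen before the first sample, which is why it matters that the field is read only after enough samples have been processed.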
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 23 Dec 2019 20:27:50 +0000 (12:27 -0800)]
tcp_cubic: optimize hystart_update()
We do not care which bit in ca->found is set.
We avoid accessing hystart and hystart_detect unless really needed,
possibly avoiding one cache line miss.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Dec 2019 22:20:10 +0000 (14:20 -0800)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-12-27
The following pull-request contains BPF updates for your *net-next* tree.
We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).
There are three merge conflicts. Conflicts and resolutions look as follows:
1) Merge conflict in net/bpf/test_run.c:
There was a tree-wide cleanup
c593642c8be0 ("treewide: Use sizeof_field() macro")
which gets in the way with
b590cb5f802d ("bpf: Switch to offsetofend in BPF_PROG_TEST_RUN"):
<<<<<<< HEAD
if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
sizeof_field(struct __sk_buff, priority),
=======
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
>>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c
There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:
<<<<<<< HEAD
if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
sizeof_field(struct __sk_buff, tstamp),
=======
if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
>>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c
Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc4096 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").
2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:
(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)
<<<<<<< HEAD
if (is_13b_check(off, insn))
return -1;
emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
=======
emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
>>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c
Result should look like:
emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
3) Merge conflict in arch/riscv/include/asm/pgtable.h:
<<<<<<< HEAD
=======
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
#define BPF_JIT_REGION_SIZE (SZ_128M)
#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
#define BPF_JIT_REGION_END (VMALLOC_END)
/*
* Roughly size the vmemmap space to be large enough to fit enough
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
#define VMEMMAP_SHIFT \
(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
#define vmemmap ((struct page *)VMEMMAP_START)
>>>>>>> 7c8dce4b166113743adad131b5a24c4acc12f92c
Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via
01f52e16b868 ("riscv: define vmemmap before pfn_to_page calls").
Result:
[...]
#define __S101 PAGE_READ_EXEC
#define __S110 PAGE_SHARED_EXEC
#define __S111 PAGE_SHARED_EXEC
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
#define BPF_JIT_REGION_SIZE (SZ_128M)
#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
#define BPF_JIT_REGION_END (VMALLOC_END)
/*
* Roughly size the vmemmap space to be large enough to fit enough
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
#define VMEMMAP_SHIFT \
(CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
[...]
Let me know if there are any other issues.
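For the first conflict (net/bpf/test_run.c), both sides of each chunk compute the same thing whenever they name the same field, which is why taking the offsetofend() side is safe. A standalone check, using local stand-ins for the two macros and a made-up struct demo_skb that only loosely resembles struct __sk_buff:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel macros of the same names. */
#define sizeof_field(TYPE, MEMBER)	sizeof(((TYPE *)0)->MEMBER)
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

/* Hypothetical layout for illustration only. */
struct demo_skb {
	uint32_t len;
	uint32_t priority;
	uint64_t tstamp;
	uint32_t gso_segs;
};

/* The end of one member is the start of the next when there is no padding. */
static_assert(offsetofend(struct demo_skb, priority) ==
	      offsetof(struct demo_skb, tstamp),
	      "priority is directly followed by tstamp");
```

The second chunk is different in kind: tstamp and gso_segs end at different offsets, so there the choice of field matters and not just the choice of macro.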
Anyway, the main changes are:
1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
to a provided BPF object file. This provides an alternative, simplified API
compared to standard libbpf interaction. Also, add libbpf extern variable
resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.
2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
add various BPF riscv JIT improvements, from Björn Töpel.
3) Extend bpftool to allow matching BPF programs and maps by name,
from Paul Chaignon.
4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
flag for allowing updates without service interruption, from Andrey Ignatov.
5) Cleanup and simplification of ring access functions for AF_XDP with a
bonus of 0-5% performance improvement, from Magnus Karlsson.
6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.
7) Move and extend test_select_reuseport into BPF program tests under
BPF selftests, from Jakub Sitnicki.
8) Various BPF sample improvements for xdpsock for customizing parameters
to set up and benchmark AF_XDP, from Jay Jayatheerthan.
9) Improve libbpf to provide a ulimit hint on permission denied errors.
Also change XDP sample programs to attach in driver mode by default,
from Toke Høiland-Jørgensen.
10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
programs, from Nikita V. Shirokov.
11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.
12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
libbpf conversion, from Jesper Dangaard Brouer.
13) Minor misc improvements from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 27 Dec 2019 21:21:06 +0000 (13:21 -0800)]
Merge tag 'drm-fixes-2019-12-28' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Post-xmas food coma recovery fixes. Only three fixes for i915 since I
expect most people are holidaying.
i915:
- power management rc6 fix
- framebuffer tracking fix
- display power management ratelimit fix"
* tag 'drm-fixes-2019-12-28' of git://anongit.freedesktop.org/drm/drm:
drm/i915: Hold reference to intel_frontbuffer as we track activity
drm/i915/gt: Ratelimit display power w/a
drm/i915/pmu: Ensure monotonic rc6
Linus Torvalds [Fri, 27 Dec 2019 19:30:26 +0000 (11:30 -0800)]
Merge tag 'linux-kselftest-5.5-rc4' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:
- rseq build failures fixes related to glibc 2.30 compatibility from
Mathieu Desnoyers
- Kunit fixes and cleanups from SeongJae Park
- Fixes to filesystems/epoll, firmware, and livepatch build failures
and skip handling.
* tag 'linux-kselftest-5.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
rseq/selftests: Clarify rseq_prepare_unload() helper requirements
rseq/selftests: Fix: Namespace gettid() for compatibility with glibc 2.30
rseq/selftests: Turn off timeout setting
kunit/kunit_tool_test: Test '--build_dir' option run
kunit: Rename 'kunitconfig' to '.kunitconfig'
kunit: Place 'test.log' under the 'build_dir'
kunit: Create default config in '--build_dir'
kunit: Remove duplicated defconfig creation
docs/kunit/start: Use in-tree 'kunit_defconfig'
selftests: livepatch: Fix it to do root uid check and skip
selftests: firmware: Fix it to do root uid check and skip
selftests: filesystems/epoll: fix build error
Linus Torvalds [Fri, 27 Dec 2019 19:26:54 +0000 (11:26 -0800)]
Merge tag 'pm-5.5-rc4' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Fix compile test of the Tegra devfreq driver (Arnd Bergmann) and
remove redundant Kconfig dependencies from multiple devfreq drivers
(Leonard Crestez)"
* tag 'pm-5.5-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / devfreq: tegra: Add COMMON_CLK dependency
PM / devfreq: Drop explicit selection of PM_OPP
Linus Torvalds [Fri, 27 Dec 2019 19:17:08 +0000 (11:17 -0800)]
Merge tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Removal of now unused busy wqe list (Hillf)
- Add cond_resched() to io-wq work processing (Hillf)
- And then the series that I hinted at from last week, which removes
the sqe from the io_kiocb and keeps all sqe handling on the prep
side. This guarantees that an opcode can't do the wrong thing and
read the sqe more than once. This is unchanged from last week, no
issues have been observed with this in testing. Hence I really think
we should fold this into 5.5.
* tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block:
io-wq: add cond_resched() to worker thread
io-wq: remove unused busy list from io_sqe
io_uring: pass in 'sqe' to the prep handlers
io_uring: standardize the prep methods
io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler
io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler
io_uring: move all prep state for IORING_OP_CONNECT to prep handler
io_uring: add and use struct io_rw for read/writes
io_uring: use u64_to_user_ptr() consistently
Linus Torvalds [Fri, 27 Dec 2019 19:13:18 +0000 (11:13 -0800)]
Merge tag 'libata-5.5-20191226' of git://git.kernel.dk/linux-block
Pull libata fixes from Jens Axboe:
"Two things in here:
- First half of a series that fixes ahci_brcm, also marked for
stable. The other part of the series is going into 5.6 (Florian)
- sata_nv regression fix that is also marked for stable (Sascha)"
* tag 'libata-5.5-20191226' of git://git.kernel.dk/linux-block:
ata: ahci_brcm: Add missing clock management during recovery
ata: ahci_brcm: BCM7425 AHCI requires AHCI_HFLAG_DELAY_ENGINE
ata: ahci_brcm: Fix AHCI resources management
ata: libahci_platform: Export again ahci_platform_<en/dis>able_phys()
libata: Fix retrieving of active qcs
Linus Torvalds [Fri, 27 Dec 2019 19:09:04 +0000 (11:09 -0800)]
Merge tag 'block-5.5-20191226' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Only thing here are the changes from Arnd from last week, which now
have the appropriate header include to ensure they actually compile if
COMPAT is enabled"
* tag 'block-5.5-20191226' of git://git.kernel.dk/linux-block:
compat_ioctl: block: handle Persistent Reservations
compat_ioctl: block: handle add zone open, close and finish ioctl
compat_ioctl: block: handle BLKGETZONESZ/BLKGETNRZONES
compat_ioctl: block: handle BLKREPORTZONE/BLKRESETZONE
pktcdvd: fix regression on 64-bit architectures
Linus Torvalds [Fri, 27 Dec 2019 19:02:48 +0000 (11:02 -0800)]
Merge tag 'gpio-v5.5-2' of git://git./linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"A set of fixes for the v5.5 series:
- Fix the build for the Xtensa driver.
- Make sure to set up the parent device for mpc8xxx.
- Clarify the look-up error message.
- Fix the usage of the line direction in the mockup device.
- Fix a type warning on the Aspeed driver.
- Remove the pointless __exit annotation on the xgs-iproc which is
causing a compilation problem.
- Fix up emulation of open drain outputs .get_direction()
- Fix the IRQ callbacks on the PCA953xx to use bitops and work
properly.
- Fix the Kconfig on the Tegra driver"
* tag 'gpio-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: tegra186: Allow building on Tegra194-only configurations
gpio: pca953x: Switch to bitops in IRQ callbacks
gpiolib: fix up emulated open drain outputs
MAINTAINERS: Append missed file to the database
gpio: xgs-iproc: remove __exit annotation for iproc_gpio_remove
gpio: aspeed: avoid return type warning
gpio: mockup: Fix usage of new GPIO_LINE_DIRECTION
gpio: Fix error message on out-of-range GPIO in lookup table
gpio: mpc8xxx: Add platform device to gpiochip->parent
gpio: xtensa: fix driver build
Andrii Nakryiko [Thu, 26 Dec 2019 21:02:53 +0000 (13:02 -0800)]
bpftool: Make skeleton C code compilable with C++ compiler
When auto-generated BPF skeleton C code is included from a C++ application, it
triggers a compilation error because void * is implicitly converted to the
target pointer type. This is allowed in C, but not in C++. To solve this
problem, add explicit casts where necessary.
To ensure issues like this are captured going forward, add skeleton usage in
test_cpp test.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191226210253.3132060-1-andriin@fb.com
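The void * problem described in this commit can be illustrated with a minimal
sketch. It is not the generated skeleton code: the toy_skeleton and toy_bss
types and the toy_skeleton_*() helpers are hypothetical names chosen for the
example. The point is only that an explicit cast on each void * result keeps
the same source valid for both a C and a C++ compiler.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical skeleton layout, for illustration only. */
struct toy_bss {
	int counter;
};

struct toy_skeleton {
	struct toy_bss *bss;
};

static struct toy_skeleton *toy_skeleton_open(void)
{
	/*
	 * calloc() returns void *. C converts it implicitly, C++ does not,
	 * so the explicit casts below are what make this compile as C++.
	 */
	struct toy_skeleton *s = (struct toy_skeleton *)calloc(1, sizeof(*s));

	if (!s)
		return NULL;

	s->bss = (struct toy_bss *)calloc(1, sizeof(*s->bss));
	if (!s->bss) {
		free(s);
		return NULL;
	}
	return s;
}

static int toy_skeleton_counter(const struct toy_skeleton *s)
{
	assert(s && s->bss);
	return s->bss->counter;
}

static void toy_skeleton_destroy(struct toy_skeleton *s)
{
	if (!s)
		return;
	free(s->bss);
	free(s);
}
```

Without the casts, a C compiler still accepts the assignments, while a C++
compiler rejects them as invalid conversions from void *.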
Dave Airlie [Fri, 27 Dec 2019 03:13:06 +0000 (13:13 +1000)]
Merge tag 'drm-intel-fixes-2019-12-23' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
i915 power and frontbuffer tracking fixes
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87r20vdlrs.fsf@intel.com
Eric Dumazet [Mon, 23 Dec 2019 19:13:24 +0000 (11:13 -0800)]
net_sched: sch_fq: properly set sk->sk_pacing_status
If fq_classify() recycles a struct fq_flow because
a socket structure has been reallocated, we do not
set sk->sk_pacing_status immediately, but later if the
flow becomes detached.
This means that any flow requiring pacing (BBR, or SO_MAX_PACING_RATE)
might fall back to TCP internal pacing, which requires a per-socket
high resolution timer, and therefore more CPU cycles.
Fixes: 218af599fa63 ("tcp: internal implementation for pacing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
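The idea behind this fix can be sketched in user space as follows. This is a
toy model, not the sch_fq code: toy_sock, toy_flow, the TOY_PACING_* values
and toy_classify() are hypothetical stand-ins for the kernel structures. It
only shows the pacing status being set at the moment a flow is recycled for a
reallocated socket, instead of later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's pacing states. */
enum toy_pacing {
	TOY_PACING_NONE,
	TOY_PACING_NEEDED,
	TOY_PACING_FQ,
};

struct toy_sock {
	uint32_t hash;          /* changes when the socket is reallocated */
	enum toy_pacing pacing; /* models sk->sk_pacing_status */
};

struct toy_flow {
	struct toy_sock *sk;
	uint32_t socket_hash;   /* hash seen when the flow was set up */
};

/* Returns true when the flow was recycled for a reallocated socket. */
static bool toy_classify(struct toy_flow *f, struct toy_sock *sk)
{
	assert(f && sk);

	if (f->sk == sk && f->socket_hash != sk->hash) {
		f->socket_hash = sk->hash;
		/* Mark the socket as paced by fq right away, not later. */
		sk->pacing = TOY_PACING_FQ;
		return true;
	}
	return false;
}
```

In this model, a socket left in TOY_PACING_NEEDED after recycling corresponds
to the case where the stack falls back to its own internal pacing.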
David S. Miller [Thu, 26 Dec 2019 23:27:15 +0000 (15:27 -0800)]
Merge branch 'bnx2x-Bug-fixes'
Manish Chopra says:
====================
bnx2x: Bug fixes
This series has changes in the area of vlan resources
management APIs to fix fw assert issue reported in max
vlan configuration testing over the PF.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Mon, 23 Dec 2019 18:23:09 +0000 (10:23 -0800)]
bnx2x: Fix accounting of vlan resources among the PFs
While testing max vlan configuration on the PF, the firmware asserts
because the driver configures more vlans than are supported per
port/engine. It turned out that there is an implicit vlan (a hidden
default vlan that consumes a hardware cam entry), which the adapter
configures by default for all the clients (PF/VFs) on the client_init
ramrod. So when allocating resources among the PFs, this implicit vlan
should be taken into account: the total number of vlan entries should be
reduced by one to accommodate the default/implicit vlan entry.
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
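The accounting described in this commit can be shown as simple arithmetic.
The following is a sketch under assumptions, not the driver code:
vlan_credit_per_pf() is a hypothetical helper, the even split among PFs is an
assumption of the example, and the numbers used with it are illustrative
rather than real hardware limits.

```c
#include <assert.h>

/* One hardware entry is consumed by the implicit/default vlan. */
#define IMPLICIT_VLAN_ENTRIES 1u

/*
 * Hypothetical helper: split the vlan entries of a port/engine evenly
 * among the PFs after reserving the implicit vlan entry.
 */
static unsigned int vlan_credit_per_pf(unsigned int total_entries,
				       unsigned int num_pfs)
{
	assert(num_pfs > 0);

	if (total_entries <= IMPLICIT_VLAN_ENTRIES)
		return 0;

	return (total_entries - IMPLICIT_VLAN_ENTRIES) / num_pfs;
}
```

With 256 entries and 4 PFs, a split that ignores the implicit vlan hands out
64 entries per PF, which together with the implicit entry needs 257 entries.
Reserving one entry first gives 63 per PF, which stays within the 256
available.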
Manish Chopra [Mon, 23 Dec 2019 18:23:08 +0000 (10:23 -0800)]
bnx2x: Use appropriate define for vlan credit
Although it has the same value as MAX_MAC_CREDIT_E2,
use MAX_VLAN_CREDIT_E2 for the vlan credit.
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 26 Dec 2019 23:25:04 +0000 (15:25 -0800)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2019-12-23
The following pull-request contains BPF updates for your *net* tree.
We've added 2 non-merge commits during the last 1 day(s) which contain
a total of 4 files changed, 34 insertions(+), 31 deletions(-).
The main changes are:
1) Fix libbpf build when building on a read-only filesystem with O=dir
option, from Namhyung Kim.
2) Fix a precision tracking bug for unknown scalars, from Daniel Borkmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 26 Dec 2019 23:23:50 +0000 (15:23 -0800)]
Merge branch 's390-qeth-next'
Julian Wiedmann says:
====================
s390/qeth: updates 2019-12-23
Please apply the following patch series for qeth to your net-next tree.
This reworks the RX code to use napi_gro_frags() when building non-linear
skbs, along with some consolidation and cleanups.
Happy holidays - and many thanks for all the effort & support over the past
year, to both Jakub and you. It's much appreciated.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Mon, 23 Dec 2019 14:22:27 +0000 (15:22 +0100)]
s390/qeth: remove QETH_RX_PULL_LEN
Since commit f677fcb9aeb6 ("s390/qeth: ensure linear access to packet headers"),
the CQ-specific skbs are allocated with a slightly bigger linear part
than necessary. Shrink it down to the maximum that's needed by
qeth_extract_skb().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Mon, 23 Dec 2019 14:22:26 +0000 (15:22 +0100)]
s390/qeth: use napi_gro_frags() for SG skbs
For non-linear packets, get the skb for attaching the page fragments
from napi_get_frags() so that it can be recycled during GRO.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>