Lior David [Fri, 20 Jan 2017 11:49:51 +0000 (13:49 +0200)]
wil6210: report association ID (AID) per station in debugfs
Add reporting of the association ID (AID) for each station
as part of the stations file in the debugfs.
Valid AID values are 1-254. 0 is reported if the AID
is unknown or not reported by firmware.
Signed-off-by: Lior David <qca_liord@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Lior David [Fri, 20 Jan 2017 11:49:50 +0000 (13:49 +0200)]
wil6210: align to latest auto generated wmi.h
Align to latest version of the auto generated wmi file
describing the interface with FW.
Signed-off-by: Lior David <qca_liord@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Lior David [Fri, 20 Jan 2017 11:49:50 +0000 (13:49 +0200)]
wil6210: fix for broadcast workaround in PBSS
Currently we do not have full support for broadcast from
a station inside a PBSS network.
We have a workaround where instead of broadcast we do a
unicast to every known station in the PBSS.
This workaround was performed only for P2P clients.
This fix will perform the broadcast workaround also for a
regular station inside a PBSS.
Signed-off-by: Lior David <qca_liord@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Hamad Kadmany [Fri, 20 Jan 2017 11:49:49 +0000 (13:49 +0200)]
wil6210: protect against false interrupt during reset sequence
During reset sequence it is seen that device is generating an
interrupt eventhough interrupts are masked at device level.
Add workaround to disable the interrupts from host side during
reset and clear any pending interrupts before re-enabling
the interrupt.
Signed-off-by: Hamad Kadmany <qca_hkadmany@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Lior David [Fri, 20 Jan 2017 11:49:48 +0000 (13:49 +0200)]
wil6210: missing reinit_completion in wmi_call
The code in wmi_call uses the wil->wmi_call completion
structure to wait for a reply.
In some scenarios, complete was called twice on the
completion structure. This happened mainly with a disconnect
event which can arrive both unsolicited and as a reply to
a disconnect request. In this case the completion structure
was left marked as "done" and the next wmi_call returned
immediately with a corrupted reply buffer. This caused
unexpected results including crashes.
Fix this by adding the missing call to reinit_completion.
Signed-off-by: Lior David <qca_liord@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Dedy Lansky [Fri, 20 Jan 2017 11:49:47 +0000 (13:49 +0200)]
wil6210: support new WMI-only FW capability
WMI_ONLY FW is used for testing in production. It cannot be used for
scan/connect, etc.
In case FW reports this capability, driver will not allow interface up.
Signed-off-by: Dedy Lansky <qca_dlansky@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Lazar Alexei [Fri, 20 Jan 2017 11:49:46 +0000 (13:49 +0200)]
wil6210: remove __func__ from debug printouts
__func__ is automatically added to printouts by dynamic debug
mechanism and by wil_info/wil_err macros.
Remove __func__ from debug printouts to avoid duplication.
Signed-off-by: Lazar Alexei <qca_ailizaro@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Lazar Alexei [Fri, 20 Jan 2017 11:49:45 +0000 (13:49 +0200)]
wil6210: support loading dedicated image for sparrow-plus devices
Driver may be used in platforms where some use sparrow cards while
other use sparrow-plus cards, where different FW image is needed.
Add the capability to load dedicated FW image in case sparrow-plus
card is detected and fallback to default image if such does not exist.
Signed-off-by: Lazar Alexei <qca_ailizaro@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Dedy Lansky [Fri, 20 Jan 2017 11:49:44 +0000 (13:49 +0200)]
wil6210: add disable_ap_sme module parameter
By default, AP SME is handled by driver/FW.
In case disable_ap_sme is true, driver doesn't turn-on
WIPHY_FLAG_HAVE_AP_SME and the responsibility for
AP SME is passed to user space.
With AP SME disabled, driver reports assoc request frame
to user space which is then responsible for sending assoc
response frame and for sending NL80211_CMD_NEW_STATION.
Driver also reports disassoc frame to user space
which should then send NL80211_CMD_DEL_STATION.
NL80211_CMD_SET_STATION with NL80211_STA_FLAG_AUTHORIZED
is used by user space to allow/disallow data transmit.
Signed-off-by: Dedy Lansky <qca_dlansky@qca.qualcomm.com>
Signed-off-by: Maya Erez <qca_merez@qca.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mohammed Shafi Shajakhan [Mon, 16 Jan 2017 15:31:08 +0000 (17:31 +0200)]
ath10k: dump Copy Engine registers during firmware crash
Dump Copy Engine source and destination ring addresses.
This is useful information to debug firmware crashes, assertes or hangs over long run
assessing the Copy Engine Register status. This also enables dumping CE
register status in debugfs Crash Dump file.
Screenshot:
ath10k_pci 0000:02:00.0: simulating hard firmware crash
ath10k_pci 0000:02:00.0: firmware crashed! (uuid
84901ff5-d33c-456e-93ee-
0165dea643cf)
ath10k_pci 0000:02:00.0: qca988x hw2.0 target 0x4100016c chip_id 0x043202ff sub 0000:0000
ath10k_pci 0000:02:00.0: kconfig debug 1 debugfs 1 tracing 1 dfs 1 testmode 1
ath10k_pci 0000:02:00.0: firmware ver 10.2.4.70.59-2 api 5 features no-p2p,raw-mode,mfp,allows-mesh-bcast crc32
4159f498
ath10k_pci 0000:02:00.0: board_file api 1 bmi_id N/A crc32
bebc7c08
ath10k_pci 0000:02:00.0: htt-ver 2.1 wmi-op 5 htt-op 2 cal otp max-sta 128 raw 0 hwcrypto 1
ath10k_pci 0000:02:00.0: firmware register dump:
ath10k_pci 0000:02:00.0: [00]: 0x4100016C 0x00000000 0x009A0F2A 0x00000000
ath10k_pci 0000:02:00.0: [04]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [08]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [12]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [16]: 0x00000000 0x00000000 0x00000000 0x009A0F2A
ath10k_pci 0000:02:00.0: [20]: 0x00000000 0x00401930 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [24]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [28]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [32]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [36]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [40]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [44]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [48]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [52]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: [56]: 0x00000000 0x00000000 0x00000000 0x00000000
ath10k_pci 0000:02:00.0: Copy Engine register dump:
ath10k_pci 0000:02:00.0: [00]: 0x00057400 7 7 3 3
ath10k_pci 0000:02:00.0: [01]: 0x00057800 18 18 85 86
ath10k_pci 0000:02:00.0: [02]: 0x00057c00 49 49 48 49
ath10k_pci 0000:02:00.0: [03]: 0x00058000 16 16 17 16
ath10k_pci 0000:02:00.0: [04]: 0x00058400 4 4 44 4
ath10k_pci 0000:02:00.0: [05]: 0x00058800 12 12 11 12
ath10k_pci 0000:02:00.0: [06]: 0x00058c00 3 3 3 3
ath10k_pci 0000:02:00.0: [07]: 0x00059000 0 0 0 0
ieee80211 phy0: Hardware restart was requested
ath10k_pci 0000:02:00.0: device successfully recovered
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
[kvalo@qca.qualcomm.com: simplify the implementation]
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mohammed Shafi Shajakhan [Fri, 13 Jan 2017 10:30:03 +0000 (16:00 +0530)]
ath10k: fix per station tx bit rate reporting
Not clearing the previous tx bit rate status
results in a ambigous tx bit rate reporting to
mac80211/cfg80211, for example the previous bit
rate status would have been marked as legacy rate
, while the current rate would have been an HT/VHT
rate with the tx bit rate flags set and this results
in exporting tx bitrate as legacy rate but with HT/VHT
rate flags set, fix this by clearing the tx bitrate
status for each event. This also fixes the below
warning when we do:
iw dev wlan#N station dump
WARNING: net/wireless/util.c:1222 cfg80211
[<
c022f104>] (warn_slowpath_null) from [<
bf3b9adc>]
(cfg80211_calculate_bitrate+0x110/0x1f4 [cfg80211])
[<
bf3b9adc>] (cfg80211_calculate_bitrate [cfg80211]) from
[<
bf3dcd54>] (nl80211_put_sta_rate+0x44/0x1dc [cfg80211])
[<
bf3dcd54>] (nl80211_put_sta_rate [cfg80211]) from
[<
bf3cbc34>] (nl80211_set_interface+0x724/0xd70 [cfg80211])
[<
bf3cbc34>] (nl80211_set_interface [cfg80211]) from
[<
bf3d0a18>] (nl80211_dump_station+0xdc/0x100 [cfg80211])
[<
bf3d0a18>] (nl80211_dump_station [cfg80211])
Fixes: cec17c382140 ("ath10k: add per peer htt tx stats support for 10.4")
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Michal Kazior [Thu, 12 Jan 2017 15:14:30 +0000 (16:14 +0100)]
ath10k: prevent sta pointer rcu violation
Station pointers are RCU protected so driver must
be extra careful if it tries to store them
internally for later use outside of the RCU
section it obtained it in.
It was possible for station teardown to race with
some htt events. The possible outcome could be a
use-after-free and a crash.
Only peer-flow-control capable firmware was
affected (so hardware-wise qca99x0 and qca4019).
This could be done in sta_state() itself via
explicit synchronize_net() call but there's
already a convenient sta_pre_rcu_remove() op that
can be hooked up to avoid extra rcu stall.
The peer->sta pointer itself can't be set to
NULL/ERR_PTR because it is later used in
sta_state() for extra sanity checks.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Wei Yongjun [Wed, 11 Jan 2017 16:31:30 +0000 (16:31 +0000)]
ath6kl: fix warning for using 0 as NULL
Fixes the following sparse warning:
drivers/net/wireless/ath/ath6kl/sdio.c:716:55: warning:
Using plain integer as NULL pointer
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mohammed Shafi Shajakhan [Thu, 12 Jan 2017 11:02:22 +0000 (13:02 +0200)]
ath10k: fix tx legacy rate reporting
Tx legacy rate is reported 10 fold, as below
iw dev wlan#N station dump | grep "tx bitrate"
tx bitrate: 240.0 MBit/s
This is because by mistake we multiply by the hardware reported
rate twice by 10, fix this.
Fixes: cec17c382140 ("ath10k: add per peer htt tx stats support for 10.4")
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mohammed Shafi Shajakhan [Thu, 12 Jan 2017 11:02:21 +0000 (13:02 +0200)]
ath10k: fix wifi connectivity and warning in rx with channel 169
In countries where basic operation of channel 169 is allowed,
this fixes the below WARN_ON_ONCE in Rx and fixes the station
connectivity failure in channel 169 as the packet is dropped
in the driver as the current check limits to channel 165. As of
now all the packets beyond channel 165 is dropped, fix this
by extending the range to channel 169.
Call trace:
drivers/net/wireless/ath/ath10k/wmi.c:1505
ath10k_wmi_event_mgmt_rx+0x278/0x440 [ath10k_core]()
Call Trace:
[<
c158f812>] ? printk+0x2d/0x2f
[<
c105a182>] warn_slowpath_common+0x72/0xa0
[<
f8b67b58>] ? ath10k_wmi_event_mgmt_rx+0x278/0x440
[<
f8b67b58>] ? ath10k_wmi_event_mgmt_rx+0x278/0x440
[<
c105a1d2>] warn_slowpath_null+0x22/0x30
[<
f8b67b58>] ath10k_wmi_event_mgmt_rx+0x278/0x440
[<
f8b0e72b>] ? ath10k_pci_sleep+0x8b/0xb0 [ath10k_pci]
[<
f8b6ac63>] ath10k_wmi_10_2_op_rx+0xf3/0x3b0
[<
f8b6495e>] ath10k_wmi_process_rx+0x1e/0x60
[<
f8b5f077>] ath10k_htc_rx_completion_handler+0x347/0x4d0 [ath10k_core]
[<
f8b11dc3>] ? ath10k_ce_completed_recv_next+0x53/0x70 [ath10k_pci]
[<
f8b0f921>] ath10k_pci_ce_recv_data+0x171/0x1d0 [ath10k_pci]
[<
f8b0ec69>] ? ath10k_pci_write32+0x39/0x80 [ath10k_pci]
[<
f8b120bc>] ath10k_ce_per_engine_service+0x5c/0xa0 [ath10k_pci]
[<
f8b1215f>] ath10k_ce_per_engine_service_any+0x5f/0x70 [ath10k_pci]
[<
c1060dc0>] ? local_bh_enable_ip+0x90/0x90
[<
f8b1048b>] ath10k_pci_tasklet+0x1b/0x50 [ath10k_pci]
Fixes: 34c30b0a5e97 ("ath10k: enable advertising support for channel 169, 5Ghz")
Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Christian Lamparter [Thu, 12 Jan 2017 11:02:23 +0000 (13:02 +0200)]
ath9k: move RELAY and DEBUG_FS to ATH9K[_HTC]_DEBUGFS
Currently, the common ath9k_common module needs to have a
dependency on RELAY and DEBUG_FS in order to built. This
is usually not a problem. But for RAM and FLASH starved
AR71XX devices, every little bit counts.
This patch adds a new symbol CONFIG_ATH9K_COMMON_DEBUG
which makes it possible to drop the RELAY and DEBUG_FS
dependency there and move it to ATH_(HTC)_DEBUGFS.
Note: The shared FFT/spectral code (which is the only user
of the relayfs in ath9k*) needs DEBUG_FS to export the relayfs
interface to dump the data to userspace. So it makes no sense
to have the functions compiled in, if DEBUG_FS is not there.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Christian Lamparter [Mon, 5 Dec 2016 21:52:45 +0000 (22:52 +0100)]
ath10k: add accounting for the extended peer statistics
The 10.4 firmware adds extended peer information to the
firmware's statistics payload. This additional info is
stored as a separate data field and the elements are
stored in their own "peers_extd" list.
These elements can pile up in the same way as the peer
information elements. This is because the
ath10k_wmi_10_4_op_pull_fw_stats() function tries to
pull the same amount (num_peer_stats) for every statistic
data unit.
Fixes: 4a49ae94a448faa ("ath10k: fix 10.4 extended peer stats update")
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Sebastian Gottschall [Thu, 12 Jan 2017 11:02:12 +0000 (13:02 +0200)]
ath10k: add VHT160 support
This patch adds full VHT160 support for QCA9984 chipsets Tested on Netgear
R7800. 80+80 is possible, but disabled so far since it seems to contain
glitches like missing vht station flags (this may be firmware or mac80211
related).
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
[kvalo@qca.qualcomm.com: refactoring and fix few warnings]
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Kalle Valo [Thu, 12 Jan 2017 11:02:11 +0000 (13:02 +0200)]
ath10k: refactor ath10k_peer_assoc_h_phymode()
When adding VHT160 support to ath10k_peer_assoc_h_phymode() the VHT mode
selection code becomes too complex. Simplify it by refactoring the vht part to
a separate function.
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Colin Ian King [Wed, 11 Jan 2017 14:32:16 +0000 (16:32 +0200)]
ath9k: fix spelling mistake: "meaurement" -> "measurement"
Trivial fix to spelling mistake in ath_err message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mark Rutland [Wed, 11 Jan 2017 14:32:15 +0000 (16:32 +0200)]
ath9k: ar9003_mac: kill off ACCESS_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some new features (e.g. KTSAN / Kernel Thread Sanitizer),
it is necessary to instrument reads and writes separately, which is not
possible with ACCESS_ONCE(). This distinction is critical to correct
operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, for some files (including the ath9k ar9003 mac
driver), this mangles the formatting. As a preparatory step, this patch
converts the driver to use {READ,WRITE}_ONCE() without said mangling.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: ath9k-devel@qca.qualcomm.com
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Cc: ath9k-devel@lists.ath9k.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Mark Rutland [Wed, 11 Jan 2017 14:32:14 +0000 (16:32 +0200)]
ath9k: ar9002_mac: kill off ACCESS_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some new features (e.g. KTSAN / Kernel Thread Sanitizer),
it is necessary to instrument reads and writes separately, which is not
possible with ACCESS_ONCE(). This distinction is critical to correct
operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, for some files (including the ath9k ar9002 mac
driver), this mangles the formatting. As a preparatory step, this patch
converts the driver to use {READ,WRITE}_ONCE() without said mangling.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: ath9k-devel@qca.qualcomm.com
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Cc: ath9k-devel@lists.ath9k.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Felix Fietkau [Wed, 11 Jan 2017 14:32:13 +0000 (16:32 +0200)]
ath5k: drop bogus warning on drv_set_key with unsupported cipher
Simply return -EOPNOTSUPP instead.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bhumika Goyal [Wed, 11 Jan 2017 14:32:12 +0000 (16:32 +0200)]
wil6210: constify cfg80211_ops structures
cfg80211_ops structures are only passed as an argument to the function
wiphy_new. This argument is of type const, so cfg80211_ops strutures
having this property can be declared as const.
Done using Coccinelle
@r1 disable optional_qualifier @
identifier i;
position p;
@@
static struct cfg80211_ops i@p = {...};
@ok1@
identifier r1.i;
position p;
@@
wiphy_new(&i@p,...)
@bad@
position p!={r1.p,ok1.p};
identifier r1.i;
@@
i@p
@depends on !bad disable optional_qualifier@
identifier r1.i;
@@
+const
struct cfg80211_ops i;
File size before:
text data bss dec hex filename
18133 6632 0 24765 60bd wireless/ath/wil6210/cfg80211.o
File size after:
text data bss dec hex filename
18933 5832 0 24765 60bd wireless/ath/wil6210/cfg80211.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Erik Stromdahl [Wed, 11 Jan 2017 14:32:10 +0000 (16:32 +0200)]
ath10k: htc: simplified credit distribution
Simplified transmit credit distribution code somewhat.
Since the WMI control service will get assigned all credits
there is no need for having a credit_allocation array in
struct ath10k_htc.
Signed-off-by: Erik Stromdahl <erik.stromdahl@gmail.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Erik Stromdahl [Wed, 11 Jan 2017 14:32:09 +0000 (16:32 +0200)]
ath10k: htc: removal of unused struct members
Removed tx_credits_per_max_message and tx_credit_size
from struct ath10k_htc_ep since they are not used
anywhere in the code.
They are just written, never read.
Signed-off-by: Erik Stromdahl <erik.stromdahl@gmail.com>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bjorn Andersson [Wed, 11 Jan 2017 14:32:21 +0000 (16:32 +0200)]
wcn36xx: Don't use the destroyed hal_mutex
ieee80211_unregister_hw() might invoke operations to stop the interface,
that uses the hal_mutex. So don't destroy it until after we're done
using it.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bjorn Andersson [Wed, 11 Jan 2017 14:32:20 +0000 (16:32 +0200)]
wcn36xx: Implement print_reg indication
Some firmware versions sends a "print register indication", handle this
by printing out the content.
Cc: Nicolas Dechesne <ndec@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bjorn Andersson [Wed, 11 Jan 2017 14:32:19 +0000 (16:32 +0200)]
wcn36xx: Implement firmware assisted scan
Using the software based channel scan mechanism from mac80211 keeps us
offline for 10-15 second, we should instead issue a start_scan/end_scan
on each channel reducing this time.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bjorn Andersson [Wed, 11 Jan 2017 14:32:18 +0000 (16:32 +0200)]
wcn36xx: Transition driver to SMD client
The wcn36xx wifi driver follows the life cycle of the WLAN_CTRL SMD
channel, as such it should be a SMD client. This patch makes this
transition, now that we have the necessary frameworks available.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Bjorn Andersson [Wed, 11 Jan 2017 14:32:17 +0000 (16:32 +0200)]
soc: qcom: smem_state: Fix include for ERR_PTR()
The correct include file for getting errno constants and ERR_PTR() is
linux/err.h, rather than linux/errno.h, so fix the include.
Fixes: e8b123e60084 ("soc: qcom: smem_state: Add stubs for disabled smem_state")
Acked-by: Andy Gross <andy.gross@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Sowmini Varadhan [Tue, 10 Jan 2017 15:47:15 +0000 (07:47 -0800)]
packet: pdiag_put_ring() should return TX_RING info for TPACKET_V3
Commit
7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3")
now makes it possible to use TX_RING with TPACKET_V3, so make the
the relevant information available via 'ss -e -a --packet'
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Tue, 10 Jan 2017 14:02:16 +0000 (15:02 +0100)]
bpf: Make unnecessarily global functions static
Make the functions __local_list_pop_free(), __local_list_pop_pending(),
bpf_common_lru_populate() and bpf_percpu_lru_populate() static as they
are not used outide of bpf_lru_list.c
This fixes the following GCC warnings when building with 'W=1':
kernel/bpf/bpf_lru_list.c:363:22: warning: no previous prototype for ‘__local_list_pop_free’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:376:22: warning: no previous prototype for ‘__local_list_pop_pending’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:560:6: warning: no previous prototype for ‘bpf_common_lru_populate’ [-Wmissing-prototypes]
kernel/bpf/bpf_lru_list.c:577:6: warning: no previous prototype for ‘bpf_percpu_lru_populate’ [-Wmissing-prototypes]
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Tue, 10 Jan 2017 14:02:07 +0000 (15:02 +0100)]
bpf: Remove unused but set variable in __bpf_lru_list_shrink_inactive()
Remove the unused but set variable 'first_node' in
__bpf_lru_list_shrink_inactive() to fix the following GCC warning when
building with 'W=1':
kernel/bpf/bpf_lru_list.c:216:41: warning: variable ‘first_node’ set but not used [-Wunused-but-set-variable]
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Mon, 9 Jan 2017 23:05:54 +0000 (15:05 -0800)]
ipvlan: improvise dev_id generation logic in IPvlan
The patch
009146d117b ("ipvlan: assign unique dev-id for each slave
device.") used ida_simple_get() to generate dev_ids assigned to the
slave devices. However (Eric has pointed out that) there is a shortcoming
with that approach as it always uses the first available ID. This
becomes a problem when a slave gets deleted and a new slave gets added.
The ID gets reassigned causing the new slave to get the same link-local
address. This side-effect is undesirable.
This patch adds a per-port variable that keeps track of the IDs
assigned and used as the stat-base for the IDR api. This base will be
wrapped around when it reaches the MAX (0xFFFE) value possibly on a
busy system where slaves are added and deleted routinely.
Fixes: 009146d117b ("ipvlan: assign unique dev-id for each slave device.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: David Miller <davem@davemloft.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prasad Kanneganti [Mon, 9 Jan 2017 22:42:40 +0000 (14:42 -0800)]
liquidio: store the L4 hash of rx packets in skb
Store the L4 hash of received packets in the skb; the hash is computed in
the NIC firmware.
Signed-off-by: Prasad Kanneganti <prasad.kanneganti@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 10 Jan 2017 19:16:18 +0000 (14:16 -0500)]
Merge branch 'sfc-physical-port-ids'
Edward Cree says:
====================
sfc: physical port ids
This series brings our handling of ndo_get_phys_port_id and related
interfaces into line with the behaviour of other drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Bert Kenward [Tue, 10 Jan 2017 16:24:31 +0000 (16:24 +0000)]
sfc: stop setting dev_port
Setting dev_port changes the device names allocated by systemd. Any devices
with a dev_port >0 will (in default distro configurations) have a suffix of
"d<port-number>" appended.
This is not something done by other drivers, and causes confusion for users.
Fixes: 8be41320f346 ("sfc: Add code to export port_num in netdev->dev_port")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bert Kenward [Tue, 10 Jan 2017 16:23:56 +0000 (16:23 +0000)]
sfc: implement ndo_get_phys_port_name
Output is of the form p<port-number>.
Note that the port numbers don't necessarily map one-to-one to physical
cages, partly because of 4x10G port modes on QSFP+ and partly because
of hw/fw implementation details.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bert Kenward [Tue, 10 Jan 2017 16:23:33 +0000 (16:23 +0000)]
sfc: support ndo_get_phys_port_id even when !CONFIG_SFC_SRIOV
There's no good reason why this should be an SRIOV-only thing.
Thus, also move it out of SRIOV-specific files.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timur Tabi [Mon, 9 Jan 2017 18:03:12 +0000 (12:03 -0600)]
net: qcom/emac: add ethtool support
Add support for some ethtool methods: get/set link settings, get/set
message level, get statistics, get link status, get ring params, get
pause params, and restart autonegotiation.
The code to collect the hardware statistics is moved into its own
function so that it can be used by "get statistics" method.
Signed-off-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 9 Jan 2017 21:49:26 +0000 (16:49 -0500)]
net: dsa: select NET_SWITCHDEV
The support for DSA Ethernet switch chips depends on TCP/IP networking,
thus explicit that HAVE_NET_DSA depends on INET.
DSA uses SWITCHDEV, thus select it instead of depending on it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 22:09:31 +0000 (17:09 -0500)]
Merge tag 'mlx5-4kuar-for-4.11' of git://git./linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5 4K UAR
The following series of patches optimizes the usage of the UAR area which is
contained within the BAR 0-1. Previous versions of the firmware and the driver
assumed each system page contains a single UAR. This patch set will query the
firmware for a new capability that if published, means that the firmware can
support UARs of fixed 4K regardless of system page size. In the case of
powerpc, where page size equals 64KB, this means we can utilize 16 UARs per
system page. Since user space processes by default consume eight UARs per
context this means that with this change a process will need a single system
page to fulfill that requirement and in fact make use of more UARs which is
better in terms of performance.
In addition to optimizing user-space processes, we introduce an allocator
that can be used by kernel consumers to allocate blue flame registers
(which are areas within a UAR that are used to write doorbells). This provides
further optimization on using the UAR area since the Ethernet driver makes
use of a single blue flame register per system page and now it will use two
blue flame registers per 4K.
The series also makes changes to naming conventions and now the terms used in
the driver code match the terms used in the PRM (programmers reference manual).
Thus, what used to be called UUAR (micro UAR) is now called BFREG (blue flame
register).
In order to support compatibility between different versions of
library/driver/firmware, the library has now means to notify the kernel driver
that it supports the new scheme and the kernel can notify the library if it
supports this extension. So mixed versions of libraries can run concurrently
without any issues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 9 Jan 2017 18:29:27 +0000 (10:29 -0800)]
tcp: make TCP_INFO more consistent
tcp_get_info() has to lock the socket, so lets lock it
for an extended critical section, so that various fields
have consistent values.
This solves an annoying issue that some applications
reported when multiple counters are updated during one
particular rx/rx event, and TCP_INFO was called from
another cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 21:56:28 +0000 (16:56 -0500)]
Merge branch 'bpf-verifier-improvements'
Alexei Starovoitov says:
====================
bpf: verifier improvements
A number of bpf verifier improvements from Gianluca.
See individual patches for details.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Mon, 9 Jan 2017 18:19:50 +0000 (10:19 -0800)]
bpf: rename ARG_PTR_TO_STACK
since ARG_PTR_TO_STACK is no longer just pointer to stack
rename it to ARG_PTR_TO_MEM and adjust comment.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gianluca Borello [Mon, 9 Jan 2017 18:19:49 +0000 (10:19 -0800)]
bpf: allow helpers access to variable memory
Currently, helpers that read and write from/to the stack can do so using
a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE.
ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so
that the verifier can safely check the memory access. However, requiring
the argument to be a constant can be limiting in some circumstances.
Since the current logic keeps track of the minimum and maximum value of
a register throughout the simulated execution, ARG_CONST_STACK_SIZE can
be changed to also accept an UNKNOWN_VALUE register in case its
boundaries have been set and the range doesn't cause invalid memory
accesses.
One common situation when this is useful:
int len;
char buf[BUFSIZE]; /* BUFSIZE is 128 */
if (some_condition)
len = 42;
else
len = 84;
some_helper(..., buf, len & (BUFSIZE - 1));
The compiler can often decide to assign the constant values 42 or 48
into a variable on the stack, instead of keeping it in a register. When
the variable is then read back from stack into the register in order to
be passed to the helper, the verifier will not be able to recognize the
register as constant (the verifier is not currently tracking all
constant writes into memory), and the program won't be valid.
However, by allowing the helper to accept an UNKNOWN_VALUE register,
this program will work because the bitwise AND operation will set the
range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE),
so the verifier can guarantee the helper call will be safe (assuming the
argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more
check against 0 would be needed). Custom ranges can be set not only with
ALU operations, but also by explicitly comparing the UNKNOWN_VALUE
register with constants.
Another very common example happens when intercepting system call
arguments and accessing user-provided data of variable size using
bpf_probe_read(). One can load at runtime the user-provided length in an
UNKNOWN_VALUE register, and then read that exact amount of data up to a
compile-time determined limit in order to fit into the proper local
storage allocated on the stack, without having to guess a suboptimal
access size at compile time.
Also, in case the helpers accepting the UNKNOWN_VALUE register operate
in raw mode, disable the raw mode so that the program is required to
initialize all memory, since there is no guarantee the helper will fill
it completely, leaving possibilities for data leak (just relevant when
the memory used by the helper is the stack, not when using a pointer to
map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will
be treated as ARG_PTR_TO_STACK.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gianluca Borello [Mon, 9 Jan 2017 18:19:48 +0000 (10:19 -0800)]
bpf: allow adjusted map element values to spill
commit
484611357c19 ("bpf: allow access into map value arrays")
introduces the ability to do pointer math inside a map element value via
the PTR_TO_MAP_VALUE_ADJ register type.
The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ
is spilled into the stack, limiting several use cases, especially when
generating bpf code from a compiler.
Handle this case by explicitly enabling the register type
PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and
max_value are reset just for BPF_LDX operations that don't result in a
restore of a spilled register from stack.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gianluca Borello [Mon, 9 Jan 2017 18:19:47 +0000 (10:19 -0800)]
bpf: allow helpers access to map element values
Enable helpers to directly access a map element value by passing a
register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper
arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK.
This enables several use cases. For example, a typical tracing program
might want to capture pathnames passed to sys_open() with:
struct trace_data {
char pathname[PATHLEN];
};
SEC("kprobe/sys_open")
void bpf_sys_open(struct pt_regs *ctx)
{
struct trace_data data;
bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di);
/* consume data.pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
Such a program could easily hit the stack limit in case PATHLEN needs to
be large or more local variables need to exist, both of which are quite
common scenarios. Allowing direct helper access to map element values,
one could do:
struct bpf_map_def SEC("maps") scratch_map = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(struct trace_data),
.max_entries = 1,
};
SEC("kprobe/sys_open")
int bpf_sys_open(struct pt_regs *ctx)
{
int id = 0;
struct trace_data *p = bpf_map_lookup_elem(&scratch_map, &id);
if (!p)
return;
bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di);
/* consume p->pathname, for example via
* bpf_trace_printk() or bpf_perf_event_output()
*/
}
And wouldn't risk exhausting the stack.
Code changes are loosely modeled after commit
6841de8b0d03 ("bpf: allow
helpers access the packet directly"). Unlike with PTR_TO_PACKET, these
changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not
ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be
trivial, but since there is not currently a use case for that, it's
reasonable to limit the set of changes.
Also, add new tests to make sure accesses to map element values from
helpers never go out of boundary, even when adjusted.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gianluca Borello [Mon, 9 Jan 2017 18:19:46 +0000 (10:19 -0800)]
bpf: split check_mem_access logic for map values
Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from
check_mem_access() to a separate helper check_map_access_adj(). This
enables to use those checks in other parts of the verifier as well,
where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for
example when checking helper function arguments. The same thing is
already happening for other types such as PTR_TO_PACKET and its
check_packet_access() helper.
The code has been copied verbatim, with the only difference of removing
the "off += reg->max_value" statement and moving the sum into the call
statement to check_map_access(), as that was only needed due to the
earlier common check_map_access() call.
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 21:07:41 +0000 (16:07 -0500)]
Merge branch 'net-smc'
Ursula Braun says:
====================
net/smc: Shared Memory Communications - RDMA
here is now V4 of the SMC-R patches having processed your feedback from end
of November. The most important change is the replacement of sysfs by a
generic netlink solution in patch 04. And I tried to get rid of the __packed
attributes. There are still a few usages left due to SMC-R protocol defined
structures.
V4 changes:
The order of patches 03 and 04 for pnet table management and SMC IB-client
establishing has been exchanged, since pnet table management is now built on
top of smc_ib_devices.
Patch 01: Use EXPORT_SYMBOL_GPL().
Patch 02: Define "use_fallback" as bool.
Get rid of useless smc_sock fields clearing in smc_sock_alloc(),
since sk_alloc() clears out the memory.
Patch 03: Postpone smc_ib_remember_port_attr() call till ib_device is
mentioned in the pnet table.
Patch 04: Replace sysfs-usage by a generic netlink approach for pnet table
configuration.
Change layout of pnet table entries to reference net_device and
ib_device instead of dealing with names of net_devices and
ib_devices.
Patch 05: Adapt "use_fallback" usages to new type bool.
Get rid of useless smc_sock fields clearing in smc_sock_alloc()
Avoid __packed where possible.
Check if clc responses are not too big.
Patch 09: Postpone smc_setup_per_ibdev till the first connection with this
ib_device is really created.
Patch 11: Get rid of __packed usage.
V3 changes:
Patch 05: Remove unneeded DEFINE_WAIT
Patch 06: Improve synchronization of link group creation
Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent
Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE,
use new default local_dma_lkey from protection domain as lkey
instead.
Remove no longer needed function smc_ib_dereg_memory_region().
Patch 14: Switch to state ACTIVE only if still in state INIT.
Return 0 for recvmsg invoked in a socket closing state.
Allow getname call in state APPCLOSEWAIT1
Do not trigger destruction of a socket-in-error queued in accept
queue.
During cleanup of accept queue, make sure sockets are destructed,
and sockets in fallback mode are handled appropriately.
When freeing sndbufs/rmbs, remove them from their list and free
the entry.
Use add_wait_queue() and remove_wait_queue() in close wait
functions.
If actively closing a socket in state for PEERFINCLOSEWAIT, keep
this state.
If passively closing a socket while bytes are to be received, move
to state APPCLOSEWAIT1.
If actively aborting a socket, skip sending the close_abort flag,
since RDMA communication is no longer possible.
When terminating a link group, do not schedule link group freeing a
2nd time, since already done when unregistering the last remaining
connection.
Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets.
This replaces the old patch 0015 dealing with procfs.
V2 changes:
Patch 0002: Add SMC versions for family key strings in net/core/sock.c.
Patch 0006: initialize rb_tree.
Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and
smc_rmb_unuse().
Patch 0008: Correct error checking logic for ib_function calls.
Define struct smc_link field wr_tx_id as atomic_long_t.
Use "do_div" instead of "%" to be architecture-independent.
Patch 0009: Correct error checking logic for ib_function calls.
Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor
overlays on 64-bit architectures. If not available, use plain u64
and add locking for cursor reading and writing.
Implement smc_curs_add() without modulo operator "%".
Patch 0012: Remove xchg() calls in cursor handling.
Implement smc_tx_rdma_writes() without module operator "%".
Patch 0013: Remove xchg() calls in cursor handling.
Patch 0014: Return type bool in smc_wr_tx_has_pending().
Remove unneeded semicolon in smc_close_shutdown_write().
Call smc_close_active() in non-fallback case only.
Get rid of duplicate schedule of sock_put_work().
Take nested sock_lock in smc_listen_work().
Start close stream_wait in case of prepared sends only.
Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show().
Take lock before list_empty check in smc_proc_sock_list_del().
These patches are the initial part of the implementation of the
"Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
RFC7609 [1]. While SMC-R does not aim to replace TCP,
it taps a wealth of existing data center TCP socket applications
to become more efficient without the need for rewriting them.
SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
For instance, when running 10 parallel connections with uperf, we measured
a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
(with throughput and latency comparable;
measured on x86_64 with the same RoCE card and port).
SMC-R does not require an RDMA communication manager (RDMA CM).
SMC-R inherits TCP qualities such as reliable connections, host-based
firewall packet filtering (on connection establishment) and unmodified
application of communication encryption such as TLS (transport layer
security) or SSL (secure sockets layer). Since original TCP is used to
establish SMC-R connections, load balancers and packet inspection based
on TCP/IP connection establishment continue to work for SMC-R.
On the other hand, using SMC-R implies:
- either involving a preload library when invoking the unchanged TCP-application
or slightly modifying the source by simply changing the socket family in
the socket() call
- accepting extra overhead and latency in connection establishment due to
SMC Connection Layer Control (CLC) handshake
- explicit coupling of RoCE ports with Ethernet ports
- not routable as currently built on RoCE V1
- bypassing of packet-based networking features
- filtering (netfilter)
- sniffing (libpcap, packet sockets, (E)BPF)
- traffic control (scheduling, shaping)
- bypassing of IP-header based socket options
- bypassing of memory buffer (pressure) management
- unusable together with IPsec
Overview of the SMC-R Protocol described in informational RFC 7609
SMC-R is an open protocol that provides RDMA capabilities over RoCE
transparently for applications exploiting TCP sockets.
A new socket protocol family PF_SMC is introduced.
There are no changes required to applications using the sockets API for TCP
stream sockets other than the specification of the new socket family AF_SMC.
Unmodified applications can be used by means of a dynamic preload shared
library which rewrites the socket API call
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP).
SMC-R re-uses the address family AF_INET for all addressing purposes around
struct sockaddr.
SMC-R system architecture layers:
+=============================================================================+
| | unmodified TCP application |
| native SMC application +--------------------------------------+
| | dynamic preload shared library |
+=============================================================================+
| SMC socket |
+-----------------------------------------------------------------------------+
| | TCP socket (for connection establishment and fallback) |
| IB verbs +--------------------------------------------------------+
| | IP |
+--------------------+--------------------------------------------------------+
| RoCE device driver | some network device driver |
+=============================================================================+
Terms:
A link group is determined by an ordered peer pair of TCP client and TCP server
(IP addresses and subnet). Reversed client server roles cause an own link group.
A link is a logical point-to-point connection based on an
infiniband reliable connected queue pair (RC-QP) between two RoCE ports
(MACs and GIDs) of a peer pair.
A link group can have 1..8 links for failover and load balancing.
This initial Linux implementation always has 1 link per link group.
Each link group on a peer can have 1..255 remote memory buffers (RMBs).
If more RMBs are needed, a peer can open another link group
(this initial Linux implementation) or fall back to TCP.
Each RMB has its own particular size and its own (R)DMA mapping and credentials
(rtoken consisting of rkey and RDMA "virtual address").
This initial Linux implementation uses physically contiguous memory for RMBs
but we are working towards scattered memory because of memory fragmentation.
Each RMB has 1..255 RMB elements (RMBEs) of equal size
to provide multiplexing of connections within an RMB.
An RMBE is the RDMA Write destination organized as wrapping ring buffer
for data transmit of a particular connection in one direction
(duplex by means of mirror symmetry as with TCP).
This initial Linux implementation always has 1 RMBE per RMB
and thus an individual RMB for each connection.
SMC-R connection establishment with subsequent data transfer:
CLIENT SERVER
TCP three-way handshake:
regular TCP SYN
-------------------------------------------------------->
regular TCP SYN ACK
<--------------------------------------------------------
regular TCP ACK
-------------------------------------------------------->
SMC Connection Layer Control (CLC) handshake
exchanges RDMA credentials between peers:
via above TCP connection: SMC CLC Proposal
-------------------------------------------------------->
via above TCP connection: SMC CLC Accept
<--------------------------------------------------------
via above TCP connection: SMC CLC Confirm
-------------------------------------------------------->
SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
RoCE RC-QP: SMC LLC Confirm Link
<========================================================
RoCE RC-QP: SMC LLC Confirm Link response
========================================================>
SMC data transmission (incl. SMC Connection Data Control (CDC) message):
RoCE RC-QP: RDMA Write
========================================================>
RoCE RC-QP: SMC CDC message (flow control)
========================================================>
...
RoCE RC-QP: RDMA Write
<========================================================
RoCE RC-QP: SMC CDC message (flow control)
<========================================================
...
Data flow within an established connection:
+----------------------------------------------------------------------------
| SENDER
| sendmsg()
| |
| | produces into sndbuf [sender's process context]
| v
| +--------+
| | sndbuf | [ring buffer]
| +--------+
| |
| | consumes from sndbuf and produces into receiver's RMBE [any context]
| | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
| |
+----|-----------------------------------------------------------------------
|
+----|-----------------------------------------------------------------------
| v RECEIVER
| +------+
| | RMBE | [ring buffer, can have size different from sender's sndbuf]
| | | [RMBE represents rcvbuf, no further de-coupling as on sender side]
| +------+
| |
| | consumes from RMBE [receiver's process context]
| v
| recvmsg()
+----------------------------------------------------------------------------
Flow control ("cursor" updates) by means of SMC CDC messages:
SENDER RECEIVER
sends updates via CDC-------------+ sends updates via CDC
on consuming from sndbuf | on consuming from RMBE
and producing into RMBE | by means of recvmsg()
| |
| |
+-----------------------------------|------------+
| |
+--v-------------------------+ +--v-----------------------+
| receiver's consumer cursor | | sender's producer cursor----+
+----------------|-----------+ +--------------------------+ |
| |
| receiver's RMBE |
| +--------------------------+ |
| | | |
+--------------------------------+ | |
| | | |
| v | |
| +------------| |
|-------------+////////////| |
|//RDMA data written by////| |
|////sender that is////////| |
|/available to be consumed/| |
|///////// +---------------| |
|----------+^ | |
| | | |
| +-----------------+
| |
+--------------------------+
Sending updates of the producer cursor is immediate for low latency;
something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
currently not part of this initial Linux implementation.
Sending updates of the consumer cursor is conditional to avoid the
silly window syndrome.
Normal connection termination:
Normal connection termination starts transitioning from socket state
ACTIVE via either "Active Close" or "Passive Close".
shutdown rdwr +-----------------+
or close, +-------------->| INIT / CLOSED |<-------------+
send PeerCon|nClosed +-----------------+ | PeerConnClosed
| | | received
| connection | established |
| V |
+----------------+ +-----------------+ +----------------+
|AppFinCloseWait | | ACTIVE | |PeerFinCloseWait|
+----------------+ +-----------------+ +----------------+
| | | |
| Active Close: | |Passive Close: |
| close or | |PeerConnClosed or |
| shutdown wr or| |PeerDoneWriting |
| shutdown rdwr | |received |
| V V |
PeerConnClo|sed +--------------+ +-------------+ | close or
received +--<----|PeerCloseWait1| |AppCloseWait1|--->----+ shutdown rdwr,
| +--------------+ +-------------+ | send
| PeerDoneWri|ting | shutdown wr, | PeerConnClosed
| received | send Pee|rDoneWriting |
| V V |
| +--------------+ +-------------+ |
+--<----|PeerCloseWait2| |AppCloseWait2|--->----+
+--------------+ +-------------+
In state CLOSED, the socket can be destructed only, once the application has
issued a close().
Abnormal connection termination:
+-----------------+
+-------------->| INIT / CLOSED |<-------------+
| +-----------------+ |
| |
| +-----------------------+ |
| | Any state | |
PeerConnAbo|rt | (before setting | | send
received | | PeerConnClosed | | PeerConnAbort
| | indicator in | |
| | peer's RMBE) | |
| +-----------------------+ |
| | | |
| Active Abort: | | Passive Abort: |
| problem, | | PeerConnAbort |
| send | | received, |
| PeerConnAbort,| | ECONNRESET |
| ECONNABORTED | | |
| V V |
| +--------------+ +--------------+ |
+-------|PeerAbortWait | | ProcessAbort |------+
+--------------+ +--------------+
Implementation notes beyond RFC 7609:
A PNET table in sysfs provides the mapping between network device names and
RoCE Infiniband device names for the transparent switch of data communication.
A PNET table can contain an arbitrary number of PNETIDs.
Each PNETID contains exactly one (Ethernet) network device name
and one or more RoCE Infiniband device names.
Each device name can only exist in at most one PNETID (no overlapping).
This initial Linux implementation allows at most one RoCE Infiniband device
name per PNETID.
After a new TCP connection is established, the network device name
used for egress traffic with the TCP connection's local source IP address
is used as key to lookup the unique PNETID, and the RoCE Infiniband device
of this PNETID is used to switch data communication from TCP to RDMA
during SMC CLC handshake.
Problem determination:
A protocol dissector is available with upstream wireshark for formatting
SMC-R related RoCE LAN traffic.
[https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]
We are working on enhancing the Linux implementation to cover:
- Improve default socket closing asynchronicity
- Address corner cases with many parallel connections
- Tracing
- Integrated load balancing and fail-over within a link group
- Splice and sendpage support
- IPv6 addressing support
- Keepalive, Cork
- Namespaces support
- Urgent data
- More socket options
- Diagnostics
- Statistics support
- SNMP support
References:
[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:26 +0000 (16:55 +0100)]
smc: netlink interface for SMC sockets
Support for SMC socket monitoring via netlink sockets of protocol
NETLINK_SOCK_DIAG.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:25 +0000 (16:55 +0100)]
smc: socket closing and linkgroup cleanup
smc_shutdown() and smc_release() handling
delayed linkgroup cleanup for linkgroups without connections
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:24 +0000 (16:55 +0100)]
smc: receive data from RMBE
move RMBE data into user space buffer and update managing cursors
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:23 +0000 (16:55 +0100)]
smc: send data (through RDMA)
copy data to kernel send buffer, and trigger RDMA write
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:22 +0000 (16:55 +0100)]
smc: connection data control (CDC)
send and receive CDC messages (via IB message send and CQE)
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:21 +0000 (16:55 +0100)]
smc: link layer control (LLC)
send and receive LLC messages CONFIRM_LINK (via IB message send and CQE)
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:20 +0000 (16:55 +0100)]
smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR
Prepare the link for RDMA transport:
Create a queue pair (QP) and move it into the state Ready-To-Receive (RTR).
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:19 +0000 (16:55 +0100)]
smc: work request (WR) base for use by LLC and CDC
The base containers for RDMA transport are work requests and completion
queue entries processed through Infiniband verbs:
* allocate and initialize these areas
* map these areas to DMA
* implement the basic communication consisting of work request posting
and receival of completion queue events
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:18 +0000 (16:55 +0100)]
smc: remote memory buffers (RMBs)
* allocate data RMB memory for sending and receiving
* size depends on the maximum socket send and receive buffers
* allocated RMBs are kept during life time of the owning link group
* map the allocated RMBs to DMA
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:17 +0000 (16:55 +0100)]
smc: connection and link group creation
* create smc_connection for SMC-sockets
* determine suitable link group for a connection
* create a new link group if necessary
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:16 +0000 (16:55 +0100)]
smc: CLC handshake (incl. preparation steps)
* CLC (Connection Layer Control) handshake
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Richter [Mon, 9 Jan 2017 15:55:15 +0000 (16:55 +0100)]
smc: establish pnet table management
Connection creation with SMC-R starts through an internal
TCP-connection. The Ethernet interface for this TCP-connection is not
restricted to the Ethernet interface of a RoCE device. Any existing
Ethernet interface belonging to the same physical net can be used, as
long as there is a defined relation between the Ethernet interface and
some RoCE devices. This relation is defined with the help of an
identification string called "Physical Net ID" or short "pnet ID".
Information about defined pnet IDs and their related Ethernet
interfaces and RoCE devices is stored in the SMC-R pnet table.
A pnet table entry consists of the identifying pnet ID and the
associated network and IB device.
This patch adds pnet table configuration support using the
generic netlink message interface referring to network and IB device
by their names. Commands exist to add, delete, and display pnet table
entries, and to flush or display the entire pnet table.
There are cross-checks to verify whether the ethernet interfaces
or infiniband devices really exist in the system. If either device
is not available, the pnet ID entry is not created.
Loss of network devices and IB devices is also monitored;
a pnet ID entry is removed when an associated network or
IB device is removed.
Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:14 +0000 (16:55 +0100)]
smc: introduce SMC as an IB-client
* create a list of SMC IB-devices
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:13 +0000 (16:55 +0100)]
smc: establish new socket family
* enable smc module loading and unloading
* register new socket family
* basic smc socket creation and deletion
* use backing TCP socket to run CLC (Connection Layer Control)
handshake of SMC protocol
* Setup for infiniband traffic is implemented in follow-on patches.
For now fallback to TCP socket is always used.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ursula Braun [Mon, 9 Jan 2017 15:55:12 +0000 (16:55 +0100)]
net: introduce keepalive function in struct proto
Direct call of tcp_set_keepalive() function from protocol-agnostic
sock_setsockopt() function in net/core/sock.c violates network
layering. And newly introduced protocol (SMC-R) will need its own
keepalive function. Therefore, add "keepalive" function pointer
to "struct proto", and call it from sock_setsockopt() via this pointer.
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:55:18 +0000 (15:55 -0500)]
Merge branch 'sh_eth-wol'
Niklas Söderlund says:
====================
sh_eth: add wake-on-lan support via magic packet
This series adds support for Wake-on-Lan using Magic Packet for a few
models of the sh_eth driver. Patch 1/6 fix a naming error, patch 2/6
adds generic support to control and support WoL while patches 3/6 - 6/6
enable different models.
Based ontop of net-next master.
Changes since v2.
- Fix bookkeeping for "active_count" and "event_count" reported in
/sys/kernel/debug/wakeup_sources. Thanks Geert for noticing this.
- Add new patch 1/6 which corrects the name of ECMR_MPDE bit, suggested
by Sergei.
- s/sh7743/sh7734/ in patch 5/6. Thanks Geert for spotting this.
- Spelling improvements suggested by Sergei and Geert.
- Add Tested-by to 3/6 and 4/6.
Changes since v1.
- Split generic WoL functionality and device enablement to different
patches.
- Enable more devices then Gen2 after feedback from Geert and
datasheets.
- Do not set mdp->irq_enabled = false and remove specific MagicPacket
interrupt clearing, instead let sh_eth_error() clear the interrupt as
for other EMAC interrupts, thanks Sergei for the suggestion.
- Use the original return logic in sh_eth_resume().
- Moved sh_eth_private variable *clk to top of data structure to avoid
possible gaps due to alignment restrictions.
- Make wol_enabled in sh_eth_private part of the already existing
bitfield instead of a bool.
- Do not initiate mdp->wol_enabled to 0, the struct is kzalloc'ed so
it's already set to 0.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:09 +0000 (16:34 +0100)]
sh_eth: enable wake-on-lan for sh7763
This is based on public datasheet for sh7763 which shows it has the
same behavior and registers for WoL as other versions of sh_eth.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:08 +0000 (16:34 +0100)]
sh_eth: enable wake-on-lan for sh7734
This is based on public datasheet for sh7734 which shows it has the
same behavior and registers for WoL as other versions of sh_eth.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:07 +0000 (16:34 +0100)]
sh_eth: enable wake-on-lan for r8a7740/armadillo
Geert Uytterhoeven reported WoL worked on his Armadillo board.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:06 +0000 (16:34 +0100)]
sh_eth: enable wake-on-lan for R-Car Gen2 devices
Tested on Gen2 r8a7791/Koelsch.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:05 +0000 (16:34 +0100)]
sh_eth: add generic wake-on-lan support via magic packet
Add generic functionality to support Wake-on-LAN using MagicPacket which
are supported by at least a few versions of sh_eth. Only add
functionality for WoL, no specific sh_eth versions are marked to support
WoL yet.
WoL is enabled in the suspend callback by setting MagicPacket detection
and disabling all interrupts expect MagicPacket. In the resume path the
driver needs to reset the hardware to rearm the WoL logic, this prevents
the driver from simply restoring the registers and to take advantage of
that sh_eth was not suspended to reduce resume time. To reset the
hardware the driver closes and reopens the device just like it would do
in a normal suspend/resume scenario without WoL enabled, but it both
closes and opens the device in the resume callback since the device
needs to be open for WoL to work.
One quirk needed for WoL is that the module clock needs to be prevented
from being switched off by Runtime PM. To keep the clock alive the
suspend callback need to call clk_enable() directly to increase the
usage count of the clock. Then when Runtime PM decreases the clock usage
count it won't reach 0 and be switched off.
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Niklas Söderlund [Mon, 9 Jan 2017 15:34:04 +0000 (16:34 +0100)]
sh_eth: use correct name for ECMR_MPDE bit
This bit was wrongly named due to a typo, Sergei checked the SH7734/63
manuals and this bit should be named MPDE.
Suggested-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:49:13 +0000 (15:49 -0500)]
Merge branch 'icmp-reply-optimize'
Jesper Dangaard Brouer says:
====================
net: optimize ICMP-reply code path
This patchset is optimizing the ICMP-reply code path, for ICMP packets
that gets rate limited. A remote party can easily trigger this code
path by sending packets to port number with no listening service.
Generally the patchset moves the sysctl_icmp_msgs_per_sec ratelimit
checking to earlier in the code path and removes an allocation.
Use-case: The specific case I experienced this being a bottleneck is,
sending UDP packets to a port with no listener, which obviously result
in kernel replying with ICMP Destination Unreachable (type:3), Port
Unreachable (code:3), which cause the bottleneck.
After Eric and Paolo optimized the UDP socket code, the kernels PPS
processing capabilities is lower for no-listen ports, than normal UDP
sockets. This is bad for capacity planning when restarting a service.
UDP no-listen benchmark 8xCPUs using pktgen_sample04_many_flows.sh:
Baseline: 6.6 Mpps
Patch: 14.7 Mpps
Driver mlx5 at 50Gbit/s.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 9 Jan 2017 15:04:14 +0000 (16:04 +0100)]
net: for rate-limited ICMP replies save one atomic operation
It is possible to avoid the atomic operation in icmp{v6,}_xmit_lock,
by checking the sysctl_icmp_msgs_per_sec ratelimit before these calls,
as pointed out by Eric Dumazet, but the BH disabled state must be correct.
The icmp_global_allow() call states it must be called with BH
disabled. This protection was given by the calls icmp_xmit_lock and
icmpv6_xmit_lock. Thus, split out local_bh_disable/enable from these
functions and maintain it explicitly at callers.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 9 Jan 2017 15:04:09 +0000 (16:04 +0100)]
net: reduce cycles spend on ICMP replies that gets rate limited
This patch split the global and per (inet)peer ICMP-reply limiter
code, and moves the global limit check to earlier in the packet
processing path. Thus, avoid spending cycles on ICMP replies that
gets limited/suppressed anyhow.
The global ICMP rate limiter icmp_global_allow() is a good solution,
it just happens too late in the process. The kernel goes through the
full route lookup (return path) for the ICMP message, before taking
the rate limit decision of not sending the ICMP reply.
Details: The kernels global rate limiter for ICMP messages got added
in commit
4cdf507d5452 ("icmp: add a global rate limitation"). It is
a token bucket limiter with a global lock. It brilliantly avoids
locking congestion by only updating when 20ms (HZ/50) were elapsed. It
can then avoids taking lock when credit is exhausted (when under
pressure) and time constraint for refill is not yet meet.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Mon, 9 Jan 2017 15:04:04 +0000 (16:04 +0100)]
Revert "icmp: avoid allocating large struct on stack"
This reverts commit
9a99d4a50cb8 ("icmp: avoid allocating large struct
on stack"), because struct icmp_bxm no really a large struct, and
allocating and free of this small 112 bytes hurts performance.
Fixes: 9a99d4a50cb8 ("icmp: avoid allocating large struct on stack")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:47:52 +0000 (15:47 -0500)]
Merge tag 'rxrpc-rewrite-
20170109' of git://git./linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
afs: Refcount afs_call struct
These patches provide some tracepoints for AFS and fix a potential leak by
adding refcounting to the afs_call struct.
The patches are:
(1) Add some tracepoints for logging incoming calls and monitoring
notifications from AF_RXRPC and data reception.
(2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
initially expected. It can be brought back later if needed. This
clears some stuff out that I don't then need to fix up in (4).
(3) Allow listen(..., 0) to be used to disable listening. This makes
shutting down the AFS cache manager server in the kernel much easier
and the accounting simpler as we can then be sure that (a) all
preallocated afs_call structs are relesed and (b) no new incoming
calls are going to be started.
For the moment, listening cannot be reenabled.
(4) Add refcounting to the afs_call struct to fix a potential multiple
release detected by static checking and add a tracepoint to follow the
lifecycle of afs_call objects.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:44:57 +0000 (15:44 -0500)]
Merge branch 'dsa_swqitch_ops-const'
Florian Fainelli says:
====================
net: dsa: Make dsa_switch_ops const
This patch series allows us to annotate dsa_switch_ops with a const
qualifier.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 22:52:08 +0000 (14:52 -0800)]
net: dsa: Make dsa_switch_ops const
Now that we have properly encapsulated and made drivers utilize exported
functions, we can switch dsa_switch_ops to be a annotated with const.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 22:52:07 +0000 (14:52 -0800)]
net: dsa: Encapsulate legacy switch drivers into dsa_switch_driver
In preparation for making struct dsa_switch_ops const, encapsulate it
within a dsa_switch_driver which has a list pointer and a pointer to
dsa_switch_ops. This allows us to take the list_head pointer out of
dsa_switch_ops, which is written to by {un,}register_switch_driver.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 22:52:06 +0000 (14:52 -0800)]
net: dsa: bcm_sf2: Declare our own dsa_switch_ops
Utilize the b53 exported functions to fill our bcm_sf2_ops structure,
also making it clear what we utilize and what we specifically override.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sun, 8 Jan 2017 22:52:05 +0000 (14:52 -0800)]
net: dsa: b53: Export most operations to other drivers
In preparation for making dsa_switch_ops const, export b53 operations
utilized by other drivers such as bcm_sf2.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:40:57 +0000 (15:40 -0500)]
Merge branch 'sh_eth-csum'
Sergei Shtylyov says:
====================
sh_eth: "intgelligent checksum" related cleanups
Here's a set of 2 patches against DaveM's 'net.git' repo, as they are based
on a couple patches merged there recently; however, the patches are destined
for 'net-next.git' (once 'net.git' gets merged there next time). I'm cleaning
up the "intelligent checksum" related code (however, the driver only disables
this feature for now, theres's no proper offload supprt yet).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Fri, 6 Jan 2017 21:03:37 +0000 (00:03 +0300)]
sh_eth: rename 'sh_eth_cpu_data::hw_crc'
The 'struct sh_eth_cpu_data' field indicating the "intelligent checksum"
support was misnamed 'hw_crc' -- rename it to 'hw_checksum'.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Fri, 6 Jan 2017 21:02:52 +0000 (00:02 +0300)]
sh_eth: get rid of 'sh_eth_cpu_data::shift_rd0'
After checking all the available manuals, I have enough information to
conclude that the 'shift_rd0' flag is only relevant for the Ether cores
supporting so called "intelligent checksum" (and hence having CSMR) which
is indicated by the 'hw_crc' flag. Since all the relevant SoCs now have
both these flags set, we can at last get rid of the former flag...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 20:39:11 +0000 (15:39 -0500)]
Merge git://git./linux/kernel/git/davem/net
Zefir Kurtisi [Fri, 6 Jan 2017 11:14:48 +0000 (12:14 +0100)]
phy state machine: failsafe leave invalid RUNNING state
While in RUNNING state, phy_state_machine() checks for link changes by
comparing phydev->link before and after calling phy_read_status().
This works as long as it is guaranteed that phydev->link is never
changed outside the phy_state_machine().
If in some setups this happens, it causes the state machine to miss
a link loss and remain RUNNING despite phydev->link being 0.
This has been observed running a dsa setup with a process continuously
polling the link states over ethtool each second (SNMPD RFC-1213
agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
call phy_read_status() and with that modify the link status - and
with that bricking the phy state machine.
This patch adds a fail-safe check while in RUNNING, which causes to
move to CHANGELINK when the link is gone and we are still RUNNING.
Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 9 Jan 2017 19:58:28 +0000 (11:58 -0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.
2) Fix out of bounds access in nf_tables discovered by KASAN, from
Florian Westphal.
3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.
4) Fix unicast filtering in be2net driver, from Ivan Vecera.
5) tg3_get_stats64() can race with driver close and ethtool
reconfigurations, fix from Michael Chan.
6) Fix error handling when pass limit is reached in bpf code gen on
x86. From Daniel Borkmann.
7) Don't clobber switch ops and use proper MDIO nested reads and writes
in bcm_sf2 driver, from Florian Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
net: dsa: bcm_sf2: Utilize nested MDIO read/write
net: dsa: bcm_sf2: Do not clobber b53_switch_ops
net: stmmac: fix maxmtu assignment to be within valid range
bpf: change back to orig prog on too many passes
tg3: Fix race condition in tg3_get_stats64().
be2net: fix unicast list filling
be2net: fix accesses to unicast list
netlabel: add CALIPSO to the list of built-in protocols
vti6: fix device register to report IFLA_INFO_KIND
net: phy: dp83867: fix irq generation
amd-xgbe: Fix IRQ processing when running in single IRQ mode
sh_eth: R8A7740 supports packet shecksumming
sh_eth: fix EESIPR values for SH77{34|63}
r8169: fix the typo in the comment
nl80211: fix sched scan netlink socket owner destruction
bridge: netfilter: Fix dropping packets that moving through bridge interface
netfilter: ipt_CLUSTERIP: check duplicate config when initializing
netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
netfilter: nf_tables: fix oob access
netfilter: nft_queue: use raw_smp_processor_id()
...
David S. Miller [Mon, 9 Jan 2017 19:54:30 +0000 (14:54 -0500)]
Merge branch 'dwmac-dwc-qos-eth'
Joao Pinto says:
====================
adding new glue driver dwmac-dwc-qos-eth
This patch set contains the porting of the synopsys/dwc_eth_qos.c driver
to the stmmac structure. This operation resulted in the creation of a new
platform glue driver called dwmac-dwc-qos-eth which was based in the
dwc_eth_qos as is.
dwmac-dwc-qos-eth inherited dwc_eth_qos DT bindings, to assure that current
and old users can continue to use it as before. We can see this driver as
being deprecated, since all new development will be done in stmmac.
Please check each patch for implementation details.
====================
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jpinto [Mon, 9 Jan 2017 12:35:10 +0000 (12:35 +0000)]
stmmac: adding new glue driver dwmac-dwc-qos-eth
This patch adds a new glue driver called dwmac-dwc-qos-eth which
was based in the dwc_eth_qos as is. To assure retro-compatibility a slight
tweak was also added to stmmac_platform.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jpinto [Mon, 9 Jan 2017 12:35:09 +0000 (12:35 +0000)]
stmmac: move stmmac_clk, pclk, clk_ptp_ref and stmmac_rst to platform structure
This patch moves stmmac_clk, pclk, clk_ptp_ref and stmmac_rst to the
plat_stmmacenet_data structure. It also moves these platform variables
initialization to stmmac_platform. This was done for two reasons:
a) If PCI is used, platform related code is being executed in stmmac_main
resulting in warnings that have no sense and conceptually was not right
b) stmmac as a synopsys reference ethernet driver stack will be hosting
more and more drivers to its structure like synopsys/dwc_eth_qos.c.
These drivers have their own DT bindings that are not compatible with
stmmac's. One of the most important are the clock names, and so they need
to be parsed in the glue logic and initialized there, and that is the main
reason why the clocks were passed to the platform structure.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jpinto [Mon, 9 Jan 2017 12:35:08 +0000 (12:35 +0000)]
stmmac: adding DT parameter for LPI tx clock gating
This patch adds a new parameter to the stmmac DT: snps,en-tx-lpi-clockgating.
It was ported from synopsys/dwc_eth_qos.c and it is useful if lpi tx clock
gating is needed by stmmac users also.
Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Mon, 9 Jan 2017 12:15:55 +0000 (13:15 +0100)]
alx: add feature flag for rx checksumming
The code to handle rx checksumming was in the driver since its introduction
but for reasons unknown the feature flag was left out. Now it is possible
to enable this feature with ethtool.
Tested on my AR8161 ethernet card, there are no regressions observed in
netperf if this feature is enabled.
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 19:36:58 +0000 (14:36 -0500)]
Merge branch 'act_csum-sctp'
Davide Caratti says:
====================
net/sched: act_csum: add support for SCTP checksum
This series extends current act_csum functionality to allow computation of
SCTP checksums. Patch 1 ensures LIBCRC32C will be selected if NET_ACT_CSUM
is selected. Patch 2 extends act_csum to handle IPPROTO_SCTP protocol in
IPv4/IPv6 header, and eventually compute the CRC32c value.
v2:
- style fix in tc_csum.h
- avoid nested if statement in act_csum.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Mon, 9 Jan 2017 10:24:21 +0000 (11:24 +0100)]
net/sched: act_csum: compute crc32c on SCTP packets
modify act_csum to compute crc32c on IPv4/IPv6 packets having SCTP in
their payload, and extend UAPI definitions accordingly.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Mon, 9 Jan 2017 10:24:20 +0000 (11:24 +0100)]
net/sched: Kconfig: select LIBCRC32C if NET_ACT_CSUM is selected
LIBCRC32C is needed to compute crc32c on SCTP packets.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 9 Jan 2017 19:35:15 +0000 (14:35 -0500)]
Merge branch 'mlxsw-small-driver-update'
Jiri Pirko says:
====================
mlxsw: small driver update
This patchset contains various small "non-net" fixes and enhancements.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yotam Gigi [Mon, 9 Jan 2017 10:25:48 +0000 (11:25 +0100)]
mlxsw: spectrum: Change ENOTSUPP to EOPNOTSUPP
As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlxsw driver.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yotam Gigi [Mon, 9 Jan 2017 10:25:47 +0000 (11:25 +0100)]
mlxsw: spectrum: Fix order of commands in port remove function
Fix the order of the free directives to match the port init function
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>