openwrt/staging/blogic.git
6 years ago iwlwifi: mvm: move queue reconfiguration into new function
Johannes Berg [Wed, 4 Jul 2018 14:11:14 +0000 (16:11 +0200)]
iwlwifi: mvm: move queue reconfiguration into new function

If TVQM is used we skip over this, move the code into a new
function to get rid of the label.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: clean up iteration in iwl_mvm_inactivity_check()
Johannes Berg [Wed, 4 Jul 2018 11:06:53 +0000 (13:06 +0200)]
iwlwifi: mvm: clean up iteration in iwl_mvm_inactivity_check()

There's no need to build a bitmap first and then iterate,
just do the iteration with the right locking directly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: remove per-queue hw refcount
Johannes Berg [Wed, 4 Jul 2018 09:58:28 +0000 (11:58 +0200)]
iwlwifi: mvm: remove per-queue hw refcount

There's no need to have a hw refcount if we just mark the
command queue with a (fake) TID; at that point, the refcount
becomes equivalent to the hweight() of the TID bitmap.
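
As an illustration only (a userspace sketch, not the driver code): if
every user of a queue sets one bit in its TID bitmap, and the command
queue is marked with a fake TID, the reference count is just the
population count of that bitmap. The names queue_info_sketch,
popcount16() and queue_refcount() are hypothetical, the 16-bit bitmap
width is an assumption, and popcount16() stands in for the kernel's
hweight16().

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-queue bookkeeping. */
struct queue_info_sketch {
	uint16_t tid_bitmap;	/* one bit per TID mapped to this queue */
};

/* Portable population count, standing in for the kernel's hweight16(). */
static unsigned int popcount16(uint16_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= (uint16_t)(v - 1);	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* The "refcount" is simply the number of TIDs that use the queue. */
static unsigned int queue_refcount(const struct queue_info_sketch *q)
{
	return popcount16(q->tid_bitmap);
}
```

With this view, a queue is unused exactly when its bitmap is zero, so
no separate counter has to be kept in sync.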

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: move queue management into sta.c
Johannes Berg [Wed, 4 Jul 2018 09:38:34 +0000 (11:38 +0200)]
iwlwifi: mvm: move queue management into sta.c

None of these functions really need to be separate, they're all
only used in sta.c, move them there and make them static.

Fix a small typo in related code while at it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: give TX queue info struct a name
Johannes Berg [Tue, 3 Jul 2018 14:00:53 +0000 (16:00 +0200)]
iwlwifi: mvm: give TX queue info struct a name

Make this a named struct rather than an anonymous one,
we'll want to refer to it by name later.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: move rt status check to the start of the resume flow
Shahar S Matityahu [Wed, 4 Jul 2018 14:12:49 +0000 (17:12 +0300)]
iwlwifi: mvm: move rt status check to the start of the resume flow

Move the rt status checking to the start of the resume flow in order
to avoid sending D0I3_END_CMD to the FW.  Also, collect dump if an
assert was encountered.

Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: dump debug data before stop device
Shahar S Matityahu [Mon, 11 Jun 2018 07:46:58 +0000 (10:46 +0300)]
iwlwifi: dump debug data before stop device

Debug data dump does not work in flows that stop the device as part
of their error handling. During these flows the op mode mutex is
locked until the device stops.  Because of that, any assert generated
by the firmware can be handled only after the device has already
stopped.

Since dumping cannot occur after stopping the device, split the
dump function into two parts: one that handles locking, and one
that starts the actual dumping. Call the second part from the op mode
stop device function.

Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: use fast balance scan in case of DCM mode with P2P GO
Ayala Beker [Wed, 4 Jul 2018 09:08:23 +0000 (12:08 +0300)]
iwlwifi: mvm: use fast balance scan in case of DCM mode with P2P GO

Currently in case of DCM with P2P GO where BSS DTIM interval < 220 msec
the fw fails to allocate events for the P2P GO dtim due to long passive
scan events.

Fix this by requesting all scans in this scenario to be fragmented with
fast balance scan time settings.  The only exception is when a
fragmented scan was already planned for low latency or high
throughput reasons; in that case, keep the scan timing as planned.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: introduce a new fragmented scan type: fast balance
Ayala Beker [Wed, 4 Jul 2018 09:00:27 +0000 (12:00 +0300)]
iwlwifi: mvm: introduce a new fragmented scan type: fast balance

Fast balance scan is similar to SCAN_TYPE_MILD, but this scan is
fragmented and has a shorter out-of-operating-channel time,
and therefore better matches low latency scenarios.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: trace: change trace to trace one TB at a time
Sara Sharon [Tue, 3 Jul 2018 10:29:52 +0000 (13:29 +0300)]
iwlwifi: trace: change trace to trace one TB at a time

Split TX tracing to be per TB. This is needed now that
AMSDUs can be sent and an skb can be larger than the trace
limit.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: pcie: don't pad AMSDU packets
Sara Sharon [Mon, 2 Jul 2018 11:48:15 +0000 (14:48 +0300)]
iwlwifi: pcie: don't pad AMSDU packets

When we TX AMSDU, we shouldn't pad the packet. In the past,
we were building AMSDU only in transport layer, and gen2
functions are built based on this. However, now that op mode
may build AMSDUs, we need to take care of padding also in
gen2 "non-pcie-amsdu" path.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: don't send keys when entering D3
Sara Sharon [Sun, 1 Jul 2018 11:52:06 +0000 (14:52 +0300)]
iwlwifi: mvm: don't send keys when entering D3

In the past, we needed to program the keys when entering D3. This was
since we replaced the image. However, now that there is a single
image, this is no longer needed.  Note that RSC is sent separately in
a new command.  This solves issues with newer devices that support PN
offload. Since the driver re-sent the keys, the PN got zeroed and the
receiver dropped the following packets until the PN caught up again.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago net: vhost: remove bad code line
Tonghao Zhang [Mon, 8 Oct 2018 01:41:50 +0000 (18:41 -0700)]
net: vhost: remove bad code line

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago net: sched: pie: fix coding style issues
Leslie Monis [Sat, 6 Oct 2018 19:52:45 +0000 (01:22 +0530)]
net: sched: pie: fix coding style issues

Fix 5 warnings and 14 checks issued by checkpatch.pl:

CHECK: Logical continuations should be on the previous line
+ if ((q->vars.qdelay < q->params.target / 2)
+     && (q->vars.prob < MAX_PROB / 5))

WARNING: line over 80 characters
+ q->params.tupdate = usecs_to_jiffies(nla_get_u32(tb[TCA_PIE_TUPDATE]));

CHECK: Blank lines aren't necessary after an open brace '{'
+{
+

CHECK: braces {} should be used on all arms of this statement
+ if (qlen < QUEUE_THRESHOLD)
[...]
+ else {
[...]

CHECK: Unbalanced braces around else statement
+ else {

CHECK: No space is necessary after a cast
+ if (delta > (s32) (MAX_PROB / (100 / 2)) &&

CHECK: Unnecessary parentheses around 'qdelay == 0'
+ if ((qdelay == 0) && (qdelay_old == 0) && update_prob)

CHECK: Unnecessary parentheses around 'qdelay_old == 0'
+ if ((qdelay == 0) && (qdelay_old == 0) && update_prob)

CHECK: Unnecessary parentheses around 'q->vars.prob == 0'
+ if ((q->vars.qdelay < q->params.target / 2) &&
+     (q->vars.qdelay_old < q->params.target / 2) &&
+     (q->vars.prob == 0) &&
+     (q->vars.avg_dq_rate > 0))

CHECK: Unnecessary parentheses around 'q->vars.avg_dq_rate > 0'
+ if ((q->vars.qdelay < q->params.target / 2) &&
+     (q->vars.qdelay_old < q->params.target / 2) &&
+     (q->vars.prob == 0) &&
+     (q->vars.avg_dq_rate > 0))

CHECK: Blank lines aren't necessary before a close brace '}'
+
+}

CHECK: Comparison to NULL could be written "!opts"
+ if (opts == NULL)

CHECK: No space is necessary after a cast
+ ((u32) PSCHED_TICKS2NS(q->params.target)) /

WARNING: line over 80 characters
+     nla_put_u32(skb, TCA_PIE_TUPDATE, jiffies_to_usecs(q->params.tupdate)) ||

CHECK: Blank lines aren't necessary before a close brace '}'
+
+}

CHECK: No space is necessary after a cast
+ .delay = ((u32) PSCHED_TICKS2NS(q->vars.qdelay)) /

WARNING: Missing a blank line after declarations
+ struct sk_buff *skb;
+ skb = qdisc_dequeue_head(sch);

WARNING: Missing a blank line after declarations
+ struct pie_sched_data *q = qdisc_priv(sch);
+ qdisc_reset_queue(sch);

WARNING: Missing a blank line after declarations
+ struct pie_sched_data *q = qdisc_priv(sch);
+ q->params.tupdate = 0;

Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago bnxt_en: Remove unnecessary unsigned integer comparison and initialize variable
Gustavo A. R. Silva [Fri, 5 Oct 2018 20:12:09 +0000 (22:12 +0200)]
bnxt_en: Remove unnecessary unsigned integer comparison and initialize variable

There is no need to compare *val.vu32* with < 0 because
such variable is of type u32 (32 bits, unsigned), making it
impossible to hold a negative value. Fix this by removing
such comparison.

Also, initialize variable *max_val* to -1, just in case
it is not initialized to either BNXT_MSIX_VEC_MAX or
BNXT_MSIX_VEC_MIN_MAX before using it in a comparison
with val.vu32 at line 159:

if (val.vu32 > max_val)
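
A minimal userspace sketch of the two points above, not the driver
code: an unsigned value needs no "< 0" test, and a limit initialized
to -1 becomes the largest unsigned value, so the range check can never
use an uninitialized limit. The enum, the limit values and
msix_value_ok() are hypothetical placeholders, and the unsigned type
of max_val is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter ids and limits; not the driver's values. */
enum param_id_sketch { PARAM_VEC_MAX, PARAM_VEC_MIN, PARAM_OTHER };
#define VEC_MAX_SKETCH		1280u
#define VEC_MIN_MAX_SKETCH	128u

static bool msix_value_ok(enum param_id_sketch id, uint32_t val)
{
	/* (uint32_t)-1 == UINT32_MAX: "no limit" unless set below. */
	uint32_t max_val = (uint32_t)-1;

	if (id == PARAM_VEC_MAX)
		max_val = VEC_MAX_SKETCH;
	else if (id == PARAM_VEC_MIN)
		max_val = VEC_MIN_MAX_SKETCH;

	/* No "val < 0" test: an unsigned value can never be negative. */
	return val <= max_val;
}
```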

Addresses-Coverity-ID: 1473915 ("Unsigned compared against 0")
Addresses-Coverity-ID: 1473920 ("Uninitialized scalar variable")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago Merge tag 'wireless-drivers-next-for-davem-2018-10-07' of git://git.kernel.org/pub...
David S. Miller [Sun, 7 Oct 2018 17:31:24 +0000 (10:31 -0700)]
Merge tag 'wireless-drivers-next-for-davem-2018-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.20

Second set of patches for 4.20. Heavy refactoring on mt76 continues
and the usual drivers in active development (iwlwifi, qtnfmac, ath10k)
getting new features. And as always, fixes and cleanup all over.

Major changes:

mt76

* more major refactoring to make it easier to add new hardware support

* more work on mt76x0e support

* support for getting firmware version via ethtool

* add mt7650 PCI ID

iwlwifi

* HE radiotap cleanup and improvements

* reorder channel optimization for scans

* bump the FW API version

qtnfmac

* fixes for 'iw' output: rates for enabled SGI, 'dump station'

* expose more scan features to host: scan flush and dwell time

* inform cfg80211 when OBSS is not supported by firmware

wlcore

* add support for optional wakeirq

ath10k

* retrieve MAC address from system firmware if provided

* support extended board data download for dual-band QCA9984

* extended per sta tx statistics support via debugfs

* average ack rssi support for data frames

* speed up QCA6174 and QCA9377 firmware download using diag Copy
  Engine

* HTT High Latency mode support needed by SDIO and USB support

* get STA power save state via debugfs

ath9k

* add reset functionality for airtime station debugfs file
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Sat, 6 Oct 2018 21:43:42 +0000 (14:43 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

6 years ago Merge tag 'mt76-for-kvalo-2018-10-05' of https://github.com/nbd168/wireless
Kalle Valo [Sat, 6 Oct 2018 11:22:47 +0000 (14:22 +0300)]
Merge tag 'mt76-for-kvalo-2018-10-05' of https://github.com/nbd168/wireless

mt76 patches for 4.20

* unify code between mt76x0, mt76x2
* mt76x0 fixes
* another fix for rx buffer allocation regression on usb
* move mt76x2 source files to mt76x2 folder
* more work on mt76x0e support

6 years ago Merge tag 'iwlwifi-next-for-kalle-2018-10-06' of git://git.kernel.org/pub/scm/linux...
Kalle Valo [Sat, 6 Oct 2018 09:50:14 +0000 (12:50 +0300)]
Merge tag 'iwlwifi-next-for-kalle-2018-10-06' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next

Third set of iwlwifi patches for 4.20

* Fix for a race condition that caused the FW to crash;
* HE radiotap cleanup and improvements;
* Reorder channel optimization for scans;
* Bumped the FW API version supported after the last API change for
  this release;
* Debugging improvements;
* A few bug fixes;
* Some cleanups in preparation for a new implementation;
* Other small improvements, cleanups and fixes.

6 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Greg Kroah-Hartman [Sat, 6 Oct 2018 09:11:30 +0000 (02:11 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Dave writes:
  "Networking fixes:

  1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.

  2) Fix memory leak in wireless wext compat, from Stefan Seyfried.

  3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.

  4) Need to cancel pending work when unbinding in smsc75xx otherwise
     we oops, also from Yu Zhao.

  5) Don't allow enslaving a team device to itself, from Ido Schimmel.

  6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
     From Mauricio Faria.

  7) Add validation of tc policy netlink attributes, from David Ahern.

  8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
  ipv6: take rcu lock in rawv6_send_hdrinc()
  net: sched: Add policy validation for tc attributes
  rtnetlink: fix rtnl_fdb_dump() for ndmsg header
  yam: fix a missing-check bug
  net: bpfilter: Fix type cast and pointer warnings
  net: cxgb3_main: fix a missing-check bug
  bpf: 32-bit RSH verification must truncate input before the ALU op
  net: phy: phylink: fix SFP interface autodetection
  be2net: don't flip hw_features when VXLANs are added/deleted
  net/packet: fix packet drop as of virtio gso
  net: dsa: b53: Keep CPU port as tagged in all VLANs
  openvswitch: load NAT helper
  bnxt_en: get the reduced max_irqs by the ones used by RDMA
  bnxt_en: free hwrm resources, if driver probe fails.
  bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
  bnxt_en: Fix VNIC reservations on the PF.
  team: Forbid enslaving team device to itself
  net/usb: cancel pending work when unbinding smsc75xx
  mlxsw: spectrum: Delete RIF when VLAN device is removed
  ...

6 years ago iwlwifi: dbg: make trigger functions type agnostic
Sara Sharon [Thu, 21 Jun 2018 11:44:28 +0000 (14:44 +0300)]
iwlwifi: dbg: make trigger functions type agnostic

As preparation for new trigger type, make iwl_fw_dbg_collect_desc
agnostic to the trigger structure.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: dbg: decrement occurrences for all triggers
Sara Sharon [Thu, 21 Jun 2018 11:24:45 +0000 (14:24 +0300)]
iwlwifi: dbg: decrement occurrences for all triggers

iwl_fw_dbg_collect can be called by any function that already
has the error string ready. iwl_fw_dbg_collect_trig, on the
other hand, does string formatting. The occurrences decrement
is at iwl_fw_dbg_collect_trig, instead of iwl_fw_dbg_collect,
which causes it to sometimes be skipped. Move it to the right
location.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: use match_string() helper
Yisheng Xie [Mon, 21 May 2018 11:57:44 +0000 (19:57 +0800)]
iwlwifi: mvm: use match_string() helper

match_string() returns the index of the matching string in an array,
which can be used instead of the open-coded variant.
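
For readers unfamiliar with the helper, here is a userspace sketch of
the behaviour relied on: scan at most n entries, stop at a NULL entry,
return the index of the first exact match or -EINVAL. The function
match_string_sketch() is a re-implementation written for this note,
and the modes[] table and parse_mode() are hypothetical; none of this
is the driver's code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace re-implementation of the lookup semantics described above. */
static int match_string_sketch(const char * const *array, size_t n,
			       const char *string)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!array[i])		/* a NULL entry ends the table early */
			break;
		if (!strcmp(array[i], string))
			return (int)i;
	}
	return -EINVAL;
}

/* Hypothetical option table, only for illustration. */
static const char * const modes[] = { "off", "auto", "on" };

/* One call replaces an open-coded loop over modes[]. */
static int parse_mode(const char *buf)
{
	return match_string_sketch(modes, sizeof(modes) / sizeof(modes[0]),
				   buf);
}
```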

Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: dbg: make iwl_fw_dbg_no_trig_window trigger agnostic
Sara Sharon [Thu, 21 Jun 2018 06:43:59 +0000 (09:43 +0300)]
iwlwifi: dbg: make iwl_fw_dbg_no_trig_window trigger agnostic

As preparation for the new trigger format, make the function
agnostic to the trigger format. Instead it gets the relevant
parameters - id and delay.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: show more HE radiotap data for TB PPDUs
Johannes Berg [Tue, 19 Jun 2018 07:32:00 +0000 (09:32 +0200)]
iwlwifi: mvm: show more HE radiotap data for TB PPDUs

For trigger-based PPDUs, most values aren't part of the HE-SIG-A
because they're preconfigured by the trigger frame. However, we
still have this information since we used the trigger frame to
configure the hardware, so we can (and do) read it back out and
can thus show it in radiotap.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: decode HE information for MU (without ext info)
Johannes Berg [Tue, 19 Jun 2018 07:21:58 +0000 (09:21 +0200)]
iwlwifi: mvm: decode HE information for MU (without ext info)

When the info type is MU, we still have the data from the TSF
overload words, so should decode that. When it's MU_EXT_INFO
we additionally have the SIG-B common 0/1/2 fields.

Also document the validity depending on the info type and fix
the name of the regular TB PPDU info type accordingly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: add debugfs to send host command
Shahar S Matityahu [Mon, 28 May 2018 08:18:43 +0000 (11:18 +0300)]
iwlwifi: add debugfs to send host command

Add a debugfs file to send host commands in the mvm and fmac op modes.
This allows sending a host command at runtime via the send_hcmd debugfs file.
The command is received as a string that represents hex values.

The struct of the command is as follows:
[cmd_id][flags][length][data]
cmd_id and flags are 8 chars long each.
length is 4 chars long.
data is length * 2 chars long.
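
The layout above can be sketched as a small userspace parser. This is
only an illustration under stated assumptions, not the driver's
parser: each field is read as plain hex text with the most significant
digit first, the input has no trailing newline, and HCMD_MAX_DATA is
an arbitrary cap chosen for the sketch. The names hcmd_sketch,
parse_hex() and parse_hcmd() are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HCMD_MAX_DATA	64	/* arbitrary cap for this sketch */
#define HCMD_HDR_CHARS	20	/* 8 (cmd_id) + 8 (flags) + 4 (length) */

struct hcmd_sketch {
	uint32_t id;
	uint32_t flags;
	uint16_t len;
	uint8_t data[HCMD_MAX_DATA];
};

/* Parse exactly 'digits' hex characters from s; 0 on success, -1 on error. */
static int parse_hex(const char *s, size_t digits, uint32_t *out)
{
	uint32_t v = 0;
	size_t i;

	for (i = 0; i < digits; i++) {
		unsigned char c = (unsigned char)s[i];

		if (!isxdigit(c))	/* also stops at the terminating NUL */
			return -1;
		v = (v << 4) |
		    (uint32_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
	}
	*out = v;
	return 0;
}

/* Parse "[cmd_id][flags][length][data]"; 0 on success, -1 on error. */
static int parse_hcmd(const char *buf, struct hcmd_sketch *cmd)
{
	size_t n = strlen(buf);
	uint32_t len, byte;
	size_t i;

	if (n < HCMD_HDR_CHARS)
		return -1;
	if (parse_hex(buf, 8, &cmd->id) ||
	    parse_hex(buf + 8, 8, &cmd->flags) ||
	    parse_hex(buf + 16, 4, &len))
		return -1;
	/* data must be exactly length * 2 characters long */
	if (len > HCMD_MAX_DATA || n != HCMD_HDR_CHARS + 2 * (size_t)len)
		return -1;
	for (i = 0; i < len; i++) {
		if (parse_hex(buf + HCMD_HDR_CHARS + 2 * i, 2, &byte))
			return -1;
		cmd->data[i] = (uint8_t)byte;
	}
	cmd->len = (uint16_t)len;
	return 0;
}
```

For example, the 24-character string 000000880000000000021a2b would
decode under these assumptions to command id 0x88, flags 0, length 2
and data bytes 0x1a 0x2b.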

Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: runtime: add send host command op to firmware runtime op struct
Shahar S Matityahu [Tue, 12 Jun 2018 12:40:42 +0000 (15:40 +0300)]
iwlwifi: runtime: add send host command op to firmware runtime op struct

Add send host command op to firmware runtime op struct to allow sending
host commands to the op mode from the fw runtime context.

Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: bump firmware API version for 9000 and 22000 series devices
Johannes Berg [Wed, 9 May 2018 09:01:12 +0000 (11:01 +0200)]
iwlwifi: bump firmware API version for 9000 and 22000 series devices

Bump the firmware API version to 41.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm Support new MCC update response
Haim Dreyfuss [Mon, 4 Jun 2018 10:20:00 +0000 (13:20 +0300)]
iwlwifi: mvm Support new MCC update response

Change MCC update response API to be compatible with new FW API.
While at it, change v2, which is not in use anymore, to v3 and clean up
the mcc_update v1 command and response, which are obsolete.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: pcie: check iwl_pcie_txq_build_tfd() return value
Johannes Berg [Mon, 18 Jun 2018 07:53:36 +0000 (09:53 +0200)]
iwlwifi: pcie: check iwl_pcie_txq_build_tfd() return value

If we use the iwl_pcie_txq_build_tfd() return value for BIT(),
we should validate that it's not going to be negative, so do
the check and bail out if we hit an error. We shouldn't, as
we check if it'll fit beforehand, but better be safe.
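
The pattern can be sketched in userspace as follows. This is a
simplified illustration, not the transport code: tfd_sketch,
build_tb_sketch() and add_tb_and_mark() are hypothetical names, the
table size is arbitrary, and -1 stands in for a negative errno.
Shifting by a negative count is undefined behaviour in C, which is why
the return value has to be checked before it reaches BIT().

```c
#define BIT(n)		(1UL << (n))
#define MAX_TBS_SKETCH	4	/* arbitrary table size for this sketch */

struct tfd_sketch {
	int num_tbs;
};

/* Returns the index of the TB that was filled, or -1 if the TFD is full. */
static int build_tb_sketch(struct tfd_sketch *tfd)
{
	if (tfd->num_tbs >= MAX_TBS_SKETCH)
		return -1;
	return tfd->num_tbs++;
}

static int add_tb_and_mark(struct tfd_sketch *tfd, unsigned long *bitmap)
{
	int idx = build_tb_sketch(tfd);

	if (idx < 0)		/* bail out instead of evaluating BIT(-1) */
		return idx;

	*bitmap |= BIT(idx);
	return 0;
}
```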

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: add fall through comment
Johannes Berg [Mon, 18 Jun 2018 08:22:03 +0000 (10:22 +0200)]
iwlwifi: add fall through comment

The fall-through to the MVM case is intended as we have to do
*something* to continue, and can't easily clean up. So we'll
just fail in mvm later, if this does happen.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: pcie gen2: check iwl_pcie_gen2_set_tb() return value
Johannes Berg [Mon, 18 Jun 2018 07:53:36 +0000 (09:53 +0200)]
iwlwifi: pcie gen2: check iwl_pcie_gen2_set_tb() return value

If we use the iwl_pcie_gen2_set_tb() return value for BIT(),
we should validate that it's not going to be negative, so do
the check and bail out if we hit an error. We shouldn't, as
we check if it'll fit beforehand, but better be safe.

Fixes: ab6c644539e9 ("iwlwifi: pcie: copy TX functions to new transport")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: nvm: get num of hw addresses from firmware
Naftali Goldstein [Tue, 12 Jun 2018 06:08:40 +0000 (09:08 +0300)]
iwlwifi: nvm: get num of hw addresses from firmware

With NICs that don't read the NVM directly and instead rely on getting
the relevant data from the firmware, the number of reserved MAC
addresses was not added to the API. This caused the driver to assume
there is only one address which results in all interfaces getting the
same address. Update the API to fix this.

While at it, fix-up the comments with firmware api names to actually
match what we have in the firmware.

Fixes: e9e1ba3dbf00 ("iwlwifi: mvm: support getting nvm data from firmware")
Signed-off-by: Naftali Goldstein <naftali.goldstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: add dump collection in case alive flow fails
Shahar S Matityahu [Sun, 27 May 2018 14:17:07 +0000 (17:17 +0300)]
iwlwifi: add dump collection in case alive flow fails

Trigger dump collection if the alive flow fails, regardless of the
reason.

Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: pcie: avoid empty free RB queue
Shaul Triebitz [Wed, 6 Jun 2018 14:20:58 +0000 (17:20 +0300)]
iwlwifi: pcie: avoid empty free RB queue

If all free RB queues are empty, the driver will never restock the
free RB queue.  That's because the restocking happens in the Rx flow,
and if the free queue is empty there will be no Rx.

Although there's a background worker (a.k.a. allocator) allocating
memory for RBs so that the Rx handler can restock them, the worker may
run only after the free queue has become empty (and then it is too
late for restocking as explained above).

There is a solution for that called 'emergency': If the number of used
RBs reaches half the amount of all RBs, the Rx handler will not wait
for the allocator but immediately allocate memory for the used RBs
and restock the free queue.

But since the used RB count is per queue, it may happen that the used
RBs are spread between the queues such that the emergency check will
fail for each of the queues (and we still run out of RBs, causing the
above symptom).

To fix it, move to emergency mode if the sum of *all* used RBs (for
all Rx queues) reaches half the amount of all RBs.
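
The difference between the two checks can be sketched as below. This
is a simplified userspace model, not the driver code: rxq_sketch and
both helper functions are hypothetical, and the real driver's counters
and thresholds may be organized differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue state: how many RBs this queue has used up. */
struct rxq_sketch {
	unsigned int used_count;
};

/* Old check: looks at one queue only, so spread-out usage goes unnoticed. */
static bool emergency_per_queue(const struct rxq_sketch *q,
				unsigned int total_rbs)
{
	return q->used_count >= total_rbs / 2;
}

/* New check: sums the used RBs over all Rx queues before comparing. */
static bool emergency_all_queues(const struct rxq_sketch *qs, size_t nq,
				 unsigned int total_rbs)
{
	unsigned int used = 0;
	size_t i;

	for (i = 0; i < nq; i++)
		used += qs[i].used_count;

	return used >= total_rbs / 2;
}
```

With four queues that each used 10 of 64 RBs, no single queue reaches
the threshold of 32, but the sum of 40 does, so only the second check
enters emergency mode.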

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: set max TX/RX A-MPDU subframes to HE limit
Johannes Berg [Fri, 15 Jun 2018 12:21:53 +0000 (14:21 +0200)]
iwlwifi: mvm: set max TX/RX A-MPDU subframes to HE limit

In mac80211, the default remains for HT, so set the limit to
HE for our driver.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: add more information to HE radiotap
Johannes Berg [Thu, 14 Jun 2018 14:30:52 +0000 (16:30 +0200)]
iwlwifi: mvm: add more information to HE radiotap

For SU/SU-ER/MU PPDUs we have spatial reuse.

For those where it's relevant we also know the pre-FEC
padding factor, PE disambiguity bit, beam change bit
and doppler bit.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: add LDPC-XSYM to HE radiotap data
Johannes Berg [Thu, 14 Jun 2018 12:58:24 +0000 (14:58 +0200)]
iwlwifi: mvm: add LDPC-XSYM to HE radiotap data

Add information about the LDPC extra symbol segment to the HE
data when applicable (not for trigger-based PPDUs).

While at it, clean up the code for UL/DL a bit.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: add TXOP to HE radiotap data
Johannes Berg [Thu, 14 Jun 2018 12:54:38 +0000 (14:54 +0200)]
iwlwifi: mvm: add TXOP to HE radiotap data

We have this data available, so add it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: move HE-MU LTF_NUM parsing to he_phy_data parsing
Johannes Berg [Thu, 14 Jun 2018 12:52:19 +0000 (14:52 +0200)]
iwlwifi: mvm: move HE-MU LTF_NUM parsing to he_phy_data parsing

This code gets shorter if it doesn't have to check all the
conditions, so move it to an appropriate place that has all
of them validated already.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: clean up HE radiotap RU allocation parsing
Johannes Berg [Thu, 14 Jun 2018 12:48:27 +0000 (14:48 +0200)]
iwlwifi: mvm: clean up HE radiotap RU allocation parsing

Split the code out into a separate routine, and move that to be
called inside the previously introduced iwl_mvm_decode_he_phy_data()
function.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: pull some he_phy_data decoding into a separate function
Johannes Berg [Thu, 14 Jun 2018 12:36:22 +0000 (14:36 +0200)]
iwlwifi: mvm: pull some he_phy_data decoding into a separate function

Pull some of the decoding of he_phy_data into a separate function so
we don't need to check over and over again if it's valid.

While at it, fix the UL/DL bit reporting to be for all but trigger-
based frames.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: put HE SIG-B symbols/users data correctly
Johannes Berg [Fri, 15 Jun 2018 07:43:43 +0000 (09:43 +0200)]
iwlwifi: mvm: put HE SIG-B symbols/users data correctly

As detected by Luca during code review when I move this in the
next patch, the code here is putting the data into the wrong
field (flags1 instead of flags2). Fix that.

Fixes: e5721e3f770f ("iwlwifi: mvm: add radiotap data for HE")
Reported-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: minor cleanups to HE radiotap code
Johannes Berg [Thu, 14 Jun 2018 12:17:42 +0000 (14:17 +0200)]
iwlwifi: mvm: minor cleanups to HE radiotap code

Remove a stray empty line, unbreak some lines that aren't
really that long, and move one variable setting into the
initializer to avoid initializing it twice.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: remove unnecessary overload variable
Johannes Berg [Thu, 14 Jun 2018 12:07:49 +0000 (14:07 +0200)]
iwlwifi: mvm: remove unnecessary overload variable

This is equivalent to checking he_phy_data != HE_PHY_DATA_INVAL,
which is already done in a number of places, so remove the extra
'overload' variable entirely.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: mvm: clear HW_RESTART_REQUESTED when stopping the interface
Emmanuel Grumbach [Wed, 13 Jun 2018 08:49:20 +0000 (11:49 +0300)]
iwlwifi: mvm: clear HW_RESTART_REQUESTED when stopping the interface

Fix a bug that happens in the following scenario:
1) suspend without WoWLAN
2) mac80211 calls drv_stop because of the suspend
3) __iwl_mvm_mac_stop deallocates the aux station
4) during drv_stop the firmware crashes
5) iwlmvm:
* sets IWL_MVM_STATUS_HW_RESTART_REQUESTED
* asks mac80211 to kick the restart flow
6) mac80211 puts the restart worker into a freezable
   queue which means that the worker will not run for now
   since the workqueue is already frozen
7) ...
8) resume
9) mac80211 runs ieee80211_reconfig as part of the resume
10) mac80211 detects that a restart flow has been requested
    and that we are now resuming from suspend and cancels
    the restart worker
11) mac80211 calls drv_start()
12) __iwl_mvm_mac_start checks that IWL_MVM_STATUS_HW_RESTART_REQUESTED
    is set, clears it, sets IWL_MVM_STATUS_IN_HW_RESTART and calls
    iwl_mvm_restart_cleanup()
13) iwl_fw_error_dump gets called and accesses the device
    to get debug data
14) iwl_mvm_up adds the aux station
15) iwl_mvm_add_aux_sta() allocates an internal station for
    the aux station
16) iwl_mvm_allocate_int_sta() tests IWL_MVM_STATUS_IN_HW_RESTART
    and doesn't really allocate a station ID for the aux
    station
17) a new queue is added for the aux station

Note that steps from 5 to 9 aren't really part of the
problem but were described for the sake of completeness.

Once the iwl_mvm_mac_stop() is called, the device is not
accessible, meaning that step 12) can't succeed and we'll
see the following:

drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2122 iwl_trans_pcie_grab_nic_access+0xc0/0x1d6 [iwlwifi]()
Timeout waiting for hardware access (CSR_GP_CNTRL 0x080403d8)
Call Trace:
[<ffffffffc03e6ad3>] iwl_trans_pcie_grab_nic_access+0xc0/0x1d6 [iwlwifi]
[<ffffffffc03e6a13>] iwl_trans_pcie_dump_regs+0x3fd/0x3fd [iwlwifi]
[<ffffffffc03dad42>] iwl_fw_error_dump+0x4f5/0xe8b [iwlwifi]
[<ffffffffc04bd43e>] __iwl_mvm_mac_start+0x5a/0x21a [iwlmvm]
[<ffffffffc04bd6d2>] iwl_mvm_mac_start+0xd4/0x103 [iwlmvm]
[<ffffffffc042d378>] drv_start+0xa1/0xc5 [iwl7000_mac80211]
[<ffffffffc045a339>] ieee80211_reconfig+0x145/0xf50 [mac80211]
[<ffffffffc044788b>] ieee80211_resume+0x62/0x66 [mac80211]
[<ffffffffc0366c5b>] wiphy_resume+0xa9/0xc6 [cfg80211]

The station id of the aux station is set to 0xff in step 3
and because we don't really allocate a new station id for
the auxiliary station (as explained in 16), we end up sending
a command to the firmware asking to connect the queue
to station id 0xff. This makes the firmware crash with the
following information:

0x00002093 | ADVANCED_SYSASSERT
0x000002F0 | trm_hw_status0
0x00000000 | trm_hw_status1
0x00000B38 | branchlink2
0x0001978C | interruptlink1
0x00000000 | interruptlink2
0xFF080501 | data1
0xDEADBEEF | data2
0xDEADBEEF | data3
Firmware error during reconfiguration - reprobe!
FW error in SYNC CMD SCD_QUEUE_CFG

Fix this by clearing IWL_MVM_STATUS_HW_RESTART_REQUESTED
in iwl_mvm_mac_stop(). We won't be able to collect debug
data anyway and when we are brought up again, we will
have a clean state from the firmware perspective.
Since we won't have IWL_MVM_STATUS_IN_HW_RESTART set in
step 12) we won't get to the 2093 ASSERT either.

Fixes: bf8b286f86fc ("iwlwifi: mvm: defer setting IWL_MVM_STATUS_IN_HW_RESTART")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years ago iwlwifi: dbg: group trigger condition to helper function
Sara Sharon [Tue, 12 Jun 2018 07:41:35 +0000 (10:41 +0300)]
iwlwifi: dbg: group trigger condition to helper function

The triplet of get trigger, is trigger enabled and is trigger stopped
repeats itself.  Group them in a function to avoid code duplication.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: dbg: dump memory in a helper function
Sara Sharon [Tue, 12 Jun 2018 11:34:32 +0000 (14:34 +0300)]
iwlwifi: dbg: dump memory in a helper function

The code that dumps various memory types repeats itself.  Move it to a
function to avoid duplication.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: mvm: allow channel reorder optimization during scan
Ayala Beker [Thu, 17 May 2018 07:05:17 +0000 (10:05 +0300)]
iwlwifi: mvm: allow channel reorder optimization during scan

Allow the FW to reorder HB channels and first scan HB channels with
assumed APs, in order to reduce the scan duration.

Currently enable it for all scan requests types.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: dbg: split iwl_fw_error_dump to two functions
Sara Sharon [Mon, 11 Jun 2018 12:30:07 +0000 (15:30 +0300)]
iwlwifi: dbg: split iwl_fw_error_dump to two functions

Split iwl_fw_error_dump to two parts.  The first part will dump the
actual data, and second will do the file allocations, trans calls and
actual file operations.  This is done in order to enable reuse of the
code for the new debug ini infrastructure.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: dbg: refactor dump code to improve readability
Sara Sharon [Mon, 11 Jun 2018 09:43:26 +0000 (12:43 +0300)]
iwlwifi: dbg: refactor dump code to improve readability

Add a macro to replace all the conditions checking for valid dump
length.  In addition, move the fifo len calculation to a helper
function.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: dbg: move debug data to a struct
Sara Sharon [Mon, 11 Jun 2018 08:43:09 +0000 (11:43 +0300)]
iwlwifi: dbg: move debug data to a struct

The debug variables are bloating the iwl_fw struct.  And the fields
are out of order, missing docs and some are redundant.

Clean this up.  This serves as preparation for unionizing it for the
new ini infra.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoiwlwifi: mvm: check for n_profiles validity in EWRD ACPI
Luca Coelho [Mon, 11 Jun 2018 08:15:17 +0000 (11:15 +0300)]
iwlwifi: mvm: check for n_profiles validity in EWRD ACPI

When reading the profiles from the EWRD table in ACPI, we loop over
the data and set it into our internal table.  We use the number of
profiles specified in ACPI without checking its validity, so if the
ACPI table is corrupted and the number is larger than our array size,
we will try to make an out-of-bounds access.

Fix this by making sure the value specified in the ACPI table is
valid.

Fixes: 6996490501ed ("iwlwifi: mvm: add support for EWRD (Dynamic SAR) ACPI table")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
6 years agoMerge branch 'akpm'
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:33:03 +0000 (16:33 -0700)]
Merge branch 'akpm'

* akpm:
  mm: madvise(MADV_DODUMP): allow hugetlbfs pages
  ocfs2: fix locking for res->tracking and dlm->tracking_list
  mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
  mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
  mm/vmstat.c: fix outdated vmstat_text
  proc: restrict kernel stack dumps to root
  mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
  mm/migrate.c: split only transparent huge pages when allocation fails
  ipc/shm.c: use ERR_CAST() for shm_lock() error return
  mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
  mm, thp: fix mlocking THP page with migration enabled
  ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
  hugetlb: take PMD sharing into account when flushing tlb/caches
  mm: migration: fix migration of huge PMD shared pages

6 years agomm: madvise(MADV_DODUMP): allow hugetlbfs pages
Daniel Black [Fri, 5 Oct 2018 22:52:19 +0000 (15:52 -0700)]
mm: madvise(MADV_DODUMP): allow hugetlbfs pages

Reproducer, assuming 2M of hugetlbfs available:

Hugetlbfs mounted, size=2M and option user=testuser

  # mount | grep ^hugetlbfs
  hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
  # sysctl vm.nr_hugepages=1
  vm.nr_hugepages = 1
  # grep Huge /proc/meminfo
  AnonHugePages:         0 kB
  ShmemHugePages:        0 kB
  HugePages_Total:       1
  HugePages_Free:        1
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  Hugetlb:            2048 kB

Code:

  #include <sys/mman.h>
  #include <stddef.h>
  #define SIZE (2*1024*1024)
  int main(void)
  {
    void *ptr;
    ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
    madvise(ptr, SIZE, MADV_DONTDUMP);
    madvise(ptr, SIZE, MADV_DODUMP);
    return 0;
  }

Compile and strace:

  mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
  madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
  madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

hugetlbfs pages have VM_DONTEXPAND set in their VmFlags, as driver pages
do, based on author testing with analysis from Florian Weimer[1].

The inclusion of VM_DONTEXPAND into the VM_SPECIAL definition was a
consequence of the large usage of VM_DONTEXPAND in device drivers.

A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
marked DODUMP.

A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
memory for a while and later request madvise(MADV_DODUMP) on the same
memory.  We correct this omission by allowing madvise(MADV_DODUMP) on
hugetlbfs pages.

[1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
[2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
Link: https://lists.launchpad.net/maria-discuss/msg05245.html
Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
Reported-by: Kenneth Penza <kpenza@gmail.com>
Signed-off-by: Daniel Black <daniel@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoocfs2: fix locking for res->tracking and dlm->tracking_list
Ashish Samant [Fri, 5 Oct 2018 22:52:15 +0000 (15:52 -0700)]
ocfs2: fix locking for res->tracking and dlm->tracking_list

In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock.  This can cause list
corruptions and can end up in kernel panic.

Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.

Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/vmscan.c: fix int overflow in callers of do_shrink_slab()
Kirill Tkhai [Fri, 5 Oct 2018 22:52:10 +0000 (15:52 -0700)]
mm/vmscan.c: fix int overflow in callers of do_shrink_slab()

do_shrink_slab() returns an unsigned long value, and storing it in an int
variable cuts the high bytes off.  Then we compare ret and 0xfffffffe (since
SHRINK_EMPTY is converted to ret type).

Thus a large number of objects returned by do_shrink_slab() may be
interpreted as SHRINK_EMPTY, if low bytes of their value are equal to
0xfffffffe.  Fix that by declaring ret as unsigned long in these
functions.

Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
Jann Horn [Fri, 5 Oct 2018 22:52:07 +0000 (15:52 -0700)]
mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly

5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_array, meaning that all following counters would
be shown with incorrect values.

This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/vmstat.c: fix outdated vmstat_text
Jann Horn [Fri, 5 Oct 2018 22:52:03 +0000 (15:52 -0700)]
mm/vmstat.c: fix outdated vmstat_text

7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text.  This causes an out-of-bounds access in
vmstat_show().

Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.

Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoproc: restrict kernel stack dumps to root
Jann Horn [Fri, 5 Oct 2018 22:51:58 +0000 (15:51 -0700)]
proc: restrict kernel stack dumps to root

Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
Anshuman Khandual [Fri, 5 Oct 2018 22:51:54 +0000 (15:51 -0700)]
mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes

ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes.  This
just adds mmap() system call argument encoding for them.

Link: http://lkml.kernel.org/r/1537841300-6979-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/migrate.c: split only transparent huge pages when allocation fails
Anshuman Khandual [Fri, 5 Oct 2018 22:51:51 +0000 (15:51 -0700)]
mm/migrate.c: split only transparent huge pages when allocation fails

split_huge_page_to_list() fails on HugeTLB pages.  I was experimenting
with moving 32MB contig HugeTLB pages on arm64 (with a debug patch
applied) and hit the following stack trace when the kernel crashed.

[ 3732.462797] Call trace:
[ 3732.462835]  split_huge_page_to_list+0x3b0/0x858
[ 3732.462913]  migrate_pages+0x728/0xc20
[ 3732.462999]  soft_offline_page+0x448/0x8b0
[ 3732.463097]  __arm64_sys_madvise+0x724/0x850
[ 3732.463197]  el0_svc_handler+0x74/0x110
[ 3732.463297]  el0_svc+0x8/0xc
[ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420)

When unmap_and_move[_huge_page]() fails due to lack of memory, the
splitting should happen only for transparent huge pages not for HugeTLB
pages.  PageTransHuge() returns true for both THP and HugeTLB pages.
Hence the conditional check should test the PageHuge() flag to make sure
that the given page is not a HugeTLB one.

Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com
Fixes: 94723aafb9 ("mm: unclutter THP migration")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipc/shm.c: use ERR_CAST() for shm_lock() error return
Kees Cook [Fri, 5 Oct 2018 22:51:48 +0000 (15:51 -0700)]
ipc/shm.c: use ERR_CAST() for shm_lock() error return

This uses ERR_CAST() instead of an open-coded cast, as it is casting
across structure pointers, which upsets __randomize_layout:

ipc/shm.c: In function `shm_lock':
ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'

  return (void *)ipcp;
         ^~~~~~~~~~~~

Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
Fixes: 82061c57ce93 ("ipc: drop ipc_lock()")
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
YueHaibing [Fri, 5 Oct 2018 22:51:44 +0000 (15:51 -0700)]
mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl

get_user_pages_fast() will return a negative value if no pages were
pinned.  That value is then converted to an unsigned type and compared to
zero, giving the wrong result.

Link: http://lkml.kernel.org/r/20180921095015.26088-1-yuehaibing@huawei.com
Fixes: 09e35a4a1ca8 ("mm/gup_benchmark: handle gup failures")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm, thp: fix mlocking THP page with migration enabled
Kirill A. Shutemov [Fri, 5 Oct 2018 22:51:41 +0000 (15:51 -0700)]
mm, thp: fix mlocking THP page with migration enabled

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, we want the rest of the
page to be reclaimable.

We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
PMD on border of VM_LOCKED VMA will be split into PTE table.

Introduction of THP migration breaks[1] the rules around mlocking THP
pages.  If we had a single PMD mapping of the page in mlocked VMA, the
page will get mlocked, regardless of PTE mappings of the page.

For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
remove_migration_pmd().

Anon THP pages can only be shared between processes via fork().  Mlocked
page can only be shared if parent mlocked it before forking, otherwise CoW
will be triggered on mlock().

For Anon-THP, we can fix the issue by munlocking the page on removing PTE
migration entry for the page.  PTEs for the page will always come after
mlocked PMD: rmap walks VMAs from oldest to newest.

Test-case:

#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/mempolicy.h>
#include <numaif.h>

int main(void)
{
        unsigned long nodemask = 4;
        void *addr;

        addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

        if (fork()) {
                wait(NULL);
                return 0;
        }

        mlock(addr, 4UL << 10);
        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
                &nodemask, 4, MPOL_MF_MOVE);

        return 0;
}

[1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
Larry Chen [Fri, 5 Oct 2018 22:51:37 +0000 (15:51 -0700)]
ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()

ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
is dirty.  When a page has not been written back, it is still in dirty
state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
page, the crash happens.

To fix this bug, we can just unlock the page and wait until it is no
longer dirty.

The following is the backtrace:

kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
[exception RIP: ocfs2_duplicate_clusters_by_page+822]
__ocfs2_move_extent+0x80/0x450 [ocfs2]
? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
__ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
ocfs2_move_extents+0x180/0x3b0 [ocfs2]
? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
ocfs2_ioctl+0x253/0x640 [ocfs2]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Once we find the page is dirty, we do not wait until it's clean; rather, we
use write_one_page() to write it back.

Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
[lchen@suse.com: update comments]
Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Larry Chen <lchen@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agohugetlb: take PMD sharing into account when flushing tlb/caches
Mike Kravetz [Fri, 5 Oct 2018 22:51:33 +0000 (15:51 -0700)]
hugetlb: take PMD sharing into account when flushing tlb/caches

When fixing an issue with PMD sharing and migration, it was discovered via
code inspection that other callers of huge_pmd_unshare potentially have an
issue with cache and tlb flushing.

Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst
case ranges for mmu notifiers.  Ensure that this range is flushed if
huge_pmd_unshare succeeds and unmaps a PUD_SIZE area.

Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm: migration: fix migration of huge PMD shared pages
Mike Kravetz [Fri, 5 Oct 2018 22:51:29 +0000 (15:51 -0700)]
mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoMerge tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git...
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:11:16 +0000 (16:11 -0700)]
Merge tag 'pci-v4.19-fixes-3' of ssh://gitolite./linux/kernel/git/helgaas/pci

Bjorn writes:
  "PCI fixes for v4.19:

   - Reprogram bridge prefetch registers to fix NVIDIA and Radeon issues
     after suspend/resume (Daniel Drake)

   - Fix mvebu I/O mapping creation sequence (Thomas Petazzoni)

   - Fix minor MAINTAINERS file match issue (Bjorn Helgaas)"

* tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI: mvebu: Fix PCI I/O mapping creation sequence
  MAINTAINERS: Remove obsolete drivers/pci pattern from ACPI section
  PCI: Reprogram bridge prefetch registers on resume

6 years agoMerge tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:09:56 +0000 (16:09 -0700)]
Merge tag 'for-4.19/dm-fixes-2' of git://git./linux/kernel/git/device-mapper/linux-dm

Mike writes:
  "device mapper fixes

   - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
     4.19 merge window.

   - Fix leak and dangling pointer in DM multipath's scsi_dh related code.

   - A couple stable@ fixes for DM cache's resize support.

   - A DM raid fix to remove "const" from decipher_sync_action()'s return
     type."

* tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix resize crash if user doesn't reload cache table
  dm cache metadata: ignore hints array being too small during resize
  dm raid: remove bogus const from decipher_sync_action() return type
  dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
  dm thin metadata: fix __udivdi3 undefined on 32-bit

6 years agoMerge tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:09:11 +0000 (16:09 -0700)]
Merge tag 'gpio-v4.19-3' of git://git./linux/kernel/git/linusw/linux-gpio

Linus writes:
  "A single GPIO fix:
   Free the last used descriptor, an off by one error.
   This is tagged for stable as well."

* tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpiolib: Free the last requested descriptor

6 years agoMerge tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:08:12 +0000 (16:08 -0700)]
Merge tag 'pm-4.19-rc7' of git://git./linux/kernel/git/rafael/linux-pm

Rafael writes:
  "Power management fix for 4.19-rc7

   Fix a bug that may cause runtime PM to misbehave for some devices
   after a failing or aborted system suspend which is nasty enough for
   an -rc7 time frame fix."

* tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / core: Clear the direct_complete flag on errors

6 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Fri, 5 Oct 2018 23:07:13 +0000 (16:07 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Ingo writes:
  "perf fixes:
    - fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
    - fix a CPU event enumeration bug in the x86 AMD PMU driver
    - fix a perf ring-buffer corruption bug when using tracepoints
    - fix a PMU unregister locking bug"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
  perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
  perf/ring_buffer: Prevent concurent ring buffer access
  perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
  perf/core: Fix perf_pmu_unregister() locking

6 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Fri, 5 Oct 2018 22:40:57 +0000 (15:40 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Ingo writes:
  "x86 fixes:

   Misc fixes:

    - fix various vDSO bugs: asm constraints and retpolines
    - add vDSO test units to make sure they never re-appear
    - fix UV platform TSC initialization bug
    - fix build warning on Clang"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Fix vDSO syscall fallback asm constraint regression
  x86/cpu/amd: Remove unnecessary parentheses
  x86/vdso: Only enable vDSO retpolines when enabled and supported
  x86/tsc: Fix UV TSC initialization
  x86/platform/uv: Provide is_early_uv_system()
  selftests/x86: Add clock_gettime() tests to test_vdso
  x86/vdso: Fix asm constraints on vDSO syscall fallbacks

6 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Fri, 5 Oct 2018 22:39:38 +0000 (15:39 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Ingo writes:
  "scheduler fixes:

   These fixes address a rather involved performance regression between
   v4.17->v4.19 in the sched/numa auto-balancing code. Since distros
   really need this fix we accelerated it to sched/urgent for a faster
   upstream merge.

   NUMA scheduling and balancing performance is now largely back to
   v4.17 levels, without reintroducing the NUMA placement bugs that
   v4.18 and v4.19 fixed.

   Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for
   reporting, testing, re-testing and solving this rather complex set of
   bugs."

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
  mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
  sched/numa: Avoid task migration for small NUMA improvement
  mm/migrate: Use spin_trylock() while resetting rate limit
  sched/numa: Limit the conditions where scan period is reset
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Pass destination CPU as a parameter to migrate_task_rq
  sched/numa: Stop multiple tasks from moving to the CPU at the same time

6 years agoMerge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Fri, 5 Oct 2018 22:38:32 +0000 (15:38 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Ingo writes:
  "locking fixes:

   A fix for a scary splat produced by the ww_mutex self-test, plus an
   update to the maintained-file patterns in MAINTAINERS."

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/ww_mutex: Fix runtime warning in the WW mutex selftest
  MAINTAINERS: Remove dead path from LOCKING PRIMITIVES entry

6 years agoMerge tag 'sound-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Greg Kroah-Hartman [Fri, 5 Oct 2018 22:37:22 +0000 (15:37 -0700)]
Merge tag 'sound-4.19-rc7' of git://git./linux/kernel/git/tiwai/sound

Takashi writes:
  "sound fixes for 4.19-rc7

   Just two small fixes for HD-audio: one is for a typo in completion
   timeout, and another a fixup for Dell machines as usual"

* tag 'sound-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek - Cannot adjust speaker's volume on Dell XPS 27 7760
  ALSA: hda: Fix the audio-component completion timeout

6 years agonet/ncsi: Add NCSI OEM command support
Vijay Khemka [Fri, 5 Oct 2018 17:46:01 +0000 (10:46 -0700)]
net/ncsi: Add NCSI OEM command support

This patch adds OEM command and response handling. It also defines the
OEM command and response structures as per the NCSI specification, along
with their handlers.

ncsi_cmd_handler_oem: This is a generic command request handler for OEM
commands
ncsi_rsp_handler_oem: This is a generic response handler for OEM commands

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Justin Lee <justin.lee1@dell.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoselftests: net: Clean up an unused variable
Jakub Sitnicki [Fri, 5 Oct 2018 08:19:57 +0000 (10:19 +0200)]
selftests: net: Clean up an unused variable

Address compiler warning:

ip_defrag.c: In function 'send_udp_frags':
ip_defrag.c:206:16: warning: unused variable 'udphdr' [-Wunused-variable]
  struct udphdr udphdr;
                ^~~~~~

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mvpp2: Extract the correct ethtype from the skb for tx csum offload
Maxime Chevallier [Fri, 5 Oct 2018 07:04:40 +0000 (09:04 +0200)]
net: mvpp2: Extract the correct ethtype from the skb for tx csum offload

When offloading the L3 and L4 csum computation on TX, we need to extract
the l3_proto from the ethtype, independently of the presence of a vlan
tag.

The current driver uses skb->protocol as-is, resulting in packets with
the wrong L4 checksum being sent when there's a vlan tag in the packet
header and checksum offloading is enabled.

This commit makes use of vlan_get_protocol() to get the correct ethtype
regardless of the presence of a vlan tag.
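
The idea can be illustrated outside the kernel with a minimal sketch that
walks past any VLAN tags to find the encapsulated EtherType. The helper name
frame_l3_proto() and the SKETCH_* constants are hypothetical and are not the
driver's code; the driver relies on the kernel's own helper instead.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_8021Q  0x8100 /* 802.1Q VLAN tag */
#define SKETCH_ETH_P_8021AD 0x88A8 /* 802.1ad service tag */

/*
 * Return the EtherType of the payload of a raw Ethernet frame, skipping
 * any stacked VLAN tags. Returns 0 if the frame is too short.
 */
uint16_t frame_l3_proto(const uint8_t *frame, size_t len)
{
    size_t off = 12; /* destination MAC (6) + source MAC (6) */

    while (off + 2 <= len) {
        uint16_t type = (uint16_t)((frame[off] << 8) | frame[off + 1]);

        if (type != SKETCH_ETH_P_8021Q && type != SKETCH_ETH_P_8021AD)
            return type;
        off += 4; /* skip the TPID (2) and the TCI (2) */
    }
    return 0;
}
```

Reading the two bytes at offset 12 alone would yield 0x8100 for a tagged
frame, which is the mistake described above.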

Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoipv6: take rcu lock in rawv6_send_hdrinc()
Wei Wang [Thu, 4 Oct 2018 17:12:37 +0000 (10:12 -0700)]
ipv6: take rcu lock in rawv6_send_hdrinc()

In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we
directly assign the dst to the skb and set the passed-in dst to NULL to
avoid a double free.
However, in the error case, we free the skb and then do the stats update
with the dst pointer passed in. This causes a use-after-free on the dst.
Fix it by taking the rcu read lock right before the dst could get
released, to make sure the dst does not get freed until the stats update
is done.
Note: we don't have this issue in ipv4 because the dst is not used for
the stats update in v4.

Syzkaller reported the following crash:
BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
Read of size 8 at addr ffff8801d95ba730 by task syz-executor0/32088

CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
 rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
 __sys_sendmsg+0x11d/0x280 net/socket.c:2152
 __do_sys_sendmsg net/socket.c:2161 [inline]
 __se_sys_sendmsg net/socket.c:2159 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457099
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f83756edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f83756ee6d4 RCX: 0000000000457099
RDX: 0000000000000000 RSI: 0000000020003840 RDI: 0000000000000004
RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004d4b30 R14: 00000000004c90b1 R15: 0000000000000000

Allocated by task 32088:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
 dst_alloc+0xbb/0x1d0 net/core/dst.c:105
 ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353
 ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186
 ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093
 fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121
 ip6_route_output include/net/ip6_route.h:88 [inline]
 ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
 __sys_sendmsg+0x11d/0x280 net/socket.c:2152
 __do_sys_sendmsg net/socket.c:2161 [inline]
 __se_sys_sendmsg net/socket.c:2159 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 5356:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kmem_cache_free+0x83/0x290 mm/slab.c:3756
 dst_destroy+0x267/0x3c0 net/core/dst.c:141
 dst_destroy_rcu+0x16/0x19 net/core/dst.c:154
 __rcu_reclaim kernel/rcu/rcu.h:236 [inline]
 rcu_do_batch kernel/rcu/tree.c:2576 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
 rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
 __do_softirq+0x30b/0xad8 kernel/softirq.c:292

Fixes: 1789a640f556 ("raw: avoid two atomics in xmit")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: use a plugin to build eBPF program
Davide Caratti [Thu, 4 Oct 2018 16:34:39 +0000 (18:34 +0200)]
tc-testing: use a plugin to build eBPF program

Use a TDC plugin, instead of building eBPF programs in the 'setup' stage.
The '-B' argument can be used to build eBPF programs in the $EBPFDIR
directory, in the 'pre-suite' stage. Binaries are then cleaned in the
'post-suite' stage.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: fix build of eBPF programs
Davide Caratti [Thu, 4 Oct 2018 16:34:38 +0000 (18:34 +0200)]
tc-testing: fix build of eBPF programs

rely on uAPI headers in the current kernel tree, rather than requiring the
correct version installed on the test system. While at it, group all
sections in a single binary and test the 'section' parameter.

Reported-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'mscc-ocelot-add-support-for-SerDes-muxing-configuration'
David S. Miller [Fri, 5 Oct 2018 21:36:45 +0000 (14:36 -0700)]
Merge branch 'mscc-ocelot-add-support-for-SerDes-muxing-configuration'

Quentin Schulz says:

====================
mscc: ocelot: add support for SerDes muxing configuration

The Ocelot switch currently has a hardcoded SerDes muxing that suits only
a particular use case. Any other board setup will fail to work.

To prepare for upcoming boards that do not have the same muxing,
create a PHY driver that will handle all possible cases.

A SerDes can work in SGMII, QSGMII or PCIe and is also muxed to use a
given port depending on the selected mode or board design.

The SerDes configuration is in the middle of an address space (HSIO) that
is used to configure some parts in the MAC controller driver, that is why
we need to use a syscon so that we can write to the same address space from
different drivers safely using regmap.

This breaks backward compatibility but it's fine because there's only one
board at the moment that is using what's modified in this patch series.
This will break git bisect.

Even though this patch series is about SerDes __muxing__ configuration, the
DT node is named serdes for the simple reason that I couldn't find any
mention of SerDes anywhere else in the address space handled by this
driver.

v4:
  - add reviewed-by,
  - format the patch series with -M for identifying renamed files,
  - add parent info in DT binding of the SerDes IP,
  - move to macros SERDES[16]G(X) instead of multiple SERDES[16]G_[012345]
  constants,
  - move to SERDES[16]G_MAX being the last VALID macro of a type, so
  migrate to <= conditions instead of < when iterating,
  - create a SERDES_MUX_SGMII and SERDES_MUX_QSGMII macro so the muxing
  configurations are a tad more readable,
  - use a bunch of unsigned int instead of int,
  - return -EOPNOTSUPP for SERDES6G/PCIe until it's supported,
  - simplify condition when there is an error code returned by
  devm_of_phy_get,
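
The SERDES[16]G(X) and *_MAX conventions from the v4 notes can be sketched
as follows. The lane counts used here (six 1G lanes, three 6G lanes) and the
helper functions are assumptions made for illustration only, not values
taken from this series.

```c
/* Hypothetical numbering: 1G lanes first, 6G lanes right after them. */
#define SERDES1G(x)  (x)
#define SERDES1G_MAX SERDES1G(5)              /* last VALID 1G index */
#define SERDES6G(x)  (SERDES1G_MAX + 1 + (x))
#define SERDES6G_MAX SERDES6G(2)              /* last VALID 6G index */
#define SERDES_MAX   SERDES6G_MAX

/* Count all lanes; note the <= since SERDES_MAX is itself a valid index. */
unsigned int count_serdes(void)
{
    unsigned int i, n = 0;

    for (i = 0; i <= SERDES_MAX; i++)
        n++;
    return n;
}

/* Return nonzero if idx designates a 1G lane. */
int is_serdes1g(unsigned int idx)
{
    return idx <= SERDES1G_MAX;
}
```

Because each *_MAX names the last valid macro rather than a count, a loop
written with < would silently skip the final lane.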

v3:
  - add Paul Burton's Acked-By on MIPS patches so that the patch series can
  be merged in the net tree in its entirety,

v2:
  - use a switch case for setting the phy_mode in the SerDes driver as
  suggested by Andrew,
  - stop replacing the value of the error pointer in the SerDes driver,
  - use a dev_dbg for the deferring of the probe in the SerDes driver,
  - use constants in the Device Tree to select the SerDes macro in use with
  a port,
  - adapt the SerDes driver to use those constants,
  - add a header file in include/dt-bindings for the constants,
  - fix space/tab issue,
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mscc: ocelot: make use of SerDes PHYs for handling their configuration
Quentin Schulz [Thu, 4 Oct 2018 12:22:08 +0000 (14:22 +0200)]
net: mscc: ocelot: make use of SerDes PHYs for handling their configuration

Previously, the SerDes muxing was hardcoded to a given mode in the MAC
controller driver. Now, the SerDes muxing is configured within the
Device Tree and is enforced in the MAC controller driver so we can have
a lot of different SerDes configurations.

Make use of the SerDes PHYs in the MAC controller to set up the SerDes
according to the SerDes<->switch port mapping and the communication mode
with the Ethernet PHY.

Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agophy: add driver for Microsemi Ocelot SerDes muxing
Quentin Schulz [Thu, 4 Oct 2018 12:22:07 +0000 (14:22 +0200)]
phy: add driver for Microsemi Ocelot SerDes muxing

The Microsemi Ocelot can mux SerDes lanes (aka macros) to different
switch ports or even make it act as a PCIe interface.

This adds support for the muxing of the SerDes.

Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agodt-bindings: add constants for Microsemi Ocelot SerDes driver
Quentin Schulz [Thu, 4 Oct 2018 12:22:06 +0000 (14:22 +0200)]
dt-bindings: add constants for Microsemi Ocelot SerDes driver

The Microsemi Ocelot has multiple SerDes and requires that the SerDes be
muxed according to the hardware representation.

Let's add a constant for each SerDes available in the Microsemi Ocelot.

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMIPS: mscc: ocelot: add SerDes mux DT node
Quentin Schulz [Thu, 4 Oct 2018 12:22:05 +0000 (14:22 +0200)]
MIPS: mscc: ocelot: add SerDes mux DT node

The Microsemi Ocelot has a set of registers for SerDes/switch port muxing
as well as PCIe muxing for a specific SerDes, so let's add the device
and all SerDes in the Device Tree.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Paul Burton <paul.burton@mips.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agodt-bindings: phy: add DT binding for Microsemi Ocelot SerDes muxing
Quentin Schulz [Thu, 4 Oct 2018 12:22:04 +0000 (14:22 +0200)]
dt-bindings: phy: add DT binding for Microsemi Ocelot SerDes muxing

Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agophy: add QSGMII and PCIE modes
Quentin Schulz [Thu, 4 Oct 2018 12:22:03 +0000 (14:22 +0200)]
phy: add QSGMII and PCIE modes

Prepare for upcoming phys that'll handle QSGMII or PCIe.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mscc: ocelot: simplify register access for PLL5 configuration
Quentin Schulz [Thu, 4 Oct 2018 12:22:02 +0000 (14:22 +0200)]
net: mscc: ocelot: simplify register access for PLL5 configuration

Since HSIO address space can be accessed by different drivers, let's
simplify the register address definitions so that it can be easily used
by all drivers and put the register address definition in the
include/soc/mscc/ocelot_hsio.h header file.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mscc: ocelot: move the HSIO header to include/soc
Quentin Schulz [Thu, 4 Oct 2018 12:22:01 +0000 (14:22 +0200)]
net: mscc: ocelot: move the HSIO header to include/soc

Since HSIO address space can be used by different drivers (PLL, SerDes
muxing, temperature sensor), let's move it somewhere it can be included
by all drivers.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mscc: ocelot: get HSIO regmap from syscon
Quentin Schulz [Thu, 4 Oct 2018 12:22:00 +0000 (14:22 +0200)]
net: mscc: ocelot: get HSIO regmap from syscon

HSIO address space was moved to a syscon, hence we need to get the
regmap of this address space from there and no longer from the device
node.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agodt-bindings: net: ocelot: remove hsio from the list of register address spaces
Quentin Schulz [Thu, 4 Oct 2018 12:21:59 +0000 (14:21 +0200)]
dt-bindings: net: ocelot: remove hsio from the list of register address spaces

HSIO register address space should be handled outside of the MAC
controller as there are some registers for PLL5 configuration,
SerDes/switch port muxing and a thermal sensor IP, so let's remove it.

Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMIPS: mscc: ocelot: make HSIO registers address range a syscon
Quentin Schulz [Thu, 4 Oct 2018 12:21:58 +0000 (14:21 +0200)]
MIPS: mscc: ocelot: make HSIO registers address range a syscon

HSIO contains registers for PLL5 configuration, SerDes/switch port
muxing and a thermal sensor, hence we can't keep it in the switch DT
node.

Acked-by: Paul Burton <paul.burton@mips.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Quentin Schulz <quentin.schulz@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agosocket: Tighten no-error check in bind()
Jakub Sitnicki [Thu, 4 Oct 2018 09:09:40 +0000 (11:09 +0200)]
socket: Tighten no-error check in bind()

move_addr_to_kernel() returns only negative values on error, or zero on
success. Rewrite the error check to an idiomatic form to avoid confusing
the reader.
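
A minimal userspace sketch of the idiom follows. copy_addr() is a
hypothetical stand-in for move_addr_to_kernel() with the same return
contract (zero on success, negative errno on error); do_bind_sketch() is
likewise invented for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 0 on success, a negative errno on error. */
int copy_addr(const void *src, size_t len, void *dst, size_t cap)
{
    if (len > cap)
        return -EINVAL;
    memcpy(dst, src, len);
    return 0;
}

int do_bind_sketch(const void *addr, size_t len)
{
    char storage[128];
    int err = copy_addr(addr, len, storage, sizeof(storage));

    /*
     * Idiomatic form. Writing "if (err >= 0)" would suggest a positive
     * return value that the helper never produces.
     */
    if (!err) {
        /* ... continue with the bind using storage ... */
    }
    return err;
}
```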

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoatm: nicstar: Replace spin_is_locked() with spin_trylock()
Lance Roy [Thu, 4 Oct 2018 07:46:57 +0000 (00:46 -0700)]
atm: nicstar: Replace spin_is_locked() with spin_trylock()

ns_poll() used spin_is_locked() + spin_lock() to achieve the same
thing as a spin_trylock(), so simplify it by using that instead. This is
also a step towards possibly removing spin_is_locked().
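
The pattern can be sketched in userspace with a C11 atomic_flag standing in
for the spinlock. The sketch_lock type and the poll_once() function are
hypothetical and only mirror the shape of the change, not the driver code.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag held;
} sketch_lock;

/* Try to take the lock in one atomic step; true means we now own it. */
bool sketch_trylock(sketch_lock *l)
{
    /* test_and_set returns the previous state: false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void sketch_unlock(sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Returns true if the poll body ran, false if the lock was busy. */
bool poll_once(sketch_lock *l, int *work_done)
{
    if (!sketch_trylock(l))
        return false; /* someone else holds the lock, try again later */

    (*work_done)++;
    sketch_unlock(l);
    return true;
}
```

Checking "is it locked?" and then locking are two separate steps, so the
lock state can change between them; a trylock folds both into one atomic
operation.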

Signed-off-by: Lance Roy <ldr709@gmail.com>
Cc: Chas Williams <3chas3@gmail.com>
Cc: <linux-atm-general@lists.sourceforge.net>
Cc: <netdev@vger.kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: Add policy validation for tc attributes
David Ahern [Wed, 3 Oct 2018 22:05:36 +0000 (15:05 -0700)]
net: sched: Add policy validation for tc attributes

A number of TC attributes are processed without proper validation
(e.g., length checks). Add a tca policy for all input attributes and use
when invoking nlmsg_parse.

The 2 Fixes tags below cover the latest additions. The other attributes
are a string (KIND), a nested attribute (OPTIONS, which does seem to have
validation in most cases), used for dumps only, or a flag.
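
The general idea of a policy table can be sketched as below. The attribute
names, the policy types and validate_attr() are all hypothetical and much
simpler than the kernel's nla_policy machinery; in particular, the exact
length rule for u32 is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute types for the sketch. */
enum { ATTR_UNSPEC, ATTR_KIND, ATTR_CHAIN, ATTR_MAX = ATTR_CHAIN };

enum policy_type { POL_NONE, POL_U32, POL_STRING };

struct policy {
    enum policy_type type;
    size_t max_len; /* only meaningful for POL_STRING */
};

/* One policy entry per attribute type, indexed by the type value. */
static const struct policy sketch_policy[ATTR_MAX + 1] = {
    [ATTR_KIND]  = { POL_STRING, 16 },
    [ATTR_CHAIN] = { POL_U32, 0 },
};

struct attr {
    uint16_t type;
    uint16_t len;
    const void *data;
};

/* Return 0 if the attribute satisfies its policy, -1 otherwise. */
int validate_attr(const struct attr *a)
{
    const struct policy *p;

    if (a->type > ATTR_MAX)
        return 0; /* unknown types are skipped in this sketch */

    p = &sketch_policy[a->type];
    switch (p->type) {
    case POL_U32:
        return a->len == sizeof(uint32_t) ? 0 : -1;
    case POL_STRING:
        return (a->len > 0 && a->len <= p->max_len) ? 0 : -1;
    default:
        return 0; /* no constraint recorded for this type */
    }
}
```

Validating before use means a handler never reads four bytes from an
attribute that only carries two.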

Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
Fixes: d47a6b0e7c492 ("net: sched: introduce ingress/egress block index attributes for qdisc")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agortnetlink: fix rtnl_fdb_dump() for ndmsg header
Mauricio Faria de Oliveira [Tue, 2 Oct 2018 01:46:40 +0000 (22:46 -0300)]
rtnetlink: fix rtnl_fdb_dump() for ndmsg header

Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').

The problem is, the function bails out early if nlmsg_parse() fails, which
does occur for iproute2 usage of 'struct ndmsg' because the payload length
is shorter than the family header alone (as 'struct ifinfomsg' is assumed).

This breaks backward compatibility with userspace -- nothing is sent back.

Some examples with iproute2 and netlink library for go [1]:

 1) $ bridge fdb show
    33:33:00:00:00:01 dev ens3 self permanent
    01:00:5e:00:00:01 dev ens3 self permanent
    33:33:ff:15:98:30 dev ens3 self permanent

      This one works, as it uses 'struct ifinfomsg'.

      fdb_show() @ iproute2/bridge/fdb.c
        """
        .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
        ...
        if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
        """

 2) $ ip --family bridge neigh
    RTNETLINK answers: Invalid argument
    Dump terminated

      This one fails, as it uses 'struct ndmsg'.

      do_show_or_flush() @ iproute2/ip/ipneigh.c
        """
        .n.nlmsg_type = RTM_GETNEIGH,
        .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
        """

 3) $ ./neighlist
    < no output >

      This one fails, as it is 'struct ndmsg'-based.

      neighList() @ netlink/neigh_linux.go
        """
        req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
        msg := Ndmsg{
        """

The actual breakage was introduced by commit 0ff50e83b512 ("net: rtnetlink:
bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
if the payload length (with the _actual_ family header) is less than the
family header length alone (which is assumed, in parameter 'hdrlen').
This is true in the examples above with struct ndmsg, with size and payload
length shorter than struct ifinfomsg.

However, that commit just intends to fix something under the assumption the
family header is indeed a 'struct ifinfomsg' - by preventing access to the
payload as such (via 'ifm' pointer) if the payload length is not sufficient
to actually contain it.

The assumption was introduced by commit 5e6d24358799 ("bridge: netlink dump
interface at par with brctl"), to support iproute2's 'bridge fdb' command
(not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.

So, in order to unbreak the 'struct ndmsg' family headers and still allow
'struct ifinfomsg' to continue to work, check for the known message sizes
used with 'struct ndmsg' in iproute2 (with zero or one attribute which is
not used in this function anyway) then do not parse the data as ifinfomsg.
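
The size check can be sketched in userspace as follows. The two structs are
local mirrors of the uapi layouts so that the sketch builds without kernel
headers, and payload_is_ndmsg() is a hypothetical name. Treating the
optional attribute as a single u32 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of 'struct ifinfomsg' (16 bytes). */
struct sketch_ifinfomsg {
    uint8_t  family;
    uint8_t  pad;
    uint16_t type;
    int32_t  index;
    uint32_t flags;
    uint32_t change;
};

/* Local mirror of 'struct ndmsg' (12 bytes). */
struct sketch_ndmsg {
    uint8_t  family;
    uint8_t  pad1;
    uint16_t pad2;
    int32_t  ifindex;
    uint16_t state;
    uint8_t  flags;
    uint8_t  type;
};

#define SKETCH_NLA_HDRLEN 4u /* attribute header: u16 length + u16 type */

/* True if the payload is an ndmsg with zero or one u32 attribute. */
bool payload_is_ndmsg(size_t payload_len)
{
    size_t bare = sizeof(struct sketch_ndmsg);
    size_t with_attr = bare + SKETCH_NLA_HDRLEN + sizeof(uint32_t);

    return payload_len == bare || payload_len == with_attr;
}
```

With these layouts the ndmsg payloads are 12 or 20 bytes, while an
ifinfomsg payload is 16 bytes (24 with one u32 attribute), so the two
families of sizes do not collide.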

Same examples with this patch applied (or with the original fix reverted):

    $ bridge fdb show
    33:33:00:00:00:01 dev ens3 self permanent
    01:00:5e:00:00:01 dev ens3 self permanent
    33:33:ff:15:98:30 dev ens3 self permanent

    $ ip --family bridge neigh
    dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
    dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
    dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT

    $ ./neighlist
    netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
    netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
    netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}

Tested on mainline (v4.19-rc6) and net-next (3bd09b05b068).

References:

[1] netlink library for go (test-case)
    https://github.com/vishvananda/netlink

    $ cat ~/go/src/neighlist/main.go
    package main
    import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
    func main() {
        neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
        for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
    }

    $ export GOPATH=~/go
    $ go get github.com/vishvananda/netlink
    $ go build neighlist
    $ ~/go/src/neighlist/neighlist

Thanks to David Ahern for suggestions to improve this patch.

Fixes: 0ff50e83b512 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
Fixes: 5e6d24358799 ("bridge: netlink dump interface at par with brctl")
Reported-by: Aidan Obley <aobley@pivotal.io>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>