git.openwrt.org Git - openwrt/staging/blogic.git/log

tcp: higher throughput under reordering with adaptive RACK reordering wnd

Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4).Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is possibility that spurious retransmission was
  due to reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- At max, reo_wnd is incremented only once per rtt. So that the new
  DSACK on which we are reacting, is due to the spurious retx (approx)
  after the reo_wnd has been updated last time.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
  absolute value to account for change in rtt.

In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-parsing-stage'

Vivien Didelot says:

====================
net: dsa: parsing stage

When registering a DSA switch, there is basically two stages.

The first stage is the parsing of the switch device, from either device
tree or platform data. It fetches the DSA tree to which it belongs, and
validates its ports. The switch device is then added to the tree, and
the second stage is called if this was the last switch of the tree.

The second stage is the setup of the tree, which validates that the tree
is complete, sets up the routing tables, the default CPU port for user
ports, sets up the switch drivers and finally the master interfaces,
which makes the whole switch fabric functional.

This patch series covers the first parsing stage. It fixes the type of
the switch and tree indexes to unsigned int, simplifies the tree
reference counting and the switch and CPU ports parsing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: resolve tagging protocol at parse time

Extend the dsa_port_parse_cpu() function to resolve the tagging protocol
at port parsing time, instead of waiting for the whole tree to be
complete.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: add one port parsing function per type

Add dsa_port_parse_user, dsa_port_parse_dsa and dsa_port_parse_cpu
functions to factorize the code shared by both OF and pdata parsing.

They don't do much for the moment but will be extended later to support
tagging protocol resolution for example.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: only check presence of link property

When parsing a port, simply use of_property_read_bool which checks the
presence of a given property, instead of parsing the link phandle.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rework switch parsing

When parsing a switch, we have to identify to which tree it belongs and
parse its ports. Provide two functions to separate the OF and platform
data specific paths.

Also use the of_property_read_variable_u32_array function to parse the
OF member array instead of calling of_property_read_u32_index twice.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: get tree before parsing ports

We will need a reference to the dsa_switch_tree when parsing a CPU port,
so fetch it right after parsing the member and before parsing ports.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: rework switch addition and removal

This patch removes the unnecessary index argument from the
dsa_dst_add_ds and dsa_dst_del_ds functions and renames them to
dsa_tree_add_switch and dsa_tree_remove_switch respectively.

In addition to a more explicit scope, we now check the presence of an
existing switch with the same index directly within dsa_tree_add_switch.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: provide a find or new tree helper

Rename dsa_get_dst to dsa_tree_find since it doesn't increment the
reference counter, rename dsa_add_dst to dsa_tree_alloc for symmetry
with dsa_tree_free, and provide a convenient dsa_tree_touch function to
find or allocate a new tree.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: get and put tree reference counting

Provide convenient dsa_tree_get and dsa_tree_put functions scoping a DSA
tree used to increment and decrement its reference counter, instead of
poking directly its kref structure.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: simplify tree reference counting

DSA trees have a refcount used to automatically free the dsa_switch_tree
structure once there is no switch devices inside of it.

The refcount is incremented when a switch is added to the tree, and
decremented when it is removed from it.

But because of kref_init, the refcount is also incremented at
initialization, and when looking up the tree from the list for symmetry.

Thus the current code stores the number of switches plus one, and makes
the switch registration more complex.

To simplify the switch registration function, we reset the refcount to
zero after initialization and don't increment it when looking up a tree.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: make tree index unsigned

Similarly to a DSA switch and port, rename the tree index from "tree" to
"index" and make it an unsigned int because it isn't supposed to be less
than 0.

u32 is an OF specific data used to retrieve the value and has no need to
be propagated up to the tree index.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: make switch index unsigned

Define the DSA switch index as an unsigned int, because it will never be
less than 0.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: do not consider packets dropped by network stack as driver Rx dropped

netdev->rx_dropped was including packets dropped by napi_gro_receive.
If a packet is dropped by network stack, it should not be counted under
driver Rx dropped.

Made necessary changes to not include network stack drops under
netdev->rx_dropped.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: bpftool: move p_err() and p_info() from main.h to common.c

The two functions were declared as static inline in a header file. There
is no particular reason why they should be inlined, they just happened to
remain in the same header file when they were turned from macros to
functions in a precious commit.

Make them non-inlined functions and move them to common.c file instead.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-add-offload-as-a-first-class-citizen'

Jakub Kicinski says:

====================
bpf: add offload as a first class citizen

This series is my stab at what was discussed at a recent IOvisor
bi-weekly call.  The idea is to make the device translator run at
the program load time.  This makes the offload more explicit to
the user space.  It also makes it easy for the device translator
to insert information into the original verifier log.

v2:
- include linux/bug.h instead of asm/bug.h;
- rebased on top of Craig's verifier fix (no changes, the last patch
   just removes more code now).  I checked the set doesn't conflict
   with Jiri's, Josef's or Roman's patches, but missed Craig's fix :(
v1:
- rename the ifindex member on load;
- improve commit messages;
- split nfp patches more.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: remove old offload/analyzer

Thanks to the ability to load a program for a specific device,
running verifier twice is no longer needed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: move to new BPF program offload infrastructure

Following steps are taken in the driver to offload an XDP program:

XDP_SETUP_PROG:
* prepare:
   - allocate program state;
   - run verifier (bpf_analyzer());
   - run translation;
* load:
   - stop old program if needed;
   - load program;
   - enable BPF if not enabled;
* clean up:
   - free program image.

With new infrastructure the flow will look like this:

BPF_OFFLOAD_VERIFIER_PREP:
  - allocate program state;
BPF_OFFLOAD_TRANSLATE:
   - run translation;
XDP_SETUP_PROG:
   - stop old program if needed;
   - load program;
   - enable BPF if not enabled;
BPF_OFFLOAD_DESTROY:
   - free program image.

Take advantage of the new infrastructure.  Allocation of driver
metadata has to be moved from jit.c to offload.c since it's now
done at a different stage.  Since there is no separate driver
private data for verification step, move temporary nfp_meta
pointer into nfp_prog.  We will now use user space context
offsets.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: move translation prepare to offload.c

struct nfp_prog is currently only used internally by the translator.
This means there is a lot of parameter passing going on, between
the translator and different stages of offload. Simplify things
by allocating nfp_prog in offload.c already.

We will now use kmalloc() to allocate the program area and only
DMA map it for the time of loading (instead of allocating DMA
coherent memory upfront).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: move program prepare and free into offload.c

Most of offload/translation prepare logic will be moved to
offload.c. To help git generate more reasonable diffs
move nfp_prog_prepare() and nfp_prog_free() functions
there as a first step.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: require seamless reload for program replace

Firmware supports live replacement of programs for quite some
time now. Remove the software-fallback related logic and
depend on the FW for program replace. Seamless reload will
become a requirement if maps are present, anyway.

Load and start stages have to be split now, since replace
only needs a load, start has already been done on add.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: refactor offload logic

We currently create a fake cls_bpf offload object when we want
to offload XDP.  Simplify and clarify the code by moving the
TC/XDP specific logic out of common offload code.  This is easy
now that we don't support legacy TC actions.  We only need the
bpf program and state of the skip_sw flag.

Temporarily set @code to NULL in nfp_net_bpf_offload(), compilers
seem to have trouble recognizing it's always initialized.  Next
patches will eliminate that variable.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: remove unnecessary include of nfp_net.h

BPF offload's main header does not need to include nfp_net.h.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: remove the register renumbering leftovers

The register renumbering was removed and will not be coming back
in its old, naive form, given that it would be fundamentally
incompatible with calling functions. Remove the leftovers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: bpf: drop support for cls_bpf with legacy actions

Only support BPF_PROG_TYPE_SCHED_CLS programs in direct
action mode. This simplifies preparing the offload since
there will now be only one mode of operation for that type
of program. We need to know the attachment mode type of
cls_bpf programs, because exit codes are interpreted
differently for legacy vs DA mode.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cls_bpf: allow attaching programs loaded for specific device

If TC program is loaded with skip_sw flag, we should allow
the device-specific programs to be accepted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: allow attaching programs loaded for specific device

Pass the netdev pointer to bpf_prog_get_type(). This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpftool: print program device bound info

If program is bound to a device, print the name of the relevant
interface or unknown if the netdev has since been removed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: report offload info to user space

Extend struct bpf_prog_info to contain information about program
being bound to a device. Since the netdev may get destroyed while
program still exists we need a flag to indicate the program is
loaded for a device, even if the device is gone.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: offload: add infrastructure for loading programs for a specific netdev

The fact that we don't know which device the program is going
to be used on is quite limiting in current eBPF infrastructure.
We have to reverse or limit the changes which kernel makes to
the loaded bytecode if we want it to be offloaded to a networking
device.  We also have to invent new APIs for debugging and
troubleshooting support.

Make it possible to load programs for a specific netdev.  This
helps us to bring the debug information closer to the core
eBPF infrastructure (e.g. we will be able to reuse the verifer
log in device JIT).  It allows device JITs to perform translation
on the original bytecode.

__bpf_prog_get() when called to get a reference for an attachment
point will now refuse to give it if program has a device assigned.
Following patches will add a version of that function which passes
the expected netdev in. @type argument in __bpf_prog_get() is
renamed to attach_type to make it clearer that it's only set on
attachment.

All calls to ndo_bpf are protected by rtnl, only verifier callbacks
are not.  We need a wait queue to make sure netdev doesn't get
destroyed while verifier is still running and calling its driver.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bpf: rename ndo_xdp to ndo_bpf

ndo_xdp is a control path callback for setting up XDP in the
driver. We can reuse it for other forms of communication
between the eBPF stack and the drivers. Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mISDN: l1oip_core: replace _manual_ swap with swap macro

Make use of the swap macro and remove unnecessary variables skb and cnt.
This makes the code easier to read and maintain.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: plip: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114893
Addresses-Coverity-ID: 114894
Addresses-Coverity-ID: 114895
Addresses-Coverity-ID: 114896
Addresses-Coverity-ID: 114897
Addresses-Coverity-ID: 114898
Addresses-Coverity-ID: 114899
Addresses-Coverity-ID: 114900
Addresses-Coverity-ID: 114901
Addresses-Coverity-ID: 114902
Addresses-Coverity-ID: 114903
Addresses-Coverity-ID: 114904
Addresses-Coverity-ID: 114905
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: do not clear again skb->csum in tcp_init_nondata_skb()

tcp_init_nondata_skb() is fed with freshly allocated skbs.
They already have a cleared csum field, no need to clear it again.

This is based on Neal review on commit 3b11775033dc ("tcp: do not mangle
skb->cb[] in tcp_make_synack()"), noticing I did not clear skb->csum.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tcp_mtu_probing() cleanup

Reduce one indentation level to make code more readable.
tcp_sync_mss() can be factorized.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa_eth: avoid uninitialized variable false-positive warning

We can now build this driver on ARM, so I ran into a randconfig build
warning that presumably had existed on powerpc already.

drivers/net/ethernet/freescale/dpaa/dpaa_eth.c: In function 'sg_fd_to_skb':
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:1712:18: error: 'skb' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I'm slightly changing the logic here, to make it obvious to the
compiler that 'skb' is always initialized.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'openvswitch-netns'

Flavio Leitner says:

====================
Allow openvswitch to query ports in another netns.

Today Open vSwitch users are moving internal ports to other namespaces and
although packets are flowing OK, the userspace daemon can't find out basic
information like if the port is UP or DOWN, for instance.

This patchset extends openvswitch API to retrieve the current netnsid of
a port. It will be used by the userspace daemon to find out in which netns
the port is located.

This patchset also extends the rtnetlink getlink call to accept and operate
on a given netnsid. More details are available in each patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: use netnsid to query interface

Currently, when an application gets netnsid from the kernel (for example as
the result of RTM_GETLINK call on one end of the veth pair), it's not much
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implemetns IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: reliable interface indentification in port dumps

This patch allows reliable identification of netdevice interfaces connected
to openvswitch bridges. In particular, user space queries the netdev
interfaces belonging to the ports for statistics, up/down state, etc.
Datapath dump needs to provide enough information for the user space to be
able to do that.

Currently, only interface names are returned. This is not sufficient, as
openvswitch allows its ports to be in different name spaces and the
interface name is valid only in its name space. What is needed and generally
used in other netlink APIs, is the pair ifindex+netnsid.

The solution is addition of the ifindex+netnsid pair (or only ifindex if in
the same name space) to vport get/dump operation.

On request side, ideally the ifindex+netnsid pair could be used to
get/set/del the corresponding vport. This is not implemented by this patch
and can be added later if needed.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: export peernet2id_alloc

It will be used by openvswitch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: remove IN6_ADDR_HSIZE from addrconf.h

IN6_ADDR_HSIZE is private to addrconf.c, move it here to avoid
confusion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: do not abuse IN6_ADDR_HSIZE

pktgen accidentally used IN6_ADDR_HSIZE, instead of using the size of an
IPv6 address.

Since IN6_ADDR_HSIZE recently was increased from 16 to 256, this old
bug is hitting us.

Fixes: 3f27fb23219e ("ipv6: addrconf: add per netns perturbation in inet6_addr_hash()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: cls_u32: use bitwise & rather than logical && on n->flags

Currently n->flags is being operated on by a logical && operator rather
than a bitwise & operator. This looks incorrect as these should be bit
flag operations. Fix this.

Detected by CoverityScan, CID#1460398 ("Logical vs. bitwise operator")

Fixes: 245dc5121a9b ("net: sched: cls_u32: call block callbacks for offload")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp_nv: use do_div() instead of expensive div64_u64()

Average RTT is 32-bit thus full 64-bit division is redundant.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

add support of IFF_XMIT_DST_RELEASE bit in vlan

Some time ago Eric Dumazet suggested a "hack the IFF_XMIT_DST_RELEASE
flag on the vlan netdev". But the last comment was "does not support
properly bonding/team.(If the real_dev->privflags IFF_XMIT_DST_RELEASE
bit changes, we want to update all the vlans at the same time )"

I've extended that patch to support changes of IFF_XMIT_DST_RELEASE in
bonding/team.
Both bonding and team call netdev_change_features() after recalculation
of features including priv_flags IFF_XMIT_DST_RELEASE bit. So the only
thing needed to support is to recheck this bit in
vlan_transfer_features().

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

phylink: make local function phylink_phy_change() static

Fixes the following sparse warnings:

drivers/net/phy/phylink.c:570:6: warning:
symbol 'phylink_phy_change' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-for-davem-2017-11-03' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.15

Mostly fixes this time, but also few new features.

Major changes:

wil6210

* remove ssid debugfs file

rsi

* add WOWLAN support for suspend, hibernate and shutdown states

ath10k

* add support for CCMP-256, GCMP and GCMP-256 ciphers on hardware
where it's supported (QCA99x0 and QCA4019)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Files removed in 'net-next' had their license header updated
in 'net'. We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: Fix an issue with multiple switchdev enable disables

Return success if the same dispatch function is being registered for
a given opcode and subcode, there by allow multiple switchdev enable
and disables.

Signed-off-by: Vijaya Mohan Guvva <vijaya.guvva@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Handle-changes-in-GRE-configuration'

Jiri Pirko says:

====================
mlxsw: Handle changes in GRE configuration

Petr says:

Until now, when an IP tunnel was offloaded by the mlxsw driver, the
offload was pretty much static, and changes in Linux configuration were
not reflected in the hardware. That led to discrepancies between traffic
flows in slow path and fast path. The work-around used to be to remove
all routes that forward to the netdevice and re-add them. This is
clearly suboptimal, but actually, as of the decap-only patchset, it's
not even enough anymore, and one needs to go all the way and simply drop
the tunnel and recreate it correctly.

With this patchset, the NETDEV_CHANGE events that are generated for
changes of up'd tunnel netdevices are captured and interpreted to
correctly reconfigure the HW in accordance with changes requested at the
software layer. In addition, NETDEV_CHANGEUPPER, NETDEV_UP and
NETDEV_DOWN are now handled not only for tunnel devices themselves, but
also for their bound devices. Each change is then translated to one or
more of the following updates to the HW configuration:

- refresh of offload of local route that corresponds to tunnel's local
  address
- refresh of the loopback RIF
- refresh of offloads of routes that forward to the changed tunnel
- removal of tunnel offloads

These tools are used to implement the following configuration changes:

- addition of a new offloadable tunnel with local address that conflicts
  with that of an already-offloaded tunnel (the existing tunnel is
  onloaded, the new one isn't offloaded)
- changes to TTL, TOS that make tunnel unsuitable for offloading
- changes to ikey, okey, remote
- changes to local, which when they cause conflict with another
  tunnel, lead to onloading of both newly-conflicting tunnels
- migration of a bound device of an offloaded tunnel device to a
  different VRF
- changes to what device is bound to a tunnel device (i.e. like what
  "ip tunnel change name g dev another" does)
- changes to up / down state of a bound device. A down bound device
  doesn't forward encapsulated traffic anymore, but decap still works.

This patchset starts with a suite of patches that adapt the existing
code base step by step to facilitate introduction of the offloading
code. The five substantial patches at the end then implement the changes
mentioned above.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Handle down of tunnel underlay

When the bound device of a tunnel device is down, encapsulated packets
are not egressed anymore, but tunnel decap still works. Extend
mlxsw_sp_nexthop_rif_update() to take IFF_UP into consideration when
deciding whether a given next hop should be offloaded.

Because the new logic was added to mlxsw_sp_nexthop_rif_update(), this
fixes the case where a newly-added tunnel has a down bound device, which
would previously be fully offloaded. Now the down state of the bound
device is noted and next hops forwarding to such tunnel are not
offloaded.

In addition to that, notice NETDEV_UP and NETDEV_DOWN of a bound device
to force refresh of tunnel encap route offloads.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Handle underlay device change

When a bound device of an IP-in-IP tunnel changes, such as through
'ip tunnel change name $name dev $dev', the loopback backing the tunnel
needs to be recreated.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Handle NETDEV_CHANGE on L3 tunnels

Changes to L3 tunnel netdevices (through `ip tunnel change' as well as
`ip link set') lead to NETDEV_CHANGE being generated on the tunnel
device. Because what is relevant for the tunnel in question depends on
the tunnel type, handling of the event is dispatched to the IPIP module
through a newly-added interface mlxsw_sp_ipip_ops.ol_netdev_change().

IPIP tunnels now remember the last set of tunnel parameters in struct
mlxsw_sp_ipip_entry.parms, and use it to figure out what exactly has
changed.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Support IPIP underlay VRF migration

When a bound device of a tunnel netdevice changes VRF, the loopback RIF
that backs the tunnel needs to be updated and existing encapsulating
routes need to be refreshed.

Note that several tunnels can share the same bound device, in which case
all the impacted tunnels need to be updated.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Onload conflicting tunnels

The approach for offloading IP tunnels implemented currently by mlxsw
doesn't allow two tunnels that have the same local IP address in the
same (underlay) VRF. Previously, offloads were introduced on demand as
encap routes were formed. When such a route was created that would cause
offload of a conflicting tunnel, mlxsw_sp_ipip_entry_create() would
detect it and return -EEXIST, which would propagate up and cause FIB
abort.

Now however IPIP entries are created as soon as an offloadable netdevice
is created, and the failure prevents creation of such device.
Furthermore, if the driver is installed at the point where such
conflicting tunnels exist, the failure actually prevents successful
modprobe.

Furthermore, follow-up patches implement handling of NETDEV_CHANGE due
to the local address change. However, NETDEV_CHANGE can't be vetoed. The
failure merely means that the offloads weren't updated, but the change
in Linux configuration is not rolled back. It is thus desirable to have
a robust way of handling these conflicts, which can later be reused for
handling NETDEV_CHANGE as well.

To fix this, when a conflicting tunnel is created, instead of failing,
simply pull the old tunnel to slow path and reject offloading the
new one.

Introduce two functions: mlxsw_sp_ipip_entry_demote_tunnel() and
mlxsw_sp_ipip_demote_tunnel_by_saddr() to handle this. Make them both
public, because they will be useful later on in this patchset.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Fix saddr deduction in mlxsw_sp_ipip_entry_create()

When trying to determine whether there are other offloaded tunnels with
the same local address, mlxsw_sp_ipip_entry_create() should look for a
tunnel with matching UL protocol, matching saddr, in the same VRF.
However instead of taking into account the UL protocol of the tunnel
netdevice (which mlxsw_sp_ipip_entry_saddr_matches() then compares to
the UL protocol of inspected IPIP entry), it deduces the UL protocol
from the inspected IPIP entry (and that's compared to itself).

This is currently immaterial, because only one tunnel type is offloaded,
and therefore the UL protocol always matches, but introducing support
for a tunnel with IPv6 underlay would uncover this error.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()

The work that needs to be done to update HW configuration in response to
changes is similar to what __mlxsw_sp_ipip_entry_update_tunnel() already
does, but with a number of twists: each change requires a different
subset of things to happen. Extend the function to support all these
uses, and allow finely-grained configuration of what should happen at
each call through a suite of function arguments.

Publish the updated function to allow use from the spectrum_ipip module.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Extract __mlxsw_sp_ipip_entry_update_tunnel()

The work that's done by mlxsw_sp_netdevice_ipip_ol_vrf_event() is a good
basis for a more versatile function that would take care of all sorts of
tunnel updates requests: __mlxsw_sp_ipip_entry_update_tunnel(). Extract
that function. Factor out a helper mlxsw_sp_ipip_entry_ol_lb_update() as
well.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Propagate extack for tunnel events

The function mlxsw_sp_rif_create() takes an extack parameter. So far,
for creation of loopback interfaces, NULL was passed. For some events
however the extack can be extracted and passed along. So do that for
NETDEV_CHANGEUPPER handler.

Use the opportunity to update the type of info argument that
mlxsw_sp_netdevice_ipip_ol_event() takes. Follow-up patches will
introduce handling of more changes, and some of them carry an extack as
well, but in an info structure of a different type. Though not strictly
erroneous (the pointer could be cast whichever way), it makes no sense
to pretend the value is always of a certain type, when in fact it isn't.
So change the prototype of the above-mentioned function as well.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_up_event()

The piece of logic to promote decap route, if any, is useful for generic
tunnel updates, not just for handling of NETDEV_UP events on tunnel
interfaces. Extract it to a separate function.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Make mlxsw_sp_netdevice_ipip_ol_up_event() void

This function only ever returns 0, so don't pretend it returns anything
useful and just make it void.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_down_event()

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_ipip: Split accessor functions

To implement NETDEV_CHANGE notifications on IP-in-IP tunnels, the
handler needs to figure out what actually changed, to understand how
exactly to update the offloads. It will do so by storing struct
ip_tunnel_parm with previous configuration, and comparing that to the
new version.

To facilitate these comparisons, extract the code that operates on
struct ip_tunnel_parm from the existing accessor functions, and make
those a thin wrapper that extracts tunnel parameters and dispatches.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Move mlxsw_sp_ipip_netdev_{s, d}addr{, 4}()

These functions ideologically belong to the IPIP module, and some
follow-up work will benefit from their presence there.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Extract mlxsw_sp_netdevice_ipip_can_offload()

Some of the code down the road needs this logic as well.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Rename IPIP-related netdevice handlers

To distinguish between events related to tunnel device itself and its
bound device, rename a number of functions related to handling tunneling
netdevice events to include _ol_ (for "overlay") in the name. That
leaves room in the namespace for underlay-related functions, which would
have _ul_ in the name.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'clk-fixes-for-linus' of git://git./linux/kernel/git/clk/linux

Pull clk fix from Stephen Boyd:
"One fix for USB clks on Uniphier PXs3 SoCs"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: uniphier: fix clock data for PXs3

Merge git://git./linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile fixes from Chris Metcalf:
"Two one-line bug fixes"

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: Implement ->set_state_oneshot_stopped()
tile: pass machine size to sparse

Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"One minor fix in the error leg of the qla2xxx driver (it oopses the
  system if we get an error trying to start the internal kernel thread).

  The fix is minor because the problem isn't often encountered in the
  field (although it can be induced by inserting the module in a low
  memory environment)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla2xxx: Fix oops in qla2x00_probe_one error path

arch/tile: Implement ->set_state_oneshot_stopped()

set_state_oneshot_stopped() is called by the clkevt core, when the
next event is required at an expiry time of 'KTIME_MAX'. This normally
happens with NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.

This patch makes the clockevent device to stop on such an event, to
avoid spurious interrupts, as explained by: commit 8fff52fd5093
("clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state").

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>

Merge tag 'powerpc-4.14-6' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 4.14.

  This is bigger than I like to send at rc7, but that's at least partly
  because I didn't send any fixes last week. If it wasn't for the IMC
  driver, which is new and getting heavy testing, the diffstat would
  look a bit better. I've also added ftrace on big endian to my test
  suite, so we shouldn't break that again in future.

   - A fix to the handling of misaligned paste instructions (P9 only),
     where a change to a #define has caused the check for the
     instruction to always fail.

   - The preempt handling was unbalanced in the radix THP flush (P9
     only). Though we don't generally use preempt we want to keep it
     working as much as possible.

   - Two fixes for IMC (P9 only), one when booting with restricted
     number of CPUs and one in the error handling when initialisation
     fails due to firmware etc.

   - A revert to fix function_graph on big endian machines, and then a
     rework of the reverted patch to fix kprobes blacklist handling on
     big endian machines.

  Thanks to: Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan,
  Naveen N. Rao, Nicholas Piggin, Paul Mackerras"

* tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/perf: Fix core-imc hotplug callback failure during imc initialization
  powerpc/kprobes: Dereference function pointers only if the address does not belong to kernel text
  Revert "powerpc64/elfv1: Only dereference function descriptor for non-text symbols"
  powerpc/64s/radix: Fix preempt imbalance in TLB flush
  powerpc: Fix check for copy/paste instructions in alignment handler
  powerpc/perf: Fix IMC allocation routine

Merge tag 'mmc-v4.14-rc4-3' of git://git./linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"Fix dw_mmc request timeout issues"

* tag 'mmc-v4.14-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: dw_mmc: Fix the DTO timeout calculation
  mmc: dw_mmc: Add locking to the CTO timer
  mmc: dw_mmc: Fix the CTO timeout calculation
  mmc: dw_mmc: cancel the CTO timer after a voltage switch

Merge tag 'drm-fixes-for-v4.14-rc8' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:

- one nouveau regression fix

- some amdgpu fixes for stable to fix hangs on some harvested Polaris
   GPUs

- a set of KASAN and regression fixes for i915, their CI system seems
   to be working pretty well now.

* tag 'drm-fixes-for-v4.14-rc8' of git://people.freedesktop.org/~airlied/linux:
  drm/amdgpu: allow harvesting check for Polaris VCE
  drm/amdgpu: return -ENOENT from uvd 6.0 early init for harvesting
  drm/i915: Check incoming alignment for unfenced buffers (on i915gm)
  drm/nouveau/kms/nv50: use the correct state for base channel notifier setup
  drm/i915: Hold rcu_read_lock when iterating over the radixtree (vma idr)
  drm/i915: Hold rcu_read_lock when iterating over the radixtree (objects)
  drm/i915/edp: read edp display control registers unconditionally
  drm/i915: Do not rely on wm preservation for ILK watermarks
  drm/i915: Cancel the modeset retry work during modeset cleanup

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:
"Hopefully this is the last batch of networking fixes for 4.14

  Fingers crossed...

   1) Fix stmmac to use the proper sized OF property read, from Bhadram
      Varka.

   2) Fix use after free in net scheduler tc action code, from Cong
      Wang.

   3) Fix SKB control block mangling in tcp_make_synack().

   4) Use proper locking in fib_dump_info(), from Florian Westphal.

   5) Fix IPG encodings in systemport driver, from Florian Fainelli.

   6) Fix division by zero in NV TCP congestion control module, from
      Konstantin Khlebnikov.

   7) Fix use after free in nf_reject_ipv4, from Tejaswi Tanikella"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: systemport: Correct IPG length settings
  tcp: do not mangle skb->cb[] in tcp_make_synack()
  fib: fib_dump_info can no longer use __in_dev_get_rtnl
  stmmac: use of_property_read_u32 instead of read_u8
  net_sched: hold netns refcnt for each action
  net_sched: acquire RTNL in tc_action_net_exit()
  net: vrf: correct FRA_L3MDEV encode type
  tcp_nv: fix division by zero in tcpnv_acked()
  netfilter: nf_reject_ipv4: Fix use-after-free in send_reset
  netfilter: nft_set_hash: disable fast_ops for 2-len keys

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"7 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, swap: fix race between swap count continuation operations
  mm/huge_memory.c: deposit page table when copying a PMD migration entry
  initramfs: fix initramfs rebuilds w/ compression after disabling
  fs/hugetlbfs/inode.c: fix hwpoison reserve accounting
  ocfs2: fstrim: Fix start offset of first cluster group during fstrim
  mm, /proc/pid/pagemap: fix soft dirty marking for PMD migration entry
  userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_size

Update MIPS email addresses

MIPS will soon not be a part of Imagination Technologies, and as such
many @imgtec.com email addresses will no longer be valid. This patch
updates the addresses for those who:

- Have 10 or more patches in mainline authored using an @imgtec.com
   email address, or any patches dated within the past year.

- Are still with Imagination but leaving as part of the MIPS business
   unit, as determined from an internal email address list.

- Haven't already updated their email address (ie. JamesH) or expressed
   a desire to be excluded (ie. Maciej).

- Acked v2 or earlier of this patch, which leaves Deng-Cheng, Matt &
   myself.

New addresses are of the form firstname.lastname@mips.com, and all
verified against an internal email address list.  An entry is added to
.mailmap for each person such that get_maintainer.pl will report the new
addresses rather than @imgtec.com addresses which will soon be dead.

Instances of the affected addresses throughout the tree are then
mechanically replaced with the new @mips.com address.

Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@imgtec.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@mips.com>
Acked-by: Dengcheng Zhu <dengcheng.zhu@mips.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Matt Redfearn <matt.redfearn@mips.com>
Acked-by: Matt Redfearn <matt.redfearn@mips.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: trivial@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo

Commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for
/proc/cpuinfo "cpu MHz"") is not sufficient to restore the previous
behavior of "cpu MHz" in /proc/cpuinfo on x86 due to some changes
made after the commit it has reverted.

To address this, make the code in question use arch_freq_get_on_cpu()
which also is used by cpufreq for reporting the current frequency of
CPUs and since that function doesn't really depend on cpufreq in any
way, drop the CONFIG_CPU_FREQ dependency for the object file
containing it.

Also refactor arch_freq_get_on_cpu() somewhat to avoid IPIs and
return cached values right away if it is called very often over a
short time (to prevent user space from triggering IPI storms through
it).

Fixes: 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
Cc: stable@kernel.org # 4.13 - together with 890da9cf0983
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, swap: fix race between swap count continuation operations

One page may store a set of entries of the sis->swap_map
(swap_info_struct->swap_map) in multiple swap clusters.

If some of the entries has sis->swap_map[offset] > SWAP_MAP_MAX,
multiple pages will be used to store the set of entries of the
sis->swap_map.  And the pages are linked with page->lru.  This is called
swap count continuation.  To access the pages which store the set of
entries of the sis->swap_map simultaneously, previously, sis->lock is
used.  But to improve the scalability of __swap_duplicate(), swap
cluster lock may be used in swap_count_continued() now.  This may race
with add_swap_count_continuation() which operates on a nearby swap
cluster, in which the sis->swap_map entries are stored in the same page.

The race can cause wrong swap count in practice, thus cause unfreeable
swap entries or software lockup, etc.

To fix the race, a new spin lock called cont_lock is added to struct
swap_info_struct to protect the swap count continuation page list.  This
is a lock at the swap device level, so the scalability isn't very well.
But it is still much better than the original sis->lock, because it is
only acquired/released when swap count continuation is used.  Which is
considered rare in practice.  If it turns out that the scalability
becomes an issue for some workloads, we can split the lock into some
more fine grained locks.

Link: http://lkml.kernel.org/r/20171017081320.28133-1-ying.huang@intel.com
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/huge_memory.c: deposit page table when copying a PMD migration entry

We need to deposit pre-allocated PTE page table when a PMD migration
entry is copied in copy_huge_pmd(). Otherwise, we will leak the
pre-allocated page and cause a NULL pointer dereference later in
zap_huge_pmd().

The missing counters during PMD migration entry copy process are added
as well.

The bug report is here: https://lkml.org/lkml/2017/10/29/214

Link: http://lkml.kernel.org/r/20171030144636.4836-1-zi.yan@sent.com
Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

initramfs: fix initramfs rebuilds w/ compression after disabling

This is a follow-up to commit 57ddfdaa9a72 ("initramfs: fix disabling of
initramfs (and its compression)").  This particular commit fixed the use
case where we build the kernel with an initramfs with no compression,
and then we build the kernel with no initramfs.

Now this still left us with the same case as described here:

  http://lkml.kernel.org/r/20170521033337.6197-1-f.fainelli@gmail.com

not working with initramfs compression.  This can be seen by the
following steps/timestamps:

  https://www.spinics.net/lists/kernel/msg2598153.html

.initramfs_data.cpio.gz.cmd is correct:

  cmd_usr/initramfs_data.cpio.gz := /bin/bash
  ./scripts/gen_initramfs_list.sh -o usr/initramfs_data.cpio.gz  -u 1000 -g 1000  /home/fainelli/work/uclinux-rootfs/romfs /home/fainelli/work/uclinux-rootfs/misc/initramfs.dev

and was generated the first time we did generate the gzip initramfs, so
the command has not changed, nor its arguments, so we just don't call
it, no initramfs cpio is re-generated as a consequence.

The fix for this problem is just to properly keep track of the
.initramfs_cpio_data.d file by suffixing it with the compression
extension.  This takes care of properly tracking dependencies such that
the initramfs get (re)generated any time files are added/deleted etc.

Link: http://lkml.kernel.org/r/20170930033936.6722-1-f.fainelli@gmail.com
Fixes: db2aa7fd15e8 ("initramfs: allow again choice of the embedded initramfs compression algorithm")
Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Cc: "Francisco Blas Izquierdo Riera (klondike)" <klondike@xiscosoft.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/hugetlbfs/inode.c: fix hwpoison reserve accounting

Calling madvise(MADV_HWPOISON) on a hugetlbfs page will result in bad
(negative) reserved huge page counts.  This may not happen immediately,
but may happen later when the underlying file is removed or filesystem
unmounted.  For example:

  AnonHugePages:         0 kB
  ShmemHugePages:        0 kB
  HugePages_Total:       1
  HugePages_Free:        0
  HugePages_Rsvd:    18446744073709551615
  HugePages_Surp:        0
  Hugepagesize:       2048 kB

In routine hugetlbfs_error_remove_page(), hugetlb_fix_reserve_counts is
called after remove_huge_page.  hugetlb_fix_reserve_counts is designed
to only be called/used only if a failure is returned from
hugetlb_unreserve_pages.  Therefore, call hugetlb_unreserve_pages as
required and only call hugetlb_fix_reserve_counts in the unlikely event
that hugetlb_unreserve_pages returns an error.

Link: http://lkml.kernel.org/r/20171019230007.17043-2-mike.kravetz@oracle.com
Fixes: 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: fstrim: Fix start offset of first cluster group during fstrim

The first cluster group descriptor is not stored at the start of the
group but at an offset from the start.  We need to take this into
account while doing fstrim on the first cluster group.  Otherwise we
will wrongly start fstrim a few blocks after the desired start block and
the range can cross over into the next cluster group and zero out the
group descriptor there.  This can cause filesytem corruption that cannot
be fixed by fsck.

Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, /proc/pid/pagemap: fix soft dirty marking for PMD migration entry

When the pagetable is walked in the implementation of /proc/<pid>/pagemap,
pmd_soft_dirty() is used for both the PMD huge page map and the PMD
migration entries. That is wrong, pmd_swp_soft_dirty() should be used
for the PMD migration entries instead because the different page table
entry flag is used.

As a result, /proc/pid/pagemap may report incorrect soft dirty information
for PMD migration entries.

Link: http://lkml.kernel.org/r/20171017081818.31795-1-ying.huang@intel.com
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

userfaultfd: hugetlbfs: prevent UFFDIO_COPY to fill beyond the end of i_size

This oops:

  kernel BUG at fs/hugetlbfs/inode.c:484!
  RIP: remove_inode_hugepages+0x3d0/0x410
  Call Trace:
    hugetlbfs_setattr+0xd9/0x130
    notify_change+0x292/0x410
    do_truncate+0x65/0xa0
    do_sys_ftruncate.constprop.3+0x11a/0x180
    SyS_ftruncate+0xe/0x10
    tracesys+0xd9/0xde

was caused by the lack of i_size check in hugetlb_mcopy_atomic_pte.

mmap() can still succeed beyond the end of the i_size after vmtruncate
zapped vmas in those ranges, but the faults must not succeed, and that
includes UFFDIO_COPY.

We could differentiate the retval to userland to represent a SIGBUS like
a page fault would do (vs SIGSEGV), but it doesn't seem very useful and
we'd need to pick a random retval as there's no meaningful syscall
retval that would differentiate from SIGSEGV and SIGBUS, there's just
-EFAULT.

Link: http://lkml.kernel.org/r/20171016223914.2421-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'net-mini_Qdisc'

Jiri Pirko says:

====================
net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

This patchset's main patch is patch number 2. It carries the
description. Patch 1 is just a dependency.

---
v3->v4:
- rebased to be applicable on top of the current net-next
v2->v3:
- Using head change callback to replace miniq pointer every time tp head
changes. This eliminates one rcu dereference and makes the claim "without
added overhead" valid.
v1->v2:
- Use dev instead of skb->dev in sch_handle_egress as pointed out by Daniel
- Fixed synchronize_rcu_bh() in mini_qdisc_disable and commented
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: introduce chain_head_change callback

Add a callback that is to be called whenever head of the chain changes.
Also provide a callback for the default case when the caller gets a
block using non-extended getter.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hns3-ethtool-ksettings'

Lipeng says:

====================
net: hns3: support set_link_ksettings and for nway_reset ethtool command

This patch-set adds support for set_link_ksettings && for nway_resets
ethtool command and fixes some related ethtool bugs.
1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
2, patch[5/6] adds support ethtool_ops.for nway_reset.
3, patch[1/6,2/6,3/6,6/6] fix some bugs for getting port information by
ethtool command(ethtool ethx).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix a bug for phy supported feature initialization

This patch fixes a bug for phy supported feature initialization.
Currently, the value of phydev->supported is initialized by kernel.
So it includes many features that we do not support, such as
SUPPORTED_FIBRE and SUPPORTED_BNC. This patch fixes it.

Fixes: 256727d (net: hns3: Add MDIO support to HNS3 Ethernet driver for hip08 SoC)
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add support for nway_reset

This patch adds nway_reset support for ethtool cmd.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: add support for set_link_ksettings

This patch adds set_link_ksettings support for ethtool cmd.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix a bug in hns3_driv_to_eth_caps

The value of link_modes.advertising and the value of link_modes.supported
is initialized to zero every time in for loop in hns3_driv_to_eth_caps().
But we just want to set specified bit for them. Initialization is
unnecessary. This patch fixes it.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix for getting advertised_caps in hns3_get_link_ksettings

This patch fixes a bug for ethtool's get_link_ksettings().
The advertising for autoneg is always added to advertised_caps
whether autoneg is enable or disable. This patch fixes it.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns3: fix for getting autoneg in hns3_get_link_ksettings

This patch fixes a bug for ethtool's get_link_ksettings().
When phy exists, we should get autoneg from phy rather than from mac.
Because the value of mac.autoneg is invalid when phy exists.

Fixes: 496d03e (net: hns3: Add Ethtool support to HNS3 driver)
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-next'

Michael Chan says:

====================
bnxt_en: Fix IRQ coalescing regressions.

There was a typo and missing guard-rail against illegal values in the
recent code clean up. All reported by Andy Gospodarek.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Fix IRQ coalescing regression.

Recent IRQ coalescing clean up has removed a guard-rail for the max DMA
buffer coalescing value. This is a 6-bit value and must not be 0. We
already have a check for 0 but 64 is equivalent to 0 and will cause
non-stop interrupts. Fix it by adding the proper check.

Fixes: f8503969d27b ("bnxt_en: Refactor and simplify coalescing code.")
Reported-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: fix typo in bnxt_set_coalesce

Recent refactoring of coalesce settings contained a typo that prevents
receive settings from being set properly.

Fixes: 18775aa8a91f ("bnxt_en: Reorganize the coalescing parameters.")
Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: check NULL in tcf_block_put()

Callers of tcf_block_put() could pass NULL so
we can't use block->q before checking if block is
NULL or not.

tcf_block_put_ext() callers are fine, it is always
non-NULL.

Fixes: 8c4083b30e56 ("net: sched: add block bind/unbind notif. and extended block_get/put")
Reported-by: Dave Taht <dave.taht@gmail.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tcp_fragment() should not assume rtx skbs

While stress testing MTU probing, we had crashes in list_del() that we root-caused
to the fact that tcp_fragment() is unconditionally inserting the freshly allocated
skb into tsorted_sent_queue list.

But this list is supposed to contain skbs that were sent.
This was mostly harmless until MTU probing was enabled.

Fortunately we can use the tcp_queue enum added later (but in same linux version)
for rtx-rb-tree to fix the bug.

Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK recovery")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Priyaranjan Jha <priyarjha@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mISDN: hfcpci: Convert timers to use timer_setup()

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Geliang Tang <geliangtang@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>