git.openwrt.org Git - openwrt/staging/blogic.git/log

Merge branch 'mlx5-flow-steering'

Saeed Mahameed says:

====================
mlx5 improved flow steering management

First two patches fixes some minor issues in recently
introduced SRIOV code.

The other seven patches modifies the driver's code that
manages flow steering rules with Connectx-4 devices.

Basic introduction:

The flow steering device specification model is composed of the following entities:

Destination (either a TIR/Flow table/vport), where TIR is RSS end-point, vport
is the VF eSwitch port in SRIOV.

Flow table entry (FTE) - the values used by the flow specification
Flow table group (FG) - the masks used by the flow specification
Flow table (FT) - groups several FGs and can serve as destination

The flow steering software entities:

In addition to the device objects, the software have two more objects:

Priorities - group several FTs. Handles order of packet matching.

Namespaces - group several priorities. Namespace are used in order to
isolate different usages of steering (for example, add two separate
namespaces, one for the NIC driver and one for E-Switch FDB).

The base data structure for the flow steering management is a tree and
all the flow steering objects such as (Namespace/Flow table/Flow Group/FTE/etc.)
are represented as a node in the tree, e.g.:
Priority-0 -> FT1 -> FG -> FTE -> TIR (destination)
Priority-1 -> FT2 -> FG->  FTE -> TIR (destination)

Matching begins in FT1 flow rules and if there is a miss on all the FTEs
then matching continues on the FTEs in FT2.

The new implementation solves/improves the following
issues in the current code:

1) The new impl. supports multiple destinations, the search for existing rule with
   the same matching value is performed by the flow steering management.
   In the current impl. the E-switch FDB management code needs to search
   for existing rules before calling to the add rule function.

2) The new impl. manages the flow table level, in the current implementation the
   consumer states the flow table level when new flow table is created without
   any knowledge about the levels of other flow tables.

3) In the current impl. the consumer can't create or destroy flow
   groups dynamically, the flow groups are passed as argument to the create
   flow table API. The new impl. exposes API for create/destroy flow group.

The series is built as follows:

Patch #1 add flow steering API firmware commands.

Patch #2 add tree operation of the flow steering tree: add/remove node,
initialize node and take reference count on a node.

Patch #3 add essential algorithms for managing the flow steering.

Patch #4 Initialize the flow steering tree, flow steering initialization is based
on static tree which illustrates the flow steering tree when the driver is loaded.

Patch #5 is the main patch of the series. It introduce the flow steering API.

Patch #6 Expose the new flow steering API and remove the old one.
The Ethernet flow steering follows the existing implementation,
but uses the new steering API.

Patch #7 Rename en_flow_table.c to en_fs.c in order to be aligned with
the new flow steering files.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Rename en_flow_table.c to en_fs.c

Rename en_flow_table.c to en_fs.c in order to be aligned
with the new flow steering files.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Use flow steering infrastructure for mlx5_en

Expose the new flow steering API and remove the old
one.

Few changes are required:

1. The Ethernet flow steering follows the existing implementation, but uses
the new steering API. The old flow steering implementation is removed.

2. Move the E-switch FDB management to use the new API.

3. When driver is loaded call to mlx5_init_fs which initialize
the flow steering tree structure, open namespaces for NIC receive
and for E-switch FDB.

4. Call to mlx5_cleanup_fs when the driver is unloaded.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5_core: Flow steering tree initialization

Flow steering initialization is based on static tree which
illustrates the flow steering tree when the driver is loaded. The
initialization considers the max supported flow table level of the device,
a minimum of 2 kernel flow tables(vlan and mac) are required to have
kernel flow table functionality.

The tree structures when the driver is loaded:

root_namespace(receive nic)
  |
priority-0 (kernel priority)
  |
namespace(kernel namespace)
  |
priority-0 (flow tables priority)

In the following patches, When the EN driver will use the flow steering
API, it create two flow tables and their flow groups under
priority-0(flow tables priority).

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5_core: Introduce flow steering API

Introducing the following objects:

mlx5_flow_root_namespace: represent the root of specific flow table
type tree(e.g NIC receive, FDB, etc..)

mlx5_flow_group: define the mask of the flow specification.

fs_fte(flow steering flow table entry): defines the value of the
flow specification.

The following describes the relationships between the tree objects:
root_namespace --> priorities -->namespaces -->
priorities -->flow-tables --> flow-groups -->
flow-entries --> destinations

When we create new object(flow table/flow group/flow table entry), we
call to the FW command and then we add the related sw object to the tree.

When we destroy object, e.g. call to mlx5_destroy_flow_table, we use
the tree node destructor for destroying the FW object and remove the
node from the tree.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5_core: Add flow steering lookup algorithms

Introduce the flow steering mlx5_flow_namespace (Namespace)
and fs_prio (Flow Steering Priority) tree nodes.

Namespaces are used in order to isolate different usages or types
of steering (for example, downstream patches will add a different
namespaces for the NIC driver and for E-Switch FDB usages).

Flow Steering Priorities are objects that describes priorities
ranges between different flow objects under the same namespace.

Example, entries in priority i are matched before entries
in priority i+1.

This patch adds the following algorithms:

1) Calculate level:
Each flow table has level(the priority between the flow tables).
When we initialize the flow steering tree, we assign range of levels
to each priority, therefore the level for new flow table is
the location within the priority related to the range of the priority.

2) Match between match criteria. This function is used
for searching flow group when new flow rule is added.

3) Match between match values. This function is used
for searching flow table entry when new flow rule is added.

4) Add essential macros for traversing on a node's children.
E.g. traversing on all the flow table of some priority

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5_core: Add flow steering base data structures

Introducing the base data structure and its operations that are
going to represent ConnectX-4 Flow Steering, this data structure
is basically a tree and all Flow steering objects such as
(Flow Table/Flow Group/FTE/etc ..) are represented as fs_node(s).

fs_node is the base object which describes a basic tree node, with the
following extra info:
    type: describes the runtime type of the node (Object).
    lock: lock this node sub-tree.
    ref_count: number of children + current references.
    remove_func: a generic destructor.

fs_node types will be used and explained once the usage is added in the
following patches.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5_core: Introduce flow steering firmware commands

Introduce new Flow Steering (FS) firmware commands,
in-order to support the new flow steering infrastructure.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Assign random MAC address if needed

Under SRIOV there might be a case where VFs are loaded
without pre-assigned MAC address. In this case, the VF
will randomize its own MAC. This will address the case
of administrator not assigning MAC to the VF through
the PF OS APIs and keep udev happy.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Fix query E-Switch capabilities

E-Switch capabilities should be queried only if E-Switch flow table
is supported and not only when vport group manager.

Fixes: d6666753c6e8 ("net/mlx5: E-Switch, Introduce HCA cap and E-Switch vport context")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'thunderx-pass2'

Sunil Goutham says:

====================
net: thunderx: Support for pass-2 hw features

This patch set adds support for new features added in pass-2 revision
of hardware like TSO and count based interrupt coalescing.

Changes from v1:
- Addressed comments received regarding boolean bit field changes
  by excluding them from this patch. Will submit a seperate
  patch along with cleanup of unsed field.
- Got rid of new macro 'VNIC_NAPI_WEIGHT' introduced in
  count threshold interrupt patch.
====================

Reviewed-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Enable CQE count threshold interrupt

This feature is introduced in pass-2 chip and with this CQ interrupt
coalescing will work based on both timer and count.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: HW TSO support for pass-2 hardware

This adds support for offloading TCP segmentation to HW in pass-2
revision of hardware. Both driver level SW TSO for pass1.x chips
and HW TSO for pass-2 chip will co-exist. Modified SQ descriptor
structures to reflect pass-2 hw implementation.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Doc: Micrel-ksz90x1.txt: Document deprecated MAC OF properties

Phy properties are expected to be found in the PHY OF node. However
this Micrel driver also allows them to be placed into the MAC OF node.
This is deprecated. Document it as such, and remove the example using
the deprecated method to prevent people copying it into new device
tree files.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mvneta-rss-xps'

Gregory CLEMENT says:

====================
mvneta: Introduce RSS support and XPS configuration

this series is the first step add RSS support on mvneta.

It will allow associating an ethernet interface to a given CPU through
RSS by using "ethtool -X ethX weight". Indeed, currently I only enable
one entry in the RSS lookup table. Even if it is not really RSS, it
allows to get back the irq affinity feature we lost by using the
percpu interrupt.

The main change compared to the second version is the setup for the XPS
instead of using specific hack inside the driver in the forth
patch.

Th first patch make the default queue associate to each port and no
more a global variable.

The second patch really associates the RX queues with the CPUs instead
of masking the percpu interrupts for doing it. All the RX queues are
enabled and are statically associated with the CPUs by using a modulo
of the number of present CPUs. But at this stage only one RX queue
will receive the stream.

The third patch introduces a first level of RSS support through the
ethtool functions. As explained in the introduction there is only one
entry in the RSS lookup table which permits at the end to associate an
mvneta port to a CPU through the RX queues because the mapping is
static.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: Configure XPS support

With this patch each CPU is associated with its own set of TX queues.

It also setup the XPS with an initial configuration which set the
affinity matching the hardware configuration.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: Add naive RSS support

This patch adds the support for the RSS related ethtool
function. Currently it only uses one entry in the indirection table which
allows associating an mvneta interface to a given CPU.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Tested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: Associate RX queues with each CPU

We enable the percpu interrupt for all the CPU and we just associate a
CPU to a few queue at the neta level. The mapping between the CPUs and
the queues is static. The queues are associated to the CPU module the
number of CPUs. However currently we only use on RX queue for a given
Ethernet port.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: Make the default queue related for each port

Instead of using the same default queue for all the port. Move it in the
port struct. It will allow have a different default queue for each port.

Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls_iptunnel: add static qualifier to mpls_output

This gets rid of the following compile warn:
net/mpls/mpls_iptunnel.c:40:5: warning: no previous prototype for
mpls_output [-Wmissing-prototypes]

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Handle clip return values

Add a warn message when clip table overflows. If clip table isn't
allocated, return from cxgb4_clip_release() to avoid panic.
Disable offload if clip isn't enabled in the hardware.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: remove an unneeded condition

We already know "err" is zero so there is no need to check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: fix some error handling

The "err = " assignment is missing here.

Fixes: 0d65fc13042f ('mlxsw: spectrum: Implement LAG port join/leave')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netcp: add more __le32 annotations

The handling of epib and psdata remains a bit unclear in the driver,
as we access the same fields both as CPU-endian and through DMA
from the device.

Sparse warns about this:
ti/netcp_core.c:1147:21: warning: incorrect type in assignment (different base types)
ti/netcp_core.c:1147:21: expected unsigned int [usertype] *[assigned] epib
ti/netcp_core.c:1147:21: got restricted __le32 *<noident>

This uses __le32 types in a few places and uses __force where the code
looks fishy. The previous patch should really have produced the correct
behavior, but this second patch is needed to shut up the warnings about
it. Ideally it would be slightly rewritten to not need those casts,
but I don't dare do that without access to the hardware for proper
testing.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

netcp: try to reduce type confusion in descriptors

The netcp driver produces tons of warnings when CONFIG_LPAE is enabled
on ARM:

drivers/net/ethernet/ti/netcp_core.c: In function 'netcp_tx_map_skb':
drivers/net/ethernet/ti/netcp_core.c:1084:13: warning: passing argument 1 of 'set_words' from incompatible pointer type [-Wincompatible-pointer-types]

This is the result of trying to pass a pointer to a dma_addr_t to
a function that expects a u32 pointer to copy that into a DMA descriptor.

Looking at that code in more detail to fix the warnings, I see multiple
related problems:

* The conversion functions are not endian-safe, as the DMA descriptors
  are almost certainly fixed-endian, but the CPU is not.

* On 64-bit machines, passing a pointer through a u32 variable is a
  bug, accessing an indirect pointer as a u32 pointer even more so.

* The handling of epib and psdata mixes native-endian and device-endian
  data.

In this patch, I try to sort out the types for most accesses here,
adding le32_to_cpu/cpu_to_le32 where appropriate, and passing pointers
through two 32-bit words in the descriptor padding, to make it plausible
that the driver does the right thing if compiled for big-endian or
64-bit systems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

cgroup: fix sock_cgroup_data initialization on earlier compilers

sock_cgroup_data is a struct containing an anonymous union.
sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were
initializing a field inside the anonymous union as follows.

struct sock_ccgroup_data skcd_buf = { .val = VAL };

While this is fine on more recent compilers, gcc-4.4.7 triggers the
following errors.

include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’:
include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer
include/linux/cgroup-defs.h:619: warning: missing braces around initializer
include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’)

This is because .val belongs to the anonymous union nested inside the
struct but the initializer is missing the nesting. Fix it by adding
an extra pair of braces.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alaa Hleihel <alaa@dev.mellanox.co.il>
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: David S. Miller <davem@davemloft.net>

chelsio: constify cmac_ops structures

The cmac_ops structures are never modified, so declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: remove rx_pkt/rx_calls

These fields are updated but never read.
Remove the overhead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: avoid soft lockup in bnx2x_poll()

Under heavy TX load, bnx2x_poll() can loop forever and trigger
soft lockup bugs.

A napi poll handler must yield after one TX completion round,
risk of livelock is too high otherwise.

Bug is very easy to trigger using a debug build, and udp flood, because
of added cpu cycles in TX completion, and we do not receive enough
packets to break the loop.

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rhashtable: Remove unnecessary wmb for future_tbl

The patch 9497df88ab5567daa001829051c5f87161a81ff0 ("rhashtable:
Fix reader/rehash race") added a pair of barriers. In fact the
wmb is superfluous because every subsequent write to the old or
new hash table uses rcu_assign_pointer, which itself carriers a
full barrier prior to the assignment.

Therefore we may remove the explicit wmb.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-update-kconfig-and-fixes'

Hariprasad Shenai says:

====================
Update Kconfig and some fixes for cxgb4

This series update Kconfig to add description for Chelsio's next
generation T6 family of adapters, also fixes ethtool stats alignment
and prevents simultaneous execution of service_ofldq thread, deals with
queue wrap around and adds some fl counters for debugging purpose and
device ID for new T5 adapters.

This patch series has been created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.

Thanks

V2: Declare 'service_ofldq_running' as bool in Patch 4/7 ("cxgb4: prevent
simultaneous execution of service_ofldq()") based on review comment
by David Miller
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Adds PCI device id for new T5 adapters

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add FL DMA mapping error and low counter

Add Free List DMA Mapping Errors to SGE Queue info for
Free Lists. Add Free List "Low" counter to count the number of times we
see the number of pointers that we _think_ the hardware sees in the
Free List below the Egress Threshold.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Deal with wrap-around of queue for Work request

The WR headers may not fit within one descriptor.
So we need to deal with wrap-around here.

Based on original patch by Pranjal Joshi <pjoshi@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: prevent simultaneous execution of service_ofldq()

Change mutual exclusion mechanism to prevent multiple threads of
execution from running in service_ofldq() at the same time. The old
mechanism used an implicit guard on the down-call path and none on the
restart path and wasn't working. This checking makes the mechanism
explicit and is much easier to understand as a result.

Based on original work by Casey Leedom <leedom@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Use ACCES_ONCE macro to read queue's consumer index

Use helper macro ACCESS_ONCE() to load from the SGE status page
to prevent the compiler loading multiple times.

Based on original work by Mike Werner <werner@chelsio.com>

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4/cxgb4vf: update Kconfig file to include T6 adapter

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Align rest of the ethtool get stats

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: optimize XGE capability by reducing cpu usage

here is the patch raising the performance of XGE by:
1)changes the way page management method for enet momery, and
2)reduces the count of rmb, and
3)adds Memory prefetching

Signed-off-by: Kejian Yan <yankejian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock, cgroup: add sock->sk_cgroup

In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound.  As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.

net_cls and net_prio controllers are examples of the latter.  They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical.  Both inherit
configuration from the parent on creation but there's no interaction
afterwards.  An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level.  net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used.  Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid.  This is to avoid adding yet another cgroup related field
to struct sock.

As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead.  It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs.  Non-critical inaccuracies from small race windows won't
make any noticeable difference.

This patch doesn't make use of the pointer yet.  The following patch
will implement netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
    cgroup specific field.

v3: Add comments explaining why sock_data_prioidx() and
    sock_data_classid() use different fallback values.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct

Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
and its accessors are defined in cgroup-defs.h.  This is to prepare
for overloading the fields with a cgroup pointer.

This patch mostly performs equivalent conversions but the followings
are noteworthy.

* Equality test before updating classid is removed from
  sock_update_classid().  This shouldn't make any noticeable
  difference and a similar test will be implemented on the helper side
  later.

* sock_update_netprioidx() now takes struct sock_cgroup_data and can
  be moved to netprio_cgroup.h without causing include dependency
  loop.  Moved.

* The dummy version of sock_update_netprioidx() converted to a static
  inline function while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netprio_cgroup: limit the maximum css->id to USHRT_MAX

netprio builds per-netdev contiguous priomap array which is indexed by
css->id.  The array is allocated using kzalloc() effectively limiting
the maximum ID supported to some thousand range.  This patch caps the
maximum supported css->id to USHRT_MAX which should be way above what
is actually useable.

This allows reducing sock->sk_cgrp_prioidx to u16 from u32.  The freed
up part will be used to overload the cgroup related fields.
sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
two cgroup related fields are adjacent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-4.5-ancestor-test' of git://git./linux/kernel/git/tj/cgroup

Preparatory changes for some new socket cgroup infrastructure
and netfilter targets.

Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "Merge branch 'vsock-virtio'"

This reverts commit 0d76d6e8b2507983a2cae4c09880798079007421 and merge
commit c402293bd76fbc93e52ef8c0947ab81eea3ae019, reversing changes made
to c89359a42e2a49656451569c382eed63e781153c.

The virtio-vsock device specification is not finalized yet. Michael
Tsirkin voiced concerned about merging this code when the hardware
interface (and possibly the userspace interface) could still change.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sh_eth-optimize-mdio'

Sergei Shtylyov says:

====================
sh_eth: optimize MDIO code

Here's a set of 3 patches against DaveM's 'net-next.git' repo which
gets rid of ~35 LoCs in the MDIO bitbang methods.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: get rid of bb_{set|clr|read}()

After the MDIO bitbang code consolidation, there's no need anymore for
bb_{set|clr}() as well as bb_read() -- just expand them inline, thus
saving more LoCs...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: factor out common code from MDIO bitbang methods

sh_mm[cd]_ctrl() and sh_set_mdio() all look mostly the same -- factor out
their common code and put it into sh_mdio_ctrl().

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: remove mask fields from 'struct bb_info'

The MDIO control bits are always mapped to the same bits of the same register
(PIR), so there's no need to store their masks in the 'struct bb_info'...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: constify xgene_mac_ops and xgene_port_ops structures

The xgene_mac_ops and xgene_port_ops structures are never modified, so
declare them as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-for-davem-2015-12-07' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Vallo says:

====================
brcfmac

* support bcm4359 which can operate in two bands concurrently
* disable runtime pm for USB avoiding issues
* use generic pm callback in PCIe driver
* support wowlan wake indication reporting
* add beamforming support
* unified handling of firmware files

ath10k

* support Manegement Frame Protection (MFP)
* add thermal throttling support for 10.4 firmware
* add support for pktlog in QCA99X0
* add debugfs file to enable Bluetooth coexistence feature
* use firmware's native mesh interface type instead of raw mode

iwlwifi

* BT coex improvements
* D3 operation bugfixes
* rate control improvements
* firmware debugging infra improvements
* ground work for multi Rx
* various security fixes
====================

Conflicts:
drivers/net/wireless/ath/ath10k/pci.c

The conflict resolution at:

http://article.gmane.org/gmane.linux.kernel.next/37391

by Stephen Rothwell was used.

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix inverted test in __skb_recv_datagram

As the kernel generally uses negated error numbers, *err needs to be
compared with -EAGAIN (d'oh).

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ea3793ee29d3 ("core: enable more fine-grained datagram reception control")
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb3: Convert simple_strtoul to kstrtox

the simple_strtoul function is obsolete. This patch replace it by
kstrtox.

Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'more-dsa-unbinding-fixes'

Neil Armstrong says:

====================
Further fix for dsa unbinding

This series fixes further issues for DSA dynamic unbinding.
The first patch completely removes the PHY link state polling.
The two following cleans up the dsa state upon removal.
The last patch moves slave destroy code as slave function and
adds missing netdev and phy cleanup calls.

v1: http://lkml.kernel.org/r/562F8ECB.6050709@baylibre.com
v2: http://lkml.kernel.org/r/56321D9A.8010109@baylibre.com
remove phy fix and add missing calls in dsa_switch_destroy
then add dedicated dsa_slave_destroy

v3: remove polling instead of fixing it, make single patch for
dsa slave destroy
====================

Acked-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: move dsa slave destroy code to slave.c

Move dsa slave dedicated code from dsa_switch_destroy to a new
dsa_slave_destroy function in slave.c.
Add the netif_carrier_off and phy_disconnect calls in order to
correctly cleanup the netdev state and PHY state machine.

Signed-off-by: Frode Isaksen <fisaksen@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Add missing master netdev dev_put() calls

Upon probe failure or unbinding, add missing dev_put() calls.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: cleanup resources upon module removal

Make sure that we unassign the master_netdev dsa_ptr to make the packet
processing go through the regular Ethernet receive path.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: remove DSA link polling

Since no more DSA driver uses the polling callback, and since
the phylib handles the link detection, remove the link polling
work and timer code.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-davem-2015-12-07' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This pull request got a bit bigger than I wanted, due to
needing to reshuffle and fix some bugs. I merged mac80211
to get the right base for some of these changes.

* new mac80211 API for upcoming driver changes: EOSP handling,
key iteration
* scan abort changes allowing to cancel an ongoing scan
* VHT IBSS 80+80 MHz support
* re-enable full AP client state tracking after fixes
* various small fixes (that weren't relevant for mac80211)
* various cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'thunderx-cleanups'

Sunil Goutham says:

====================
net: thunderx: Miscellaneous cleanups

This patch series contains contains couple of cleanup patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net, thunderx: Remove unnecessary rcv buffer start address management

Since we have moved on to using allocated pages to carve receive
buffers instead of netdev_alloc_skb() there is no need to store
any pointers for later retrieval. Earlier we had to store
skb and skb->data pointers which later are used to handover
received packet to network stack.

This will avoid an unnecessary cache miss as well.

Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: nicvf_queues: nivc_*_intr: remove duplication

The same switch-case repeates for nivc_*_intr functions.
In this patch it is moved to a helper nicvf_int_type_to_mask().

By the way:
- Unneeded write to NICVF register dropped if int_type is unknown.
- netdev_dbg() is used instead of netdev_err().

Signed-off-by: Yury Norov <yury.norov@auriga.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com>
Acked-by: Vadim Lomovtsev <Vadim.Lomovtsev@caiumnetworks.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mac80211: handle HW ROC expired properly

In case of HW ROC, when the driver reports that the ROC expired,
it is not sufficient to purge the ROCs based on the remaining
time, as it possible that the device finished the ROC session
before the actual requested duration.

To handle such cases, in case of ROC expired notification from
the driver, complete all the ROCs which are marked with hw_begun,
regardless of the remaining duration.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

af_unix: fix unix_dgram_recvmsg entry locking

The current unix_dgram_recvsmg code acquires the u->readlock mutex in
order to protect access to the peek offset prior to calling
__skb_recv_datagram for actually receiving data. This implies that a
blocking reader will go to sleep with this mutex held if there's
presently no data to return to userspace. Two non-desirable side effects
of this are that a later non-blocking read call on the same socket will
block on the ->readlock mutex until the earlier blocking call releases it
(or the readers is interrupted) and that later blocking read calls
will wait longer than the effective socket read timeout says they
should: The timeout will only start 'ticking' once such a reader hits
the schedule_timeout in wait_for_more_packets (core.c) while the time it
already had to wait until it could acquire the mutex is unaccounted for.

The patch avoids both by using the __skb_try_recv_datagram and
__skb_wait_for_more packets functions created by the first patch to
implement a unix_dgram_recvmsg read loop which releases the readlock
mutex prior to going to sleep and reacquires it as needed
afterwards. Non-blocking readers will thus immediately return with
-EAGAIN if there's no data available regardless of any concurrent
blocking readers and all blocking readers will end up sleeping via
schedule_timeout, thus honouring the configured socket receive timeout.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

core: enable more fine-grained datagram reception control

The __skb_recv_datagram routine in core/ datagram.c provides a general
skb reception factility supposed to be utilized by protocol modules
providing datagram sockets. It encompasses both the actual recvmsg code
and a surrounding 'sleep until data is available' loop. This is
inconvenient if a protocol module has to use additional locking in order
to maintain some per-socket state the generic datagram socket code is
unaware of (as the af_unix code does). The patch below moves the recvmsg
proper code into a new __skb_try_recv_datagram routine which doesn't
sleep and renames wait_for_more_packets to
__skb_wait_for_more_packets, both routines being exported interfaces. The
original __skb_recv_datagram routine is reimplemented on top of these
two functions such that its user-visible behaviour remains unchanged.

Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

PHY: DP83867: Remove looking in parent device for OF properties

Device tree properties for a phy device are expected to be in the phy
node. The current code for the DP83867 also tries to look in the
parent node. The devices binding documentation does not mention this,
no current device tree file makes use of this, and it is not behaviour
we want. So remove looking in the parent device.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: cdc_ncm: add "ndp_to_end" sysfs attribute

Adding a writable sysfs attribute for the "NDP to end"
quirk flag.

This makes it easier for end users to test new devices for
this firmware bug. We've been lucky so far, but we should
not depend on reporters capable of rebuilding the driver.

Cc: Enrico Mioso <mrkiko.rs@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx4-HA-LAG-SRIOV-VF'

Or Gerlitz says:

====================
Add HA and LAG support for mlx4 SRIOV VFs

This series is built upon the code added in commit ce388ff "Merge branch
'mlx4-next'" which added HA and LAG support to mlx4 RoCE and SRIOV services.

We add HA and Link Aggregation support to single ported mlx4 Ethernet VFs.

In this case, the PF Ethernet interfaces are bonded, the VFs see single
port HW devices (already supported) -- however, now this port is highly
available. This means that all VF HW QPs (both VF Ethernet driver and VF
RoCE / RAW QPs) are subject to the V2P (Virtual-To-Physical) mapping which
is managed by the PF driver, and hence resilient across link failures and
such events.

When bonding operates in Dynamic link aggregation (802.3ad) mode, traffic
from each VF will go over the VF "base port" (the port this VF is assigned
to by the admin) as long as this port is up. When the port fails, traffic
from all VFs that are defined on this port will move to the other port, and
be back to their base-port when it recovers.

Moni and Or.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Support the HA mode for SRIOV VFs too

When the mlx4 driver runs in HA mode, and all VFs are single ported
ones, we make their single port Highly-Available.

This is done by taking advantage of the HA mode properties (following
bonding changes with programming the port V2P map, etc) and adding
the missing parts which are unique to SRIOV such as mirroring VF
steering rules on both ports.

Due to limits on the MAC and VLAN table this mode is enabled only when
number of total VFs is under 64.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

IB/mlx4: Use the VF base-port when demuxing mad from wire

Under HA mode, it's possible that the VF registered its GID
(and expects to get mads through the PV scheme) on a port which is
different from the one this mad arrived on, due to HA fail over.

Therefore, if the gid is not matched on the port that the packet arrived
on, check for a match on the other port if HA mode is active -- and if a
match is found on the other port, continue processing the mad using that
other port.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Keep VLAN/MAC tables mirrored in multifunc HA mode

Due to HW limitations, indexes to MAC and VLAN tables are always taken
from the table of the actual port. So, if a resource holds an index to
a table, it may refer to different values during the lifetime of the
resource, unless the tables are mirrored. Also, even when
driver is not in HA mode the policy of allocating an index to these
tables is such to make sure, as much as possible, that when the time
comes the mirroring will be successful. This means that in multifunction
mode the allocation of a free index in a port's table tries to make sure
that the same index in the other's port table is also free.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Support mirroring VF DMFS rules on both ports

Under HA mode, steering rules set by VFs should be mirrored on both
ports of the device so packets will be accepted no matter on which
port they arrived.

Since getting into HA mode is done dynamically when the user bonds mlx4
Ethernet netdevs, we keep hold of the VF DMFS rule mbox with the port
value flipped (1->2,2->1) and execute the mirroring when getting into
HA mode. Later, when going out of HA mode, we unset the mirrored rules.
In that context note that mirrored rules cannot be removed explicitly.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Use both physical ports to dispatch link state events to VF

Under HA mode, the link down event should be sent to VFs only if both
ports are down.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Use both physical ports to set the VF link state

In HA mode, the link state for VFs for which the policy is "auto"
(i.e. follow the physical link state) should be ORed from both ports.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>

VSOCK: fix returnvar.cocci warnings

Remove unneeded variable used to store return value.

Generated by: scripts/coccinelle/misc/returnvar.cocci

CC: Asias He <asias@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qmi_wwan: should hold RTNL while changing netdev type

The notifier calls were thrown in as a last-minute fix for an
imagined "this device could be part of a bridge" problem. That
revealed a certain lack of locking. Not to mention testing...

Avoid this splat:

RTNL: assertion failed at net/core/dev.c (1639)
CPU: 0 PID: 4293 Comm: bash Not tainted 4.4.0-rc3+ #358
Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
0000000000000000 ffff8800ad253d60 ffffffff8122f7cf ffff8800ad253d98
ffff8800ad253d88 ffffffff813833ab 0000000000000002 ffff880230f48560
ffff880230a12900 ffff8800ad253da0 ffffffff813833da 0000000000000002
Call Trace:
[<ffffffff8122f7cf>] dump_stack+0x4b/0x63
[<ffffffff813833ab>] call_netdevice_notifiers_info+0x3d/0x59
[<ffffffff813833da>] call_netdevice_notifiers+0x13/0x15
[<ffffffffa09be227>] raw_ip_store+0x81/0x193 [qmi_wwan]
[<ffffffff8131e149>] dev_attr_store+0x20/0x22
[<ffffffff811d858b>] sysfs_kf_write+0x49/0x50
[<ffffffff811d8027>] kernfs_fop_write+0x10a/0x151
[<ffffffff8117249a>] __vfs_write+0x26/0xa5
[<ffffffff81085ed4>] ? percpu_down_read+0x53/0x7f
[<ffffffff81174c9e>] ? __sb_start_write+0x5f/0xb0
[<ffffffff81174c9e>] ? __sb_start_write+0x5f/0xb0
[<ffffffff81172c37>] vfs_write+0xa3/0xe7
[<ffffffff811734ad>] SyS_write+0x50/0x7e
[<ffffffff8145c517>] entry_SYSCALL_64_fastpath+0x12/0x6f

Fixes: 32f7adf633b9 ("net: qmi_wwan: support "raw IP" mode")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "i40e: remove CONFIG_I40E_VXLAN"

This reverts commit 8fe269991aece394a7ed274f525d96c73f94109a.
The case where VXLAN is a module and i40e driver is inbuilt
will not be handled properly with this change since i40e
will have an undefined symbol vxlan_get_rx_port in it.

v2: Add a signed-off-by.

Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2015-12-05

This series contains updates to fm10k only.

Jacob provides the remaining fm10k patches in the series.  First change
ensures that all the logic regarding the setting of netdev features is
consolidated in one place of the driver.  Fixed an issue where an assumption
was being made on how many queues are available, especially when init_hw_vf()
errors out.  Fixed up an number of issues with init_hw() where failures
were not being handled properly or at all, so update the driver to check
returned error codes and respond appropriately.  Fixed up typecasting
issues found where either the incorrect typecast size was used or
explicitly typecast values.  Added additional debugging statistics and
rename statistic to better reflect its true value.  Added support for
ITR scaling based on PCIe link speed for fm10k.  Fixed up code comment
where "hardware" was misspelled.

v2: Dropped patches #1 and #10 from original submission, patch #1 was from
    Nick Krause and due to his past kernel interactions, dropping his patch.
    Patch #10 had questions and concerns from Tom Herbert which cannot be
    addressed at this time since the author (Jacob Keller) is currently on
    sabbatical, so dropping this patch for now until we can properly address
    Tom's questions and concerns.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

fm10k: TRIVIAL cleanup order at top of fm10k_xmit_frame

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: TRIVIAL fix typo of hardware

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: change default Tx ITR to 25usec

The current default ITR for Tx is overly restrictive. Using a simple
netperf TCP_STREAM test, we top out at about 10Gb/s for a single thread
when running using 1500 byte frames. By reducing the ITR value to 25usec
(up to 40K interrupts a second from 10K), we are able to achieve 36Gb/s
for a single thread TCP stream test.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: use macro for default Tx and Rx ITR values

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: Update adaptive ITR algorithm

The existing adaptive ITR algorithm is overly restrictive. It throttles
incorrectly for various traffic rates, and does not produce good
performance. The algorithm now allows for more interrupts per second,
and does some calculation to help improve for smaller packet loads. In
addition, take into account the new itr_scale from the hardware which
indicates how much to scale due to PCIe link speed.

Reported-by: Matthew Vick <matthew.vick@intel.com>
Reported-by: Alex Duyck <alexander.duyck@gmail.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: introduce ITR_IS_ADAPTIVE macro

Define a macro for identifying when the itr value is dynamic or
adaptive. The concept was taken from i40e. This helps make clear what
the check is, and reduces the line length to something more reasonable
in a few places.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: Add support for ITR scaling based on PCIe link speed

The Intel Ethernet Switch FM10000 Host Interface interrupt throttle
timers are based on the PCIe link speed. Because of this, the value
being programmed into the ITR registers must be scaled accordingly.

For the PF, this is as simple as reading the PCIe link speed and storing
the result. However, in the case of SR-IOV, the VF's interrupt throttle
timers are based on the link speed of the PF. However, the VF is unable
to get the link speed information from its configuration space, so the
PF must inform it of what scale to use.

Rather than pass this scale via mailbox message, take advantage of
unused bits in the TDLEN register to pass the scale. It is the
responsibility of the PF to program this for the VF while setting up the
VF queues and the responsibility of the VF to get the information
accordingly. This is preferable because it allows the VF to set up the
interrupts properly during initialization and matches how the MAC
address is passed in the TDBAL/TDBAH registers.

Since we're modifying fm10k_type.h, we may as well also update the
copyright year.

Reported-by: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: rename mbx_tx_oversized statistic to mbx_tx_dropped

Originally this statistic was renamed because the method of dropping was
called "drop_oversized_messages", but this logic has changed much, and
this counter does actually represent messages which we failed to
transmit for a number of reasons. Rename the counter back to tx_dropped
since this is when it will increment, and it is less confusing.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: add statistics for actual DWORD count of mbmem mailbox

A previous bug was uncovered by addition of a debug stat to indicate the
actual number of DWORDS we pulled from the mbmem. It turned out this was
not the same as the tx_dwords counter. While the previous bug fix should
have corrected this in all cases, add some debug stats that count the
number of DWORDs pushed or pulled from the mbmem. A future debugger may
take advantage of this statistic for debugging purposes. Since we're
modifying fm10k_mbx.h, update the copyright year as well.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: explicitly typecast vlan values to u16

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: Correct typecast in fm10k_update_xc_addr_pf

Since the resultant data type of the mac_update.mac_upper field is u16,
it does not make sense to typecast u8 variables to u32 first. Since
we're modifying fm10k_pf.c, also update the copyright year.

Reported-by: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: reinitialize queuing scheme after calling init_hw

The init_hw function may fail, and in the case of VFs, it might change
the number of maximum queues available. Thus, for every flow which
checks init_hw, we need to ensure that we clear the queue scheme before,
and initialize it after. The fm10k_io_slot_reset path will end up
triggering a reset so fm10k_reinit needs this change. The
fm10k_io_error_detected and fm10k_io_resume also need to properly clear
and reinitialize the queue scheme.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: always check init_hw for errors

A recent change modified init_hw in some flows the function may fail on
VF devices. For example, if a VF doesn't yet own its own queues.
However, many callers of init_hw didn't bother to check the error code.
Other callers checked but only displayed diagnostic messages without
actually handling the consequences.

Fix this by (a) always returning and preventing the netdevice from going
up, and (b) printing the diagnostic in every flow for consistency. This
should resolve an issue where VF drivers would attempt to come up
before the PF has finished assigning queues.

In addition, change the dmesg output to explicitly show the actual
function that failed, instead of combining reset_hw and init_hw into a
single check, to help for future debugging.

Fixes: 1d568b0f6424 ("fm10k: do not assume VF always has 1 queue")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: reset max_queues on init_hw_vf failure

VF drivers must detect how many queues are available. Previously, the
driver assumed that each VF has at minimum 1 queue. This assumption is
incorrect, since it is possible that the PF has not yet assigned the
queues to the VF by the time the VF checks. To resolve this, we added a
check first to ensure that the first queue is infact owned by the VF at
init_hw_vf time. However, the code flow did not reset hw->mac.max_queues
to 0. In some cases, such as during reinit flows, we call init_hw_vf
without clearing the previous value of hw->mac.max_queues. Due to this,
when init_hw_vf errors out, if its error code is not properly handled
the VF driver may still believe it has queues which no longer belong to
it. Fix this by clearing the hw->mac.max_queues on exit due to errors.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: set netdev features in one location

Don't change netdev hw_features later in fm10k_probe, instead set all
values inside fm10k_alloc_netdev. To do so, we need to know the MAC type
(whether it is PF or VF) in order to determine what to do. This helps
ensure that all logic regarding features is co-located.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'renesas-read-mac'

Sergei Shtylyov says:

====================
Renesas: read MAC address registers only once

Here's 2 patches against DaveM's 'net-next.git' repo. Here we optimize
the MAC address register reading (left over from a bootloader).

[1/2] ravb: read MAC address registers only once
[2/2] sh_eth: read MAC address registers only once
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: read MAC address registers only once

The code reading the MAHR/MALR registers in read_mac_address() is terribly
ineffective -- it reads MAHR 4 times and MALR 2 times, while it's enough to
read each register only once. Use the local variables to achieve that,
somewhat beautifying the code while at it...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: read MAC address registers only once

The code reading the MAHR/MALR registers in ravb_read_mac_address() is
terribly ineffective -- it reads MAHR 4 times and MALR 2 times, while
it's enough to read each register only once. Use the local variables to
achieve that, somewhat beautifying the code while at it...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnx2x'

Michal Schmidt says:

====================
bnx2x: fewer error messages, simplification

This removes one redundant error message in bnx2x and changes another one to
WARN_ONCE. The third patch is a small simplification in ethtool stats.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: simplify distinction between port and func stats

The 'flags' field in bnx2x_stats_arr[] serves only one purpose - to tell
us if the statistic is a per-port stat and thus should not be shown for
virtual functions. It's strange that the field can have three different
values. A boolean will do just fine.

Also remove IS_FUNC_STAT(). It was used only once and it's in fact just
a negation of IS_PORT_STAT().

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: change FW GRO error message to WARN_ONCE

It's supposed to be impossible for TPA to give us anything else
than IPv4 or IPv6 here. But in case there is a way to reach this error
by some strange received frames, we don't want to flood the kernel log.
WARN_ONCE is better for this purpose.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: drop redundant error message about allocation failure

alloc_pages() already prints a warning when it fails. No need to emit
another message. Certainly not at KERN_ERR level, because it is no big
deal if this GFP_ATOMIC allocation fails occasionally.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: constify netif_is_* helpers net_device param

As suggested by Eric, these helpers should have const dev param.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>