openwrt/staging/blogic.git
12 years agoethtool: Define and apply a default policy for RX flow hash indirection
Ben Hutchings [Thu, 15 Dec 2011 13:56:49 +0000 (13:56 +0000)]
ethtool: Define and apply a default policy for RX flow hash indirection

All drivers that support modification of the RX flow hash indirection
table initialise it in the same way: RX rings are assigned to table
entries in rotation.  Make that default policy explicit by having them
call a ethtool_rxfh_indir_default() function.

In the ethtool core, add support for a zero size value for
ETHTOOL_SRXFHINDIR, which resets the table to this default.

Partly-suggested-by: Matt Carlson <mcarlson@broadcom.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoethtool: Centralise validation of ETHTOOL_{G, S}RXFHINDIR parameters
Ben Hutchings [Thu, 15 Dec 2011 13:55:01 +0000 (13:55 +0000)]
ethtool: Centralise validation of ETHTOOL_{G, S}RXFHINDIR parameters

Add a new ethtool operation (get_rxfh_indir_size) to get the
indirectional table size.  Use this to validate the user buffer size
before calling get_rxfh_indir or set_rxfh_indir.  Use get_rxnfc to get
the number of RX rings, and validate the contents of the new
indirection table before calling set_rxfh_indir.  Remove this
validation from drivers.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoethtool: Clarify use of size field for ETHTOOL_GRXFHINDIR
Ben Hutchings [Thu, 15 Dec 2011 13:51:16 +0000 (13:51 +0000)]
ethtool: Clarify use of size field for ETHTOOL_GRXFHINDIR

In order to find out the device's RX flow hash table size, ethtool
initially uses ETHTOOL_GRXFHINDIR with a buffer size of zero.  This
must be supported, but it is not necessary to support any other user
buffer size less than the device table size.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Write it into kbuild
Pavel Emelyanov [Thu, 15 Dec 2011 02:46:50 +0000 (02:46 +0000)]
unix_diag: Write it into kbuild

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Receive queue lenght NLA
Pavel Emelyanov [Thu, 15 Dec 2011 02:46:31 +0000 (02:46 +0000)]
unix_diag: Receive queue lenght NLA

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Pending connections IDs NLA
Pavel Emelyanov [Thu, 15 Dec 2011 02:46:14 +0000 (02:46 +0000)]
unix_diag: Pending connections IDs NLA

When establishing a unix connection on stream sockets the
server end receives an skb with socket in its receive queue.

Report who is waiting for these ends to be accepted for
listening sockets via NLA.

There's a lokcing issue with this -- the unix sk state lock is
required to access the peer, and it is taken under the listening
sk's queue lock. Strictly speaking the queue lock should be taken
inside the state lock, but since in this case these two sockets
are different it shouldn't lead to deadlock.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Unix peer inode NLA
Pavel Emelyanov [Thu, 15 Dec 2011 02:45:58 +0000 (02:45 +0000)]
unix_diag: Unix peer inode NLA

Report the peer socket inode ID as NLA. With this it's finally
possible to find out the other end of an interesting unix connection.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Unix inode info NLA
Pavel Emelyanov [Thu, 15 Dec 2011 02:45:43 +0000 (02:45 +0000)]
unix_diag: Unix inode info NLA

Actually, the socket path if it's not anonymous doesn't give
a clue to which file the socket is bound to. Even if the path
is absolute, it can be unlinked and then new socket can be
bound to it.

With this NLA it's possible to check which file a particular
socket is really bound to.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Unix socket name NLA
Pavel Emelyanov [Thu, 15 Dec 2011 02:45:24 +0000 (02:45 +0000)]
unix_diag: Unix socket name NLA

Report the sun_path when requested as NLA. With leading '\0' if
present but without the leading AF_UNIX bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Dumping exact socket core
Pavel Emelyanov [Thu, 15 Dec 2011 02:45:07 +0000 (02:45 +0000)]
unix_diag: Dumping exact socket core

The socket inode is used as a key for lookup. This is effectively
the only really unique ID of a unix socket, but using this for
search currently has one problem -- it is O(number of sockets) :(

Does it worth fixing this lookup or inventing some other ID for
unix sockets?

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Dumping all sockets core
Pavel Emelyanov [Thu, 15 Dec 2011 02:44:52 +0000 (02:44 +0000)]
unix_diag: Dumping all sockets core

Walk the unix sockets table and fill the core response structure,
which includes type, state and inode.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agounix_diag: Basic module skeleton
Pavel Emelyanov [Thu, 15 Dec 2011 02:44:35 +0000 (02:44 +0000)]
unix_diag: Basic module skeleton

Includes basic module_init/_exit functionality, dump/get_exact stubs
and declares the basic API structures for request and response.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoaf_unix: Export stuff required for diag module
Pavel Emelyanov [Thu, 15 Dec 2011 02:44:03 +0000 (02:44 +0000)]
af_unix: Export stuff required for diag module

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agosock_diag: Generalize requests cookies managements
Pavel Emelyanov [Thu, 15 Dec 2011 02:43:44 +0000 (02:43 +0000)]
sock_diag: Generalize requests cookies managements

The sk address is used as a cookie between dump/get_exact calls.
It will be required for unix socket sdumping, so move it from
inet_diag to sock_diag.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agosock_diag: Fix module netlink aliases
Pavel Emelyanov [Thu, 15 Dec 2011 02:43:27 +0000 (02:43 +0000)]
sock_diag: Fix module netlink aliases

I've made a mistake when fixing the sock_/inet_diag aliases :(

1. The sock_diag layer should request the family-based alias,
   not just the IPPROTO_IP one;
2. The inet_diag layer should request for AF_INET+protocol alias,
   not just the protocol one.

Thus fix this.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agosock_diag: Move the SOCK_DIAG_BY_FAMILY cmd declaration
Pavel Emelyanov [Thu, 15 Dec 2011 02:42:42 +0000 (02:42 +0000)]
sock_diag: Move the SOCK_DIAG_BY_FAMILY cmd declaration

It should belong to sock_diag, not inet_diag.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Fri, 16 Dec 2011 07:11:14 +0000 (02:11 -0500)]
Merge git://git./linux/kernel/git/davem/net

Conflicts:
drivers/net/ethernet/freescale/fsl_pq_mdio.c
net/batman-adv/translation-table.c
net/ipv6/route.c

12 years agotg3: Break out RSS indir table init and assignment
Matt Carlson [Wed, 14 Dec 2011 11:10:01 +0000 (11:10 +0000)]
tg3: Break out RSS indir table init and assignment

This patch creates a new device member to hold the RSS indirection table
and separates out the code that initializes the table from the code that
programs the table into device registers.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotg3: Use mii_advertise_flowctrl
Matt Carlson [Wed, 14 Dec 2011 11:10:00 +0000 (11:10 +0000)]
tg3: Use mii_advertise_flowctrl

This patch replaces tg3's internal tg3_advert_flowctrl_1000T function
with mii_advertise_flowctrl provided by the kernel headers.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotg3: Add 57766 ASIC rev support
Matt Carlson [Wed, 14 Dec 2011 11:09:59 +0000 (11:09 +0000)]
tg3: Add 57766 ASIC rev support

This patch adds support for the 57766 ASIC revision.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotg3: Make the TX BD DMA limit configurable
Matt Carlson [Wed, 14 Dec 2011 11:09:58 +0000 (11:09 +0000)]
tg3: Make the TX BD DMA limit configurable

The 57766 ASIC rev will impose a new TX BD DMA limit on the driver.
This patch prepares for 57766 support by making the tx BD DMA limit
tunable.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotg3: Enable EEE support for capable 10/100 devs
Matt Carlson [Wed, 14 Dec 2011 11:09:57 +0000 (11:09 +0000)]
tg3: Enable EEE support for capable 10/100 devs

There are some devices in the 57765 ASIC rev that are EEE capable.
Unfortunately the EEE setup code only gets executed if the device is
gigabit capable.  This patch fixes the problem.

Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotcp_memcontrol: fix reversed if condition
Dan Carpenter [Thu, 15 Dec 2011 01:05:10 +0000 (01:05 +0000)]
tcp_memcontrol: fix reversed if condition

We should only dereference the pointer if it's valid, not the other way
round.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoMove limit definitions outside CONFIG_INET
Glauber Costa [Wed, 14 Dec 2011 23:34:31 +0000 (23:34 +0000)]
Move limit definitions outside CONFIG_INET

They need to be available for other protocols as well, since
they are used in sock.c openly

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet: remove rcu protection on tw_net
Eric Dumazet [Wed, 14 Dec 2011 04:58:33 +0000 (04:58 +0000)]
inet: remove rcu protection on tw_net

commit b099ce2602d806 (net: Batch inet_twsk_purge) added rcu protection
on tw_net for no obvious reason.

struct net are refcounted anyway since timewait sockets escape from rcu
protected sections. tw_net stay valid for the whole timwait lifetime.

This also removes a lot of sparse errors.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agonet: ping: remove some sparse errors
Eric Dumazet [Wed, 14 Dec 2011 03:51:28 +0000 (03:51 +0000)]
net: ping: remove some sparse errors

net/ipv4/sysctl_net_ipv4.c:78:6: warning: symbol 'inet_get_ping_group_range_table'
was not declared. Should it be static?

net/ipv4/sysctl_net_ipv4.c:119:31: warning: incorrect type in argument 2
(different signedness)
net/ipv4/sysctl_net_ipv4.c:119:31: expected int *range
net/ipv4/sysctl_net_ipv4.c:119:31: got unsigned int *<noident>

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agocls_flow: remove one dynamic array
Eric Dumazet [Wed, 14 Dec 2011 02:30:00 +0000 (02:30 +0000)]
cls_flow: remove one dynamic array

Its better to use a predefined size for this small automatic variable.

Removes a sparse error as well :

net/sched/cls_flow.c:288:13: error: bad constant expression

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobnx2x: handle vpd data longer than 128 bytes
Barak Witkowski [Wed, 14 Dec 2011 00:14:53 +0000 (00:14 +0000)]
bnx2x: handle vpd data longer than 128 bytes

Signed-off-by: Barak Witkowski <barak@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agovlan: static functions
Eric Dumazet [Tue, 13 Dec 2011 20:57:54 +0000 (20:57 +0000)]
vlan: static functions

commit 6d4cdf47d2 (vlan: add 802.1q netpoll support) forgot to declare
as static some private functions.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agovlan: add rtnl_dereference() annotations
Dan Carpenter [Tue, 13 Dec 2011 20:29:43 +0000 (20:29 +0000)]
vlan: add rtnl_dereference() annotations

The original code generates a Sparse warning:
net/8021q/vlan_core.c:336:9:
error: incompatible types in comparison expression (different address spaces)

It's ok to dereference __rcu pointers here because we are holding the
RTNL lock.  I've added some calls to rtnl_dereference() to silence the
warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agortnetlink: rtnl_link_register() sanity test
Eric Dumazet [Tue, 13 Dec 2011 11:38:00 +0000 (11:38 +0000)]
rtnetlink: rtnl_link_register() sanity test

Before adding a struct rtnl_link_ops into link_ops list, check it doesnt
clash with a prior one.

Based on a previous patch from Alexander Smirnov

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoipv6: Check dest prefix length on original route not copied one in rt6_alloc_cow().
David S. Miller [Tue, 13 Dec 2011 22:35:06 +0000 (17:35 -0500)]
ipv6: Check dest prefix length on original route not copied one in rt6_alloc_cow().

After commit 8e2ec639173f325977818c45011ee176ef2b11f6 ("ipv6: don't
use inetpeer to store metrics for routes.") the test in rt6_alloc_cow()
for setting the ANYCAST flag is now wrong.

'rt' will always now have a plen of 128, because it is set explicitly
to 128 by ip6_rt_copy.

So to restore the semantics of the test, check the destination prefix
length of 'ort'.

Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoipv6: If neigh lookup fails during icmp6 dst allocation, propagate error.
David S. Miller [Tue, 13 Dec 2011 21:48:21 +0000 (16:48 -0500)]
ipv6: If neigh lookup fails during icmp6 dst allocation, propagate error.

Don't just succeed with a route that has a NULL neighbour attached.
This follows the behavior of addrconf_dst_alloc().

Allowing this kind of route to end up with a NULL neigh attached will
result in packet drops on output until the route is somehow
invalidated, since nothing will meanwhile try to lookup the neigh
again.

A statistic is bumped for the case where we see a neigh-less route on
output, but the resulting packet drop is otherwise silent in nature,
and frankly it's a hard error for this to happen and ipv6 should do
what ipv4 does which is say something in the kernel logs.

Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agonet: Remove unused neighbour layer ops.
David S. Miller [Tue, 13 Dec 2011 21:44:22 +0000 (16:44 -0500)]
net: Remove unused neighbour layer ops.

It's simpler to just keep these things out until there is a real user
of them, so we can see what the needs actually are, rather than keep
these things around as useless overhead.

Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_en: updated driver version to 2.0
Yevgeny Petrilin [Tue, 13 Dec 2011 04:19:34 +0000 (04:19 +0000)]
mlx4_en: updated driver version to 2.0

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: updated driver version to 1.1
Yevgeny Petrilin [Tue, 13 Dec 2011 04:18:45 +0000 (04:18 +0000)]
mlx4_core: updated driver version to 1.1

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: Modify driver initialization flow to accommodate SRIOV for Ethernet
Jack Morgenstein [Tue, 13 Dec 2011 04:18:30 +0000 (04:18 +0000)]
mlx4_core: Modify driver initialization flow to accommodate SRIOV for Ethernet

1. Added module parameters sr_iov and probe_vf for controlling enablement of
   SRIOV mode.
2. Increased default max num-qps, num-mpts and log_num_macs to accomodate
   SRIOV mode
3. Added port_type_array as a module parameter to allow driver startup with
   ports configured as desired.
   In SRIOV mode, only ETH is supported, and this array is ignored; otherwise,
   for the case where the FW supports both port types (ETH and IB), the
   port_type_array parameter is used.
   By default, the port_type_array is set to configure both ports as IB.
4. When running in sriov mode, the master needs to initialize the ICM eq table
   to hold the eq's for itself and also for all the slaves.
5. mlx4_set_port_mask() now invoked from mlx4_init_hca, instead of in mlx4_dev_cap.
6. Introduced sriov VF (slave) device startup/teardown logic (mainly procedures
   mlx4_init_slave, mlx4_slave_exit, mlx4_slave_cap, mlx4_slave_exit and flow
   modifications in __mlx4_init_one, mlx4_init_hca, and mlx4_setup_hca).
   VFs obtain their startup information from the PF (master) device via the
   comm channel.
7. In SRIOV mode (both PF and VF), MSI_X must be enabled, or the driver
   aborts loading the device.
8. Do not allow setting port type via sysfs when running in SRIOV mode.
9. mlx4_get_ownership:  Currently, only one PF is supported by the driver.
   If the HCA is burned with FW which enables more than one PF, only one
   of the PFs is allowed to run.  The first one up grabs a FW ownership
   semaphone -- all other PFs will find that semaphore taken, and the
   driver will not allow them to run.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Liran Liss <liranl@mellanox.co.il>
Signed-off-by: Marcel Apfelbaum <marcela@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: adjust catas operation for SRIOV mode
Jack Morgenstein [Tue, 13 Dec 2011 04:17:16 +0000 (04:17 +0000)]
mlx4_core: adjust catas operation for SRIOV mode

When running in SRIOV mode, driver should not automatically start/stop
the mlx4_core upon sensing an HCA internal error -- doing this disables/enables
sriov, which will cause the hypervisor to hang if there are running VMs with
attached VFs.

In addition, on VMs the catas process should not run at all, since the HCA
error buffer is not available to VMs in the BARs.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: mtts resources units changed to offset
Marcel Apfelbaum [Tue, 13 Dec 2011 04:16:56 +0000 (04:16 +0000)]
mlx4_core: mtts resources units changed to offset

In the previous implementation mtts are managed by:
1. order     - log(mtt segments), 'mtt segment' groups several mtts together.
2. first_seg - segment location relative to mtt table.
In the current implementation:
1. order     - log(mtts) rather than segments
2. offset    - mtt index in mtt table

Note: The actual mtt allocation is made in segments but it is
      transparent to callers.

Rational: The mtt resource holders are not interested on how the allocation
          of mtt is done, but rather on how they will use it.

Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_en: Allow communication between functions on same host
Eugenia Emantayev [Tue, 13 Dec 2011 04:16:38 +0000 (04:16 +0000)]
mlx4_en: Allow communication between functions on same host

To enable internal loopback, always fill DMAC in control segment
when transmitting the packet, once this is done, the packet is subject
for loopback for if the DMAC mathces one of the multicast/unicast addresses
registered on the physical port.
In receive path if source MAC is our own MAC and we are not in selftest,
or not in force LB mode - drop this packet.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4: Ethernet port management modifications
Eugenia Emantayev [Tue, 13 Dec 2011 04:16:21 +0000 (04:16 +0000)]
mlx4: Ethernet port management modifications

The physical port is now common to the PF and VFs.
The port resources and configuration is managed by the PF, VFs can
only influence the MTU of the port, it is set as max among all functions,
Each function allocates RX buffers of required size to meet it's MTU enforcement.
Port management code was moved to mlx4_core, as the mlx4_en module is
virtualization unaware

Move handling qp functionality to mlx4_get_eth_qp/mlx4_put_eth_qp
including reserve/release range and add/release unicast steering.
Let mlx4_register/unregister_mac deal only with MAC (un)registration.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4: Traffic steering management support for SRIOV
Eugenia Emantayev [Tue, 13 Dec 2011 04:16:02 +0000 (04:16 +0000)]
mlx4: Traffic steering management support for SRIOV

Let multicast/unicast attaching flow go through resource tracker.
The PF is the one responsible for managing all the steering entries.
Define and use module parameter that determines the number of qps
per multicast group.
Minor changes in function calls according to changed prototype.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_ib: disable SRIOV mode for IB ports (not yet supported)
Jack Morgenstein [Tue, 13 Dec 2011 04:15:46 +0000 (04:15 +0000)]
mlx4_ib: disable SRIOV mode for IB ports (not yet supported)

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: resource tracking for HCA resources used by guests
Eli Cohen [Tue, 13 Dec 2011 04:15:24 +0000 (04:15 +0000)]
mlx4_core: resource tracking for HCA resources used by guests

The resource tracker is used to track usage of HCA resources by the different
guests.

Virtual functions (VFs) are attached to guest operating systems but
resources are allocated from the same pool and are assigned to VFs. It is
essential that hostile/buggy guests not be able to affect the operation of
other VFs, possibly attached to other guest OSs since ConnectX firmware is not
tolerant to misuse of resources.

The resource tracker module associates each resource with a VF and maintains
state information for the allocated object. It also defines allowed state
transitions and enforces them.

Relationships between resources are also referred to. For example, CQs are
pointed to by QPs, so it is forbidden to destroy a CQ if a QP refers to it.

ICM memory is always accessible through the primary function and hence it is
allocated by the owner of the primary function.

When a guest dies, an FLR is generated for all the VFs it owns and all the
resources it used are freed.

The tracked resource types are: QPs, CQs, SRQs, MPTs, MTTs, MACs, RES_EQs,
and XRCDNs.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: Add wrapper functions and comm channel and slave event support to EQs
Jack Morgenstein [Tue, 13 Dec 2011 04:13:58 +0000 (04:13 +0000)]
mlx4_core: Add wrapper functions and comm channel and slave event support to EQs

Passing async events to slaves:
In SRIOV mode, each slave creates its own async EQ, but only the master can
register directly with the FW to receive async events.  Async events which
should be passed to slaves (such as a WQ_ACCESS_ERROR for a QP owned by a slave)
are generated at the slave by the master using the GEN_EQE FW command.

Wrapper functions: mlx4_MAP_EQ_wrapper
Only the master can map an EQ. The slave commands to map their EQs arrive
at the master via the comm channel.  The master then invokes the wrapper
function to do the work (and enter the resource in the tracking database).

New events: COMM_CHANNEL and FLR
The COMM_CHANNEL event arrives only at the master, and signals that
a slave has posted a command on the comm channel.
The FLR event is generated by the FW when a guest operating a VF
unexpectedly goes down.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: mtt modifications for SRIOV
Jack Morgenstein [Tue, 13 Dec 2011 04:13:48 +0000 (04:13 +0000)]
mlx4_core: mtt modifications for SRIOV

MTTs are resources which are allocated and tracked by the PF driver.
In multifunction mode, the allocation and icm mapping is done in
the resource tracker (later patch in this sequence).

To accomplish this, we have "work" functions whose names start with
"__", and "request" functions (same name, no __). If we are operating
in multifunction mode, the request function actually results in
comm-channel commands being sent (ALLOC_RES or FREE_RES).
The PF-driver comm-channel handler will ultimately invoke the
"work" (__) function and return the result.

If we are not in multifunction mode, the "work" handler is invoked
immediately.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: cq modifications for SRIOV
Jack Morgenstein [Tue, 13 Dec 2011 04:13:36 +0000 (04:13 +0000)]
mlx4_core: cq modifications for SRIOV

CQs are resources which are allocated and tracked by the PF driver.
In multifunction mode, the allocation and icm mapping is done in
the resource tracker (later patch in this sequence).

To accomplish this, we have "work" functions whose names start with
"__", and "request" functions (same name, no __). If we are operating
in multifunction mode, the request function actually results in
comm-channel commands being sent (ALLOC_RES or FREE_RES).
The PF-driver comm-channel handler will ultimately invoke the
"work" (__) function and return the result.

If we are not in multifunction mode, the "work" handler is invoked
immediately.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: qp modifications for SRIOV
Jack Morgenstein [Tue, 13 Dec 2011 04:13:22 +0000 (04:13 +0000)]
mlx4_core: qp modifications for SRIOV

QPs are resources which are allocated and tracked by the PF driver.
In multifunction mode, the allocation and icm mapping is done in
the resource tracker (later patch in this sequence).

To accomplish this, we have "work" functions whose names start with
"__", and "request" functions (same name, no __). If we are operating
in multifunction mode, the request function actually results in
comm-channel commands being sent (ALLOC_RES or FREE_RES).
The PF-driver comm-channel handler will ultimately invoke the
"work" (__) function and return the result.

If we are not in multifunction mode, the "work" handler is invoked
immediately.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: srq modifications for SRIOV
Jack Morgenstein [Tue, 13 Dec 2011 04:13:05 +0000 (04:13 +0000)]
mlx4_core: srq modifications for SRIOV

SRQs are resources which are allocated and tracked by the PF driver.
In multifunction mode, the allocation and icm mapping is done in
the resource tracker (later patch in this sequence).

To accomplish this, we have "work" functions whose names start with
"__", and "request" functions (same name, no __). If we are operating
in multifunction mode, the request function actually results in
comm-channel commands being sent (ALLOC_RES or FREE_RES).
The PF-driver comm-channel handler will ultimately invoke the
"work" (__) function and return the result.

If we are not in multifunction mode, the "work" handler is invoked
immediately.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: Added FW commands and their wrappers for supporting SRIOV
Marcel Apfelbaum [Tue, 13 Dec 2011 04:12:40 +0000 (04:12 +0000)]
mlx4_core: Added FW commands and their wrappers for supporting SRIOV

The following commands are added here:
1. QUERY_FUNC_CAP and its wrapper.  This function is used by VFs when
   they start up to receive configuration information from the PF, such
   as resource quotas for this VF, which ports should be used (currently
   two), what protocol is running on the port (currently Ethernet ONLY,
   or port not active).

2. QUERY_PORT and its wrapper. Previously, this FW command was invoked directly
   by the ETH driver (en_port.c) using mlx4_cmd_box. Virtualization is now
   required here (the VF's MAC address must be substituted for the PFs
   MAC address returned by the FW). We changed the invocation
   in the ETH driver to use mlx4_QUERY_PORT, and added the wrapper.

3. QUERY_HCA. Used by the VF to determine how the HCA was initialized.
   For now, we need only the multicast table member entry size
   (log2_mc_table_entry_sz, in the ConnectX PRM).  No wrapper is needed
   here, because the data may be passed as is to the VF without modification).

   In this command, we have added a GLOBAL_CAPS field for passing required
   configuration information from FW to a VF (this field is to allow safely
                   adding new SRIOV capabilities which require support in VF drivers, too).
   Bits will set here by FW in response to PF-driver configuration commands which
   will activate as yet undefined new SRIOV features. The VF will test to see that
   all required capabilities indicated by this field are supported (i.e., if a bit
   is set and the VF driver does not recognize that bit, it must abort
   its initialization).  Currently, no bits are set.

4. Added a CLOSE_PORT wrapper.  The PF context needs to keep track of how many VF contexts
   have the port open.  The PF context will not actually issue the FW close port command
   until the last port user issues a CLOSE_PORT request.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Marcel Apfelbaum <marcela@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agonet/mlx4_core: Implement the master-slave communication channel
Yevgeny Petrilin [Tue, 13 Dec 2011 04:12:25 +0000 (04:12 +0000)]
net/mlx4_core: Implement the master-slave communication channel

When SRIOV is enabled, pf and vfs communicate via shared comm channel.
The vf gets its side of the comm channel via a VF BAR.
Each VF (slave) creates its vHCR (virtual HCA Command Register),
Its DMA address is passed to the PF (master) using Communication Channel Register.
The same Register is used to notify the master of commands posted by the
slaves and for the master to pass events to the slaves, such as command completions
and asynchronous events.

The vHCR format is identical to the HCR format, except for the 'go' and 't' bits,
which are reserved in the vHCR. Posting commands to the vHCR is identical to
the way it is done with the HCR, albeit that the function/PF token fields are
used instead of the HCR go bit.
Specifically:
- When the function prepares a new command in the vHCR, it issues the Post_vHCR_cmd
  communication channel command and toggles the value of the function token;
  when PF token has an equal value, the command has been accepted and a new command may be posted.
- When the PF detects a Post_vHCR_cmd command, it concludes that a new command is available in the vHCR;
  after processing the command, the PF toggles the PF token to match the function token.

When the 'e' bit is not set, the completion of a Post_vHCR_cmd command also indicates
the completion the vHCR command. If, however, the 'e' bit is set, the completion of a
Post_vHCR_cmd command only indicates that the vHCR command has been accepted for execution by the PF.

Function commands are processed by the PF as follows:
-DMA (using the ACCESS_MEM command) the vHCR image into a shadow buffer.
-Validate that the opcode is non-privileged, and that the opcode- and input-modifiers are legal.
-DMA the in-box (if required) into a shadow buffer.
-Validate the command:
o Resource ranges (e.g., QP ranges).
o Partition key.
o Ranges of referenced resources (e.g., CQs within QP contexts).
-If the 'e' bit is set
o complete the Post_vHCR_cmd command
-Execute the command on the HCR.
-DMA the results to the vHCR out-box (if required).
-If the 'e' bit is set
o Indicate command completion by generating a completion event using the GEN_EQE command
-Otherwise
o DMA the command status to the vHCR
o Complete the Post_vHCR_cmd command

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrillin <yevgenyp@mellanox.com>
Signed-off-by: Liran Liss <liranl@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: Reduce number of PD bits to 17
Jack Morgenstein [Tue, 13 Dec 2011 04:12:13 +0000 (04:12 +0000)]
mlx4_core: Reduce number of PD bits to 17

When SRIOV is enabled on the chip (at FW burning time),
the HCA uses only 17 bits for the PD. The remaining 7 high-order bits
are ignored.

Change the allocator to return only 17 bits for the PD.  The MSB 7
bits will be used to encode the slave number for consistency
checking later on in the resource tracker.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: Add "native" argument to mlx4_cmd and its callers (where needed)
Jack Morgenstein [Tue, 13 Dec 2011 04:10:51 +0000 (04:10 +0000)]
mlx4_core: Add "native" argument to mlx4_cmd and its callers (where needed)

For SRIOV, some Hypervisor commands can be executed directly (native = 1).
Others should go through the command wrapper flow (for tracking resource
usage, for example, or for changing some HCA configurations that slaves
need to be notified of).

This patch sets the groundwork for this capability -- adding the correct
value of "native" in each case.

Note that if SRIOV is not activated, this parameter has no effect.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4: Extanding port_mask functionality
Jack Morgenstein [Tue, 13 Dec 2011 04:10:41 +0000 (04:10 +0000)]
mlx4: Extanding port_mask functionality

Port mask now has additional state.
Port can be set as "none". In this case neither the mlx4_en or mlx4_ib
drivers take ownership of the port.
In multifunction mode there is an option to set the vfs as single ported devices.
(in single function mode, both physical ports belong to same function)

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agomlx4_core: initial header-file changes for SRIOV support
Jack Morgenstein [Tue, 13 Dec 2011 04:10:33 +0000 (04:10 +0000)]
mlx4_core: initial header-file changes for SRIOV support

These changes will not affect module operation as yet. They
are only to get some structs and enums in place for use by
subsequent patches (making those smaller).

Added here:
* sriov state structs and inlines (mlx4_is_master/slave/mfunc)
* comm-channel and vhcr support structures
* enum values for new FW and comm-channel virtual commands
  (i.e., commands, passed via the comm channel to the PF-driver).
* prototypes for many command wrapper functions (used by the
  PF context for processing FW commands passed to it by the VFs).
* struct mlx4_eqe is moved from eq.c to mlx4.h (it will be used
  by other mlx4_core source files).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agonet: fix build error if CONFIG_CGROUPS=n
Eric Dumazet [Tue, 13 Dec 2011 03:59:08 +0000 (03:59 +0000)]
net: fix build error if CONFIG_CGROUPS=n

Reported-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobe2net: refactor/cleanup vf configuration code
Sathya Perla [Tue, 13 Dec 2011 00:58:50 +0000 (00:58 +0000)]
be2net: refactor/cleanup vf configuration code

- use adapter->num_vfs (and not the module param) to store the actual
number of vfs created. Use the same variable to reflect SRIOV
enable/disable state. So, drop the adapter->sriov_enabled field.

- use for_all_vfs() macro in VF configuration code

- drop the "vf_" prefix for the fields of be_vf_cfg; the prefix is
redundant and removing it helps reduce line wrap

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobe2net: fix ethtool ringparam reporting
Sathya Perla [Tue, 13 Dec 2011 00:58:49 +0000 (00:58 +0000)]
be2net: fix ethtool ringparam reporting

The ethtool "-g" option is supposed to report the max queue length and
user modified queue length for RX and TX queues.  be2net doesn't support
user modification of queue lengths. So, the correct values for these
would be the max numbers.
be2net incorrectly reports the queue used values for these fields.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobnx2x: properly update skb when mtu > 1500
Dmitry Kravkov [Mon, 12 Dec 2011 23:40:53 +0000 (23:40 +0000)]
bnx2x: properly update skb when mtu > 1500

Since commit e52fcb2462ac484e6dd6e68869536609f0216938 newly allocated
skb for small packets are not updated properly and dropped by stack.

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agonetem: add cell concept to simulate special MAC behavior
Hagen Paul Pfeifer [Mon, 12 Dec 2011 14:30:00 +0000 (14:30 +0000)]
netem: add cell concept to simulate special MAC behavior

This extension can be used to simulate special link layer
characteristics. Simulate because packet data is not modified, only the
calculation base is changed to delay a packet based on the original
packet size and artificial cell information.

packet_overhead can be used to simulate a link layer header compression
scheme (e.g. set packet_overhead to -20) or with a positive
packet_overhead value an additional MAC header can be simulated. It is
also possible to "replace" the 14 byte Ethernet header with something
else.

cell_size and cell_overhead can be used to simulate link layer schemes,
based on cells, like some TDMA schemes. Another application area are MAC
schemes using a link layer fragmentation with a (small) header each.
Cell size is the maximum amount of data bytes within one cell. Cell
overhead is an additional variable to change the per-cell-overhead
(e.g.  5 byte header per fragment).

Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
cell overhead 5 byte):

  tc qdisc add dev eth0 root netem rate 5kbit 20 100 5

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoMerge branch 'batman-adv/next' of git://git.open-mesh.org/linux-merge
David S. Miller [Tue, 13 Dec 2011 00:26:07 +0000 (19:26 -0500)]
Merge branch 'batman-adv/next' of git://git.open-mesh.org/linux-merge

12 years agosch_gred: should not use GFP_KERNEL while holding a spinlock
Eric Dumazet [Sun, 11 Dec 2011 23:42:53 +0000 (23:42 +0000)]
sch_gred: should not use GFP_KERNEL while holding a spinlock

gred_change_vq() is called under sch_tree_lock(sch).

This means a spinlock is held, and we are not allowed to sleep in this
context.

We might pre-allocate memory using GFP_KERNEL before taking spinlock,
but this is not suitable for stable material.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoDisplay maximum tcp memory allocation in kmem cgroup
Glauber Costa [Sun, 11 Dec 2011 21:47:09 +0000 (21:47 +0000)]
Display maximum tcp memory allocation in kmem cgroup

This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
kmem_cgroup filesystem. The root cgroup will display a value equal
to RESOURCE_MAX. This is to avoid introducing any locking schemes in
the network paths when cgroups are not being actively used.

All others, will see the maximum memory ever used by this cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoDisplay current tcp failcnt in kmem cgroup
Glauber Costa [Sun, 11 Dec 2011 21:47:08 +0000 (21:47 +0000)]
Display current tcp failcnt in kmem cgroup

This patch introduces kmem.tcp.failcnt file, living in the
kmem_cgroup filesystem. Following the pattern in the other
memcg resources, this files keeps a counter of how many times
allocation failed due to limits being hit in this cgroup.
The root cgroup will always show a failcnt of 0.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoDisplay current tcp memory allocation in kmem cgroup
Glauber Costa [Sun, 11 Dec 2011 21:47:07 +0000 (21:47 +0000)]
Display current tcp memory allocation in kmem cgroup

This patch introduces kmem.tcp.usage_in_bytes file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotcp buffer limitation: per-cgroup limit
Glauber Costa [Sun, 11 Dec 2011 21:47:06 +0000 (21:47 +0000)]
tcp buffer limitation: per-cgroup limit

This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.

This value is ignored in the root cgroup, and in all others,
caps the value specified by the admin in the net namespaces'
view of tcp_sysctl_mem.

If namespaces are being used, the admin is allowed to set a
value bigger than cgroup's maximum, the same way it is allowed
to set pretty much unlimited values in a real box.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoper-netns ipv4 sysctl_tcp_mem
Glauber Costa [Sun, 11 Dec 2011 21:47:05 +0000 (21:47 +0000)]
per-netns ipv4 sysctl_tcp_mem

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agotcp memory pressure controls
Glauber Costa [Sun, 11 Dec 2011 21:47:04 +0000 (21:47 +0000)]
tcp memory pressure controls

This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agosocket: initial cgroup code.
Glauber Costa [Sun, 11 Dec 2011 21:47:03 +0000 (21:47 +0000)]
socket: initial cgroup code.

The goal of this work is to move the memory pressure tcp
controls to a cgroup, instead of just relying on global
conditions.

To avoid excessive overhead in the network fast paths,
the code that accounts allocated memory to a cgroup is
hidden inside a static_branch(). This branch is patched out
until the first non-root cgroup is created. So when nobody
is using cgroups, even if it is mounted, no significant performance
penalty should be seen.

This patch handles the generic part of the code, and has nothing
tcp-specific.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agofoundations of per-cgroup memory pressure controlling.
Glauber Costa [Sun, 11 Dec 2011 21:47:02 +0000 (21:47 +0000)]
foundations of per-cgroup memory pressure controlling.

This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoBasic kernel memory functionality for the Memory Controller
Glauber Costa [Sun, 11 Dec 2011 21:47:01 +0000 (21:47 +0000)]
Basic kernel memory functionality for the Memory Controller

This patch lays down the foundation for the kernel memory component
of the Memory Controller.

As of today, I am only laying down the following files:

 * memory.independent_kmem_limit
 * memory.kmem.limit_in_bytes (currently ignored)
 * memory.kmem.usage_in_bytes (always zero)

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
CC: Johannes Weiner <jweiner@redhat.com>
CC: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoxen-netfront: delay gARP until backend switches to Connected
Laszlo Ersek [Sun, 11 Dec 2011 01:48:59 +0000 (01:48 +0000)]
xen-netfront: delay gARP until backend switches to Connected

After a guest is live migrated, the xen-netfront driver emits a gratuitous
ARP message, so that networking hardware on the target host's subnet can
take notice, and public routing to the guest is re-established. However,
if the packet appears on the backend interface before the backend is added
to the target host's bridge, the packet is lost, and the migrated guest's
peers become unable to talk to the guest.

A sufficient two-parts condition to prevent the above is:

(1) ensure that the backend only moves to Connected xenbus state after its
hotplug scripts completed, ie. the netback interface got added to the
bridge; and

(2) ensure the frontend only queues the gARP when it sees the backend move
to Connected.

These two together provide complete ordering. Sub-condition (1) is already
satisfied by commit f942dc2552b8 in Linus' tree, based on commit
6b0b80ca7165 from [1].

In general, the full condition is sufficient, not necessary, because,
according to [2], live migration has been working for a long time without
satisfying sub-condition (2). However, after 6b0b80ca7165 was backported
to the RHEL-5 host to ensure (1), (2) still proved necessary in the RHEL-6
guest. This patch intends to provide (2) for upstream.

The Reviewed-by line comes from [3].

[1] git://xenbits.xen.org/people/ianc/linux-2.6.git#upstream/dom0/backend/netback-history
[2] http://old-list-archives.xen.org/xen-devel/2011-06/msg01969.html
[3] http://old-list-archives.xen.org/xen-devel/2011-07/msg00484.html

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoipip, sit: copy parms.name after register_netdevice
Ted Feng [Thu, 8 Dec 2011 00:46:21 +0000 (00:46 +0000)]
ipip, sit: copy parms.name after register_netdevice

Same fix as 731abb9cb2 for ipip and sit tunnel.
Commit 1c5cae815d removed an explicit call to dev_alloc_name in
ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice
will now create a valid name, however the tunnel keeps a copy of the
name in the private parms structure. Fix this by copying the name back
after register_netdevice has successfully returned.

This shows up if you do a simple tunnel add, followed by a tunnel show:

$ sudo ip tunnel add mode ipip remote 10.2.20.211
$ ip tunnel
tunl0: ip/ip  remote any  local any  ttl inherit  nopmtudisc
tunl%d: ip/ip  remote 10.2.20.211  local any  ttl inherit
$ sudo ip tunnel add mode sit remote 10.2.20.212
$ ip tunnel
sit0: ipv6/ip  remote any  local any  ttl 64  nopmtudisc 6rd-prefix 2002::/16
sit%d: ioctl 89f8 failed: No such device
sit%d: ipv6/ip  remote 10.2.20.212  local any  ttl inherit

Cc: stable@vger.kernel.org
Signed-off-by: Ted Feng <artisdom@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoipv6: Fix for adding multicast route for loopback device automatically.
Li Wei [Tue, 6 Dec 2011 21:23:45 +0000 (21:23 +0000)]
ipv6: Fix for adding multicast route for loopback device automatically.

There is no obvious reason to add a default multicast route for loopback
devices, otherwise there would be a route entry whose dst.error set to
-ENETUNREACH that would blocking all multicast packets.

====================

[ more detailed explanation ]

The problem is that the resulting routing table depends on the sequence
of interface's initialization and in some situation, that would block all
muticast packets. Suppose there are two interfaces on my computer
(lo and eth0), if we initailize 'lo' before 'eth0', the resuting routing
table(for multicast) would be

# ip -6 route show | grep ff00::
unreachable ff00::/8 dev lo metric 256 error -101
ff00::/8 dev eth0 metric 256

When sending multicasting packets, routing subsystem will return the first
route entry which with a error set to -101(ENETUNREACH).

I know the kernel will set the default ipv6 address for 'lo' when it is up
and won't set the default multicast route for it, but there is no reason to
stop 'init' program from setting address for 'lo', and that is exactly what
systemd did.

I am sure there is something wrong with kernel or systemd, currently I preferred
kernel caused this problem.

====================

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wirel...
John W. Linville [Mon, 12 Dec 2011 19:19:43 +0000 (14:19 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/linville/wireless-next into for-davem

12 years agobatman-adv: Only write requested number of byte to user buffer
Sven Eckelmann [Sat, 10 Dec 2011 14:28:36 +0000 (15:28 +0100)]
batman-adv: Only write requested number of byte to user buffer

Don't write more than the requested number of bytes of an batman-adv icmp
packet to the userspace buffer. Otherwise unrelated userspace memory might get
overridden by the kernel.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
12 years agobatman-adv: Directly check read of icmp packet in copy_from_user
Sven Eckelmann [Sat, 10 Dec 2011 14:28:35 +0000 (15:28 +0100)]
batman-adv: Directly check read of icmp packet in copy_from_user

The access_ok read check can be directly done in copy_from_user since a failure
of access_ok is handled the same way as an error in __copy_from_user.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
12 years agobatman-adv: bat_socket_read missing checks
Paul Kot [Sat, 10 Dec 2011 14:28:34 +0000 (15:28 +0100)]
batman-adv: bat_socket_read missing checks

Writing a icmp_packet_rr and then reading icmp_packet can lead to kernel
memory corruption, if __user *buf is just below TASK_SIZE.

Signed-off-by: Paul Kot <pawlkt@gmail.com>
[sven@narfation.org: made it checkpatch clean]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
12 years agonet: use IS_ENABLED(CONFIG_IPV6)
Eric Dumazet [Sat, 10 Dec 2011 09:48:31 +0000 (09:48 +0000)]
net: use IS_ENABLED(CONFIG_IPV6)

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobe2net: workaround to fix a bug in BE
Ajit Khaparde [Fri, 9 Dec 2011 13:53:17 +0000 (13:53 +0000)]
be2net: workaround to fix a bug in BE

disable Tx vlan offloading in certain cases.

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agobe2net: update some counters to display via ethtool
Ajit Khaparde [Fri, 9 Dec 2011 13:53:09 +0000 (13:53 +0000)]
be2net: update some counters to display via ethtool

update pmem_fifo_overflow_drop, rx_priority_pause_frames counters.

Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp_diag: Fix the !ipv6 case
Pavel Emelyanov [Fri, 9 Dec 2011 23:35:07 +0000 (23:35 +0000)]
udp_diag: Fix the !ipv6 case

Wrap the udp6 lookup into the proper ifdef-s.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp_diag: Make it module when ipv6 is a module
Pavel Emelyanov [Fri, 9 Dec 2011 23:33:43 +0000 (23:33 +0000)]
udp_diag: Make it module when ipv6 is a module

Eric Dumazet reported, that when inet_diag is built-in the udp_diag also goes
built-in and when ipv6 is a module the udp6 lookup symbol is not found.

  LD      .tmp_vmlinux1
net/built-in.o: In function `udp_dump_one':
udp_diag.c:(.text+0xa2b40): undefined reference to `__udp6_lib_lookup'
make: *** [.tmp_vmlinux1] Erreur 1

Fix this by making udp diag build mode depend on both -- inet diag and ipv6.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoiwlwifi regression in 20111205 merge
Nikolay Martynov [Fri, 9 Dec 2011 02:43:39 +0000 (21:43 -0500)]
iwlwifi regression in 20111205 merge

  It looks like the regression was introduced between 20111202 and
20111205 (linux-next tree). Symptoms: connection to AP seem to be
established, but no data goes though it in any way. Tested on intel
5300.
  Peek at the changes have shown that it looks like at least part of
the code wasn't merged properly. It was originally committed into
iwl_agn.c but code in question was moved to iwl-mac80211.c.
  This patch puts code in place and my card works again.

Signed-off-by: Nikolay Martynov <mar.kolya@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
12 years agowl12xx: silence tx_attr uninitialized warning in wl1271_tx_fill_hdr
John W. Linville [Thu, 8 Dec 2011 21:15:58 +0000 (16:15 -0500)]
wl12xx: silence tx_attr uninitialized warning in wl1271_tx_fill_hdr

  CC [M]  drivers/net/wireless/wl12xx/tx.o
drivers/net/wireless/wl12xx/tx.c: In function ‘wl1271_tx_fill_hdr’:
drivers/net/wireless/wl12xx/tx.c:288:6: warning: ‘tx_attr’ may be used uninitialized in this function

Signed-off-by: John W. Linville <linville@tuxdriver.com>
12 years agoudp_diag: Wire the udp_diag module into kbuild
Pavel Emelyanov [Fri, 9 Dec 2011 06:24:36 +0000 (06:24 +0000)]
udp_diag: Wire the udp_diag module into kbuild

Copy-s/tcp/udp/-paste from TCP bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp_diag: Implement the dump-all functionality
Pavel Emelyanov [Fri, 9 Dec 2011 06:24:21 +0000 (06:24 +0000)]
udp_diag: Implement the dump-all functionality

Do the same as TCP does -- iterate the given udp_table, filter
sockets with bytecode and dump sockets into reply message.

The same filtering as for TCP applies, though only some of the
state bits really matter.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp_diag: Implement the get_exact dumping functionality
Pavel Emelyanov [Fri, 9 Dec 2011 06:24:06 +0000 (06:24 +0000)]
udp_diag: Implement the get_exact dumping functionality

Do the same as TCP does -- lookup a socket in the given udp_table,
check cookie, fill the reply message with existing inet socket dumping
helper and send one back.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp_diag: Basic skeleton
Pavel Emelyanov [Fri, 9 Dec 2011 06:23:51 +0000 (06:23 +0000)]
udp_diag: Basic skeleton

Introduce the transport level diag handler module for UDP (and UDP-lite)
sockets and register (empty for now) callbacks in the inet_diag module.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoudp: Export code sk lookup routines
Pavel Emelyanov [Fri, 9 Dec 2011 06:23:34 +0000 (06:23 +0000)]
udp: Export code sk lookup routines

The UDP diag get_exact handler will require them to find a
socket by provided net, [sd]addr-s, [sd]ports and device.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Generalize inet_diag dump and get_exact calls
Pavel Emelyanov [Fri, 9 Dec 2011 06:23:18 +0000 (06:23 +0000)]
inet_diag: Generalize inet_diag dump and get_exact calls

Introduce two callbacks in inet_diag_handler -- one for dumping all
sockets (with filters) and the other one for dumping a single sk.

Replace direct calls to icsk handlers with indirect calls to callbacks
provided by handlers.

Make existing TCP and DCCP handlers use provided helpers for icsk-s.

The UDP diag module will provide its own.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Introduce the inet socket dumping routine
Pavel Emelyanov [Fri, 9 Dec 2011 06:23:00 +0000 (06:23 +0000)]
inet_diag: Introduce the inet socket dumping routine

The existing inet_csk_diag_fill dumps the inet connection sock info
into the netlink inet_diag_message. Prepare this routine to be able
to dump only the inet_sock part of a socket if the icsk part is missing.

This will be used by UDP diag module when dumping UDP sockets.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Introduce the byte-code run on an inet socket
Pavel Emelyanov [Fri, 9 Dec 2011 06:22:44 +0000 (06:22 +0000)]
inet_diag: Introduce the byte-code run on an inet socket

The upcoming UDP module will require exactly this ability, so just
move the existing code to provide one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Split inet_diag_get_exact into parts
Pavel Emelyanov [Fri, 9 Dec 2011 06:22:26 +0000 (06:22 +0000)]
inet_diag: Split inet_diag_get_exact into parts

Similar to previous patch: the 1st part locks the inet handler
and will get generalized and the 2nd one dumps icsk-s and will
be used by TCP and DCCP handlers.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Split inet_diag_get_exact into parts
Pavel Emelyanov [Fri, 9 Dec 2011 06:22:10 +0000 (06:22 +0000)]
inet_diag: Split inet_diag_get_exact into parts

The 1st part locks the inet handler and the 2nd one dump the
inet connection sock.

In the next patches the 1st part will be generalized to call
the socket dumping routine indirectly (i.e. TCP/UDP/DCCP) and
the 2nd part will be used by TCP and DCCP handlers.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Export inet diag cookie checking routine
Pavel Emelyanov [Fri, 9 Dec 2011 06:21:53 +0000 (06:21 +0000)]
inet_diag: Export inet diag cookie checking routine

The netlink diag susbsys stores sk address bits in the nl message
as a "cookie" and uses one when dumps details about particular
socket.

The same will be required for udp diag module, so introduce a heler
in inet_diag module

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Reduce the number of args for bytecode run routine
Pavel Emelyanov [Fri, 9 Dec 2011 06:21:34 +0000 (06:21 +0000)]
inet_diag: Reduce the number of args for bytecode run routine

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoinet_diag: Remove indirect sizeof from inet diag handlers
Pavel Emelyanov [Fri, 9 Dec 2011 06:21:16 +0000 (06:21 +0000)]
inet_diag: Remove indirect sizeof from inet diag handlers

There's an info_size value stored on inet_diag_handler, but for existing
code this value is effectively constant, so just use sizeof(struct tcp_info)
where required.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
12 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wirel...
John W. Linville [Fri, 9 Dec 2011 19:07:12 +0000 (14:07 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/linville/wireless into for-davem

12 years agosch_red: generalize accurate MAX_P support to RED/GRED/CHOKE
Eric Dumazet [Fri, 9 Dec 2011 02:46:45 +0000 (02:46 +0000)]
sch_red: generalize accurate MAX_P support to RED/GRED/CHOKE

Now RED uses a Q0.32 number to store max_p (max probability), allow
RED/GRED/CHOKE to use/report full resolution at config/dump time.

Old tc binaries are non aware of new attributes, and still set/get Plog.

New tc binary set/get both Plog and max_p for backward compatibility,
they display "probability value" if they get max_p from new kernels.

# tc -d  qdisc show dev ...
...
qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
probability 0.09 Scell_log 15

Make sure we avoid potential divides by 0 in reciprocal_value(), if
(max_th - min_th) is big.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>