git.openwrt.org Git - openwrt/staging/blogic.git/log

DMA, CMA: support arbitrary bitmap granularity

PPC KVM's CMA area management requires arbitrary bitmap granularity,
since it wants to reserve a very large memory region and manage it with
a bitmap in which one bit represents several pages, to reduce management
overhead.  So support arbitrary bitmap granularity in preparation for the
following generalization.
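
As a rough illustration of what bitmap granularity means here, the sketch
below computes how many bits are needed when one bit covers
(1UL << order_per_bit) pages.  The helper names are hypothetical and not
taken from the patch, and the page count is assumed to be a multiple of
the granule size.

```c
/*
 * Number of bitmap bits needed to track nr_pages pages when each bit
 * stands for (1UL << order_per_bit) pages.  Hypothetical helper; assumes
 * nr_pages is a multiple of the granule size.
 */
static unsigned long bitmap_bits(unsigned long nr_pages,
				 unsigned int order_per_bit)
{
	return nr_pages >> order_per_bit;
}

/* Index of the bit that covers the page at pfn_offset within the area. */
static unsigned long bit_for_page(unsigned long pfn_offset,
				  unsigned int order_per_bit)
{
	return pfn_offset >> order_per_bit;
}
```

With order_per_bit = 4, one bit covers 16 pages, so the bitmap for a given
area is 16 times smaller than with one bit per page.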

[akpm@linux-foundation.org: s/1/1UL/]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

DMA, CMA: support alignment constraint on CMA region

PPC KVM's CMA area management needs an alignment constraint on the CMA
region.  So support it to prepare for generalization of the CMA area
management functionality.

Additionally, add some comments that explain why an alignment constraint
is needed on the CMA region.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

DMA, CMA: separate core CMA management codes from DMA APIs

To prepare for future generalization work on the CMA area management
code, we need to separate the core CMA management code from the DMA APIs.
We will extend these core functions to cover the requirements of PPC
KVM's CMA area management functionality in the following patches.  This
separation lets us avoid touching the DMA APIs while extending the core
functions.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/internal.h: use nth_page

Use nth_page instead of pfn_to_page(page_to_pfn

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: page_alloc: simplify drain_zone_pages by using min()

Instead of open-coding the minimum of two values, just use the min macro.
That is what it is there for.  While changing the function, also change
the type of the batch local variable to match the type of
per_cpu_pages::batch (which is int).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mem-hotplug: introduce MMOP_OFFLINE to replace the hard coding -1

In store_mem_state(), we have:

  ...
          else if (!strncmp(buf, "offline", min_t(int, count, 7)))
                  online_type = -1;
  ...
          case -1:
                  ret = device_offline(&mem->dev);
                  break;
  ...

Here, "offline" is hard coded as -1.

This patch does the following renaming:

ONLINE_KEEP     ->  MMOP_ONLINE_KEEP
ONLINE_KERNEL   ->  MMOP_ONLINE_KERNEL
ONLINE_MOVABLE  ->  MMOP_ONLINE_MOVABLE

and introduces MMOP_OFFLINE = -1 to avoid hard coding.
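
A minimal sketch of how the renamed constants might look.  Only
MMOP_OFFLINE = -1 is stated above; the numeric values of the other three
constants and the mmop_action() helper are assumptions made for
illustration.

```c
/*
 * Sketch of the renamed constants.  Only MMOP_OFFLINE = -1 comes from
 * the commit message; the other values are assumed.
 */
enum {
	MMOP_OFFLINE = -1,
	MMOP_ONLINE_KEEP,	/* assumed 0 */
	MMOP_ONLINE_KERNEL,	/* assumed 1 */
	MMOP_ONLINE_MOVABLE,	/* assumed 2 */
};

/* Hypothetical dispatcher mirroring the switch in store_mem_state(). */
static const char *mmop_action(int online_type)
{
	switch (online_type) {
	case MMOP_OFFLINE:
		return "offline";
	case MMOP_ONLINE_KEEP:
	case MMOP_ONLINE_KERNEL:
	case MMOP_ONLINE_MOVABLE:
		return "online";
	default:
		return "invalid";
	}
}
```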

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mem-hotplug: avoid illegal state prefixed with legal state when changing state of memory_block

We use the following command to online a memory_block:

    echo online|online_kernel|online_movable > /sys/devices/system/memory/memoryXXX/state

But, if we do the following:

    echo online_fhsjkghfkd > /sys/devices/system/memory/memoryXXX/state

the block will also be onlined.

This is because the following code in store_mem_state() does not compare
the whole string, but only the prefix of the string.

  store_mem_state()
  {
       ......
          if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))

Here, only the first 13 characters of the string are compared.  If we give
"online_kernelXXXXXX", it will be recognized as online_kernel, which is
incorrect.

                  online_type = ONLINE_KERNEL;
          else if (!strncmp(buf, "online_movable", min_t(int, count, 14)))

We have the same problem here,

                  online_type = ONLINE_MOVABLE;
          else if (!strncmp(buf, "online", min_t(int, count, 6)))

here,

(This case is more problematic.  If we give online_movalbe, which is a
typo of online_movable, it will be recognized as online without the user
noticing.)

                  online_type = ONLINE_KEEP;
          else if (!strncmp(buf, "offline", min_t(int, count, 7)))

and here.

                  online_type = -1;
          else {
                  ret = -EINVAL;
                  goto err;
          }
       ......
  }

This patch fixes this problem by using sysfs_streq() to compare the
whole string.
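
The difference between the two comparisons can be sketched in user space
as follows.  prefix_match() imitates the old strncmp() logic, and
streq_nl() is a hypothetical stand-in written in the spirit of
sysfs_streq(): it compares whole strings and tolerates one trailing
newline.  Neither helper is taken from the patch.

```c
#include <stdbool.h>
#include <string.h>

/* Prefix comparison as in the old code: extra trailing characters pass. */
static bool prefix_match(const char *buf, size_t count, const char *word)
{
	size_t n = strlen(word);

	return strncmp(buf, word, count < n ? count : n) == 0;
}

/*
 * Whole-string comparison that tolerates a single trailing newline on
 * either side.  Hypothetical user-space stand-in for sysfs_streq().
 */
static bool streq_nl(const char *a, const char *b)
{
	while (*a && *a == *b) {
		a++;
		b++;
	}
	if (*a == *b)
		return true;
	if (!*a && *b == '\n' && !b[1])
		return true;
	if (*a == '\n' && !a[1] && !*b)
		return true;
	return false;
}
```

With these helpers, "online_fhsjkghfkd\n" passes the prefix check against
"online" but is rejected by the whole-string check, while "online\n" is
accepted by both.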

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()

Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
orig_pte upon which all subsequent decisions and pte_same() tests will
be made.

I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
trinity, and I am not optimistic that it will fix it. But I have found
no other explanation, and ACCESS_ONCE() here will surely not hurt.

If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped
in a second time on top of itself: mapcount 2 for a single pte).
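
The idea can be sketched in user space as below.  The macro uses a
volatile cast so that the compiler performs the load once and cannot
silently re-read the location later; it relies on the GNU C __typeof__
extension.  pte_sketch_t and handle_fault_sketch() are hypothetical
stand-ins, not the real handle_pte_fault().

```c
/*
 * User-space sketch of an ACCESS_ONCE-style macro: the volatile cast
 * asks the compiler to perform this load exactly once.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for a page table entry. */
typedef struct { unsigned long val; } pte_sketch_t;

static int handle_fault_sketch(pte_sketch_t *ptep)
{
	/* Take one snapshot; every later decision uses 'entry', not *ptep. */
	unsigned long entry = ACCESS_ONCE(ptep->val);

	if (entry == 0)
		return 0;		/* treated as "not present" */
	return (entry & 1) ? 1 : 2;	/* arbitrary example decisions */
}
```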

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

vmalloc: use rcu list iterator to reduce vmap_area_lock contention

Richard Yao reported a month ago that his system has trouble with
vmap_area_lock contention during performance analysis via /proc/meminfo.
Andrew asked why his analysis reads /proc/meminfo so heavily, but he
didn't answer.

  https://lkml.org/lkml/2014/4/10/416

Although I'm not sure whether this is the right usage, there is a
solution that reduces vmap_area_lock contention with no side effects:
just use the rcu list iterator in get_vmalloc_info().

rcu can be used in this function because the RCU protocol is already
respected by all writers, since Nick Piggin's commit db64fe02258f1 ("mm:
rewrite vmap layer") back in linux-2.6.28.

Specifically :
   insertions use list_add_rcu(),
   deletions use list_del_rcu() and kfree_rcu().

Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.

Note that __purge_vmap_area_lazy() already uses this rcu protection.

        rcu_read_lock();
        list_for_each_entry_rcu(va, &vmap_area_list, list) {
                if (va->flags & VM_LAZY_FREE) {
                        if (va->va_start < *start)
                                *start = va->va_start;
                        if (va->va_end > *end)
                                *end = va->va_end;
                        nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                        list_add_tail(&va->purge_list, &valist);
                        va->flags |= VM_LAZY_FREEING;
                        va->flags &= ~VM_LAZY_FREE;
                }
        }
        rcu_read_unlock();

Peter:

: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.

Joonsoo:

: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time.  While we try to get
: certain stats, the other stats can change.  And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.

[edumazet@google.com: add more commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Richard Yao <ryao@gentoo.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

include/linux/memblock.h: add __init to memblock_set_bottom_up()

memblock_set_bottom_up() is only called by __init
cmdline_parse_movable_node() and __init numa_init().

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/page_alloc.c: unexport alloc_pages_exact_nid()

It is only called by mm/page_cgroup.c which cannot be modular.

Reported-by: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/page_alloc.c: add __meminit to alloc_pages_exact_nid()

alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/memory_hotplug.c: add __meminit to grow_zone_span/grow_pgdat_span

grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Toshi Kani <toshi.kani@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/readahead.c: remove unused file_ra_state from count_history_pages

count_history_pages only calls page_cache_prev_hole in rcu_lock
context using the address_space mapping.  There's no need to have
file_ra_state here.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: convert last use of __FUNCTION__ to __func__

Just about all of these have been converted to __func__, so convert the
last use.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: fix the alias count (via sysfs) of slab cache

We mark some slab caches (e.g. kmem_cache_node) as unmergeable by
setting refcount to -1, and their alias should be 0, not refcount-1, so
correct it here.
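
The corrected computation can be sketched as a small helper.  The
function name alias_count() is hypothetical; the rule it encodes (a
negative refcount means unmergeable, hence 0 aliases) is the one
described above.

```c
/*
 * Alias count as described above: unmergeable caches carry a negative
 * refcount and report 0 aliases; otherwise it is refcount - 1.
 */
static int alias_count(int refcount)
{
	return refcount < 0 ? 0 : refcount - 1;
}
```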

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, slub: fix some indenting in cmpxchg_double_slab()

The return statement goes with the cmpxchg_double() condition so it needs
to be indented another tab.

Also these days the fashion is to line function parameters up, and it
looks nicer that way because then the "freelist_new" is not at the same
indent level as the "return 1;".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/slab.c: fix comments

Current struct kmem_cache has no 'lock' field, and slab page is managed by
struct kmem_cache_node, which has 'list_lock' field.

Clean up the related comment.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: move slab related stuff from util.c to slab_common.c

Functions krealloc(), __krealloc(), kzfree() belong to the slab API, so
they should be placed in slab_common.c.

Also move the slab allocator's tracepoint definitions to slab_common.c.
No functional changes here.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slub: avoid duplicate creation on the first object

When a kmem_cache is created with a ctor, each object in the kmem_cache
is initialized before it is ready to use.  In the slub implementation,
however, the first object is initialized twice.

This patch reduces the duplication of initialization of the first
object.

Fix commit 7656c72b ("SLUB: add macros for scanning objects in a slab").

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: change int to size_t for representing allocation size

It is better to represent allocation size in size_t rather than int. So
change it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: remove BAD_ALIEN_MAGIC

BAD_ALIEN_MAGIC value isn't used anymore. So remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: remove a useless lockdep annotation

Now, there is no code that holds two locks simultaneously, since we don't
call slab_destroy() while holding any lock.  So, the lockdep annotation is
useless now.  Remove it.

v2: don't remove BAD_ALIEN_MAGIC in this patch. It will be removed
    in the following patch.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: destroy a slab without holding any alien cache lock

I haven't heard that this alien cache lock is contended, but reducing
the chance of contention is generally better.  This change also lets us
simplify the complex lockdep annotation in the slab code, which is done
in the following patch.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: use the lock on alien_cache, instead of the lock on array_cache

Now, we have separate alien_cache structure, so it'd be better to hold
the lock on alien_cache while manipulating alien_cache. After that, we
don't need the lock on array_cache, so remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: introduce alien_cache

Currently, we use array_cache for alien_cache.  Although they are mostly
similar, there is one difference: the need for a spinlock.  We don't
need a spinlock for array_cache itself, but to use array_cache for
alien_cache, the array_cache structure must have a spinlock.  This is
needless overhead, so removing it would be better.  This patch prepares
for that by introducing alien_cache and using it.  In the following
patch, we remove the spinlock in array_cache.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: factor out initialization of array cache

Factor out initialization of array cache to use it in following patch.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: defer slab_destroy in free_block()

In free_block(), if freeing an object makes a new free slab and the number
of free_objects exceeds free_limit, we start to destroy this new free slab
while holding the kmem_cache node lock.  Holding the lock is unnecessary
and, generally, holding a lock for as short a time as possible is a good
thing.  I have never measured the performance effect of this, but it is
better not to hold the lock longer than needed.

Commented by Christoph:
  This is also good because kmem_cache_free is no longer called while
  holding the node lock. So we avoid one case of recursion.
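
The pattern can be sketched in user space as below: under the lock,
surplus slabs are only moved onto a private list, and they are destroyed
after the lock is dropped.  All names are hypothetical, and a plain flag
stands in for the node lock so the sketch stays self-contained.

```c
#include <stddef.h>

struct slab_sketch {
	struct slab_sketch *next;
};

static int lock_held;		/* stand-in for the node lock */
static int destroyed_unlocked;	/* slabs destroyed with the lock released */

static void node_lock(void)   { lock_held = 1; }
static void node_unlock(void) { lock_held = 0; }

static void slab_destroy_sketch(struct slab_sketch *s)
{
	(void)s;
	if (!lock_held)
		destroyed_unlocked++;
}

/* Under the lock, only move surplus slabs onto a private list... */
static void free_block_sketch(struct slab_sketch *surplus,
			      struct slab_sketch **to_destroy)
{
	while (surplus) {
		struct slab_sketch *next = surplus->next;

		surplus->next = *to_destroy;
		*to_destroy = surplus;
		surplus = next;
	}
}

/* ...and destroy them only after the lock has been dropped. */
static void free_and_destroy(struct slab_sketch *surplus)
{
	struct slab_sketch *list = NULL;

	node_lock();
	free_block_sketch(surplus, &list);
	node_unlock();

	while (list) {
		struct slab_sketch *next = list->next;

		slab_destroy_sketch(list);
		list = next;
	}
}
```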

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: move up code to get kmem_cache_node in free_block()

node isn't changed, so we don't need to retrieve this structure
every time we move the object.  Maybe the compiler does this optimization,
but making it explicit is better.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: add unlikely macro to help compiler

This patchset does some cleanup and tries to remove lockdep annotation.

Patches 1~2 are just for really really minor improvement.
Patches 3~9 are for clean-up and removing lockdep annotation.

There are two cases in which lockdep annotation is needed in SLAB.
1) holding two node locks
2) holding two array cache (alien cache) locks

I looked at the code and found that we can avoid these cases without any
negative effect.

1) occurs if freeing an object makes a new free slab and we decide to
   destroy it.  Although we don't need to hold the lock while destroying
   a slab, the current code does so.  Destroying a slab without holding
   the lock would help reduce lock contention.  To do this, I change the
   implementation so that the new free slab is destroyed after the lock
   is released.

2) occurs in a similar situation.  When we free an object from a
   non-local node, we put the object into the alien cache while holding
   the alien cache lock.  If the alien cache is full, we try to flush it
   to the proper node cache, and at this point a new free slab could be
   made.  Destroying it would then start, and we would free a metadata
   object that comes from another node.  In this case, we need another
   node's alien cache lock to free the object.  This forces us to hold
   two array cache locks, and so we need lockdep annotation even though
   they are always different locks and deadlock is not possible.  To
   prevent this situation, I use the same approach as in 1).

In this way, we can avoid cases 1) and 2) and then remove the lockdep
annotation.  As the short stat shows, this makes the SLAB code much
simpler.

This patch (of 9):

slab_should_failslab() is called on every allocation, so optimizing it
is reasonable.  We normally don't allocate from kmem_cache.  It is only
used when a new kmem_cache is created, so it's a very rare case.
Therefore, add the unlikely macro to help compiler optimization.
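
A user-space sketch of such a branch hint, built on the GCC/Clang
__builtin_expect builtin.  should_fail_sketch() and its parameters are
hypothetical stand-ins for the rare "allocating from kmem_cache itself"
case described above.

```c
/* Branch hint built on the GCC/Clang __builtin_expect builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical check: 'is_boot_cache' stands in for the rare case. */
static int should_fail_sketch(int is_boot_cache, int fault_injected)
{
	if (unlikely(is_boot_cache))
		return 0;	/* rare path: never inject failures here */
	return fault_injected;
}
```

The hint does not change the result of the condition; it only tells the
compiler which branch to lay out as the common path.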

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y

There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All these features should work regardless of the SLUB_DEBUG config, as all
of them already have their own Kconfig options.

This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration.  It
simply has not worked before because should_failslab() call was in a
hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".

Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled.  The might_sleep_if() call
can generate some code even if DEBUG_ATOMIC_SLEEP=n.  For
PREEMPT_VOLUNTARY=y might_sleep() inserts _cond_resched() call, but I
think it should be ok.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, slub: mark resiliency_test as init text

resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: slab.h: wrap the whole file with guarding macro

Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole mm/slab.h. It seems like it was
done unintentionally.

Wrap the whole file by moving closing #endif to the end of it.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: use get_node() and kmem_cache_node() functions

Use the two functions to simplify the code, avoiding numerous explicitly
coded checks for whether a certain node is online.

Get rid of various repeated calculations of kmem_cache_node structures.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slub: use new node functions

Make use of the new node functions in mm/slab.h to reduce code size and
simplify.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab common: add functions for kmem_cache_node access

The patchset provides two new functions in mm/slab.h and modifies SLAB
and SLUB to use these. The kmem_cache_node structure is shared between
both allocators and the use of common accessors will allow us to move
more code into slab_common.c in the future.

This patch (of 3):

These functions allow us to eliminate repeatedly used code in both SLAB
and SLUB and also allow for the insertion of debugging code that may be
needed in the development process.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/slab.c: add __init to init_lock_keys

init_lock_keys is only called by __init kmem_cache_init_late

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kernel/watchdog.c: convert printk/pr_warning to pr_foo()

Replace some obsolete functions.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc

kcalloc manages count*sizeof overflow.
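
The overflow check that a kcalloc-style allocator performs can be
sketched in user space as follows.  checked_calloc() is a hypothetical
helper, not the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * User-space stand-in for a kcalloc-style allocator: refuse the request
 * when count * size would overflow instead of allocating a short buffer.
 */
static void *checked_calloc(size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;
	return calloc(count, size);
}
```

An open-coded allocation of count * size bytes would wrap around on
overflow and return a buffer smaller than the caller expects.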

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: race between umount and unfinished remastering during recovery

Orabug: 19074140

When umount is issued during recovery on the new master that has not
finished remastering locks, it triggers BUG() in
dlm_send_mig_lockres_msg().  Here is the situation:

1) node A has a lock on resource X mastered by node B.

2) node B dies ->  node A sets recovering flag for res X

3) Node C becomes the new master for resources owned by the
    dead node and is remastering locks of the dead node but
    has not finished the remastering process yet.

4) umount is issued on node C.

5) During processing of umount, ignoring unfinished recovery,
    node C attempts to migrate resource X to node A.

6) node A finds res X in DLM_LOCK_RES_RECOVERING state, considers
    it a logic error and sends back -EFAULT.

7) node C asserts BUG() upon seeing the EFAULT response from node A.

Fix is to delay migrating res X till remastering is finished at which
point recovering flag will be cleared on both A and C.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: remove conversion of total_backoff in dlm_join_domain()

The unit of total_backoff is msecs, not jiffies, so there is no need to do
the conversion.  Otherwise, the join timeout is not 90 sec.
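
The effect of the stray conversion can be sketched as below.  The HZ
value of 250, the constant name, and the helper functions are assumptions
for illustration; only the 90 second timeout and the millisecond unit
come from the description above.

```c
/* Assumed values for illustration only. */
#define HZ_SKETCH		250u
#define JOIN_TIMEOUT_MSECS	90000u

/* Simplified conversion; the real kernel helper also rounds up. */
static unsigned int msecs_to_jiffies_sketch(unsigned int ms)
{
	return ms * HZ_SKETCH / 1000u;
}

/* Buggy check: a millisecond count compared against a jiffies value. */
static int timed_out_buggy(unsigned int total_backoff_ms)
{
	return total_backoff_ms > msecs_to_jiffies_sketch(JOIN_TIMEOUT_MSECS);
}

/* Fixed check: both sides are in milliseconds. */
static int timed_out_fixed(unsigned int total_backoff_ms)
{
	return total_backoff_ms > JOIN_TIMEOUT_MSECS;
}
```

Under the assumed HZ of 250, the buggy check fires after 22500 ms of
backoff instead of 90000 ms.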

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ocfs2: correctly check the return value of ocfs2_search_extent_list

ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert; otherwise it may cause an out-of-bounds
array index.

Also, ocfs2_search_extent_list can only return a value less than
el->l_next_free_rec, so checking whether it is equal to or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.
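
The shape of the fix can be sketched as below: a search routine that
returns -1 on failure, and a caller that checks for -1 before using the
result as an array index.  The structure and function names are
hypothetical and much simpler than the ocfs2 extent list code.

```c
/* Hypothetical record: covers values in [start, start + len). */
struct rec_sketch {
	unsigned int start;
	unsigned int len;
};

/* Return the index of the record containing v, or -1 if none does. */
static int search_sketch(const struct rec_sketch *recs, int nr,
			 unsigned int v)
{
	for (int i = 0; i < nr; i++)
		if (v >= recs[i].start && v < recs[i].start + recs[i].len)
			return i;
	return -1;
}

/* Caller checks for -1 before indexing recs[]. */
static int lookup_len(const struct rec_sketch *recs, int nr,
		      unsigned int v, unsigned int *len)
{
	int idx = search_sketch(recs, nr, v);

	if (idx == -1)
		return -1;	/* never use -1 as an array index */
	*len = recs[idx].len;
	return 0;
}
```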

Signed-off-by: Yingtai Xie <xieyingtai@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/squashfs/super.c: logging cleanup

- Convert printk to pr_foo()
- Add pr_fmt for future logging entries
- Coalesce formats

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/squashfs/file_direct.c: replace count*size kmalloc by kmalloc_array

kmalloc_array() manages count*sizeof overflow.
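
As a rough userspace illustration of what an overflow-checked array
allocation does, the sketch below refuses the request when count * size
would wrap. sketch_alloc_array() is a hypothetical stand-in built on
malloc(), not the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n elements of the given size, or return NULL. */
static void *sketch_alloc_array(size_t n, size_t size)
{
	/* n * size would wrap past SIZE_MAX: fail instead of under-allocating. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

An open-coded malloc(n * size) would silently allocate a too-small buffer in
the wrapping case, which is the hazard the checked helper avoids.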

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sh: fix build error by adding generic ioport_{map/unmap}()

Fix build error as reported by Geert Uytterhoeven here:

http://kisskb.ellerman.id.au/kisskb/buildresult/11607865/

The error happens when CONFIG_HAS_IOPORT_MAP=n, which leaves
ioport_map/unmap() without definitions. Fix this build error by
adding these prototypes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org> [3.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sh: SH7724 clock fixes

Fix the device name for the CMT.

Add clocks called usb0 and usb1 so that r8a66597_hcd works again on the
ecovec24 board

Signed-off-by: Daniel Palmer <danieruru@gmail.com>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

arch/sh/kernel/time.c: use PTR_ERR_OR_ZERO

Replace IS_ERR/PTR_ERR.
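
The replaced idiom can be sketched in userspace as follows. The sketch_*
helpers imitate the error-pointer convention (small negative errno values
encoded in a pointer) and are assumptions made for illustration, not the
kernel's definitions.

```c
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* True if ptr encodes an error value in the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Decode the (negative) error value stored in an error pointer. */
static long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Encode a negative error value as a pointer. */
static void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Open-coded form: test for an error, then extract it. */
static long sketch_status_open_coded(const void *ptr)
{
	if (sketch_is_err(ptr))
		return sketch_ptr_err(ptr);
	return 0;
}

/* Combined helper: the error value, or zero for a valid pointer. */
static long sketch_ptr_err_or_zero(const void *ptr)
{
	return sketch_is_err(ptr) ? sketch_ptr_err(ptr) : 0;
}
```

Both forms return the same value; the combined helper only removes the
repeated branch at each call site.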

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

arch/sh/mm/asids-debugfs.c: use PTR_ERR_OR_ZERO

Replace IS_ERR/PTR_ERR.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

sh: remove CPU_SUBTYPE_SH7764

The symbol is an orphan, get rid of it.

Submitted by Richard a few months ago as "[PATCH 21/28] Remove
CPU_SUBTYPE_SH7764".

[pebolle@tiscali.nl: re-added dropped ||]
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

score, ptrace: remove unused macros

'COUNTER' and similar macros have names that are too generic, so they
easily conflict with definitions in other modules.

At present they are not used, so it is OK to simply remove them.  The
related warning (allmodconfig with score) is:

    CC [M]  drivers/md/raid1.o
  In file included from drivers/md/raid1.c:42:0:
  drivers/md/bitmap.h:93:0: warning: "COUNTER" redefined
   #define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)
   ^
  In file included from ./arch/score/include/asm/ptrace.h:4:0,
                   from include/linux/sched.h:31,
                   from include/linux/blkdev.h:4,
                   from drivers/md/raid1.c:36:
  ./arch/score/include/uapi/asm/ptrace.h:13:0: note: this is the location of the previous definition
   #define COUNTER  38

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ntfs: kernel-doc warning fixes

cached_page and lru_pvec were removed from ntfs_attr_extend_initialized
in commit 2ec93b0bf35f ("ntfs: clean up ntfs_attr_extend_initialized")

lru_pvec has been removed from __ntfs_grab_cache_pages in commit
4c99000ac47c ("ntfs: use add_to_page_cache_lru()")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/logfs/readwrite.c: kernel-doc warning fixes

s/-/:/ and fix variable names.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

./Makefile: explain stack-protector-strong CONFIG logic

This adds a hopefully helpful comment above the (seemingly weird) compiler
flag selection logic.

Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fanotify: fix double free of pending permission events

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double free issue for permission events which are
pending in group's notification queue while group is being destroyed.
These events are freed from fanotify_handle_event() but they are not
removed from groups notification queue and thus they get freed again
from fsnotify_flush_notify().

Fix the problem by removing permission events from notification queue
before freeing them if we skip processing access response. Also expand
comments in fanotify_release() to explain group shutdown in detail.
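
The double free and its fix can be modelled with a toy queue. In the sketch
below, "freeing" only increments a counter so the problem is observable
without undefined behaviour; all names are hypothetical and none of this is
the fanotify code.

```c
#include <stddef.h>

struct sketch_event {
	struct sketch_event *next;
	int free_count;	/* number of times the event was "freed" */
};

struct sketch_queue {
	struct sketch_event *head;
};

/* Unlink ev from the queue if it is queued. */
static void sketch_queue_remove(struct sketch_queue *q, struct sketch_event *ev)
{
	struct sketch_event **pp = &q->head;

	while (*pp) {
		if (*pp == ev) {
			*pp = ev->next;
			ev->next = NULL;
			return;
		}
		pp = &(*pp)->next;
	}
}

/* Handler path: free the event, optionally unlinking it first (the fix). */
static void sketch_release_event(struct sketch_queue *q,
				 struct sketch_event *ev, int unlink_first)
{
	if (unlink_first)
		sketch_queue_remove(q, ev);
	ev->free_count++;
}

/* Group teardown: free everything still sitting on the queue. */
static void sketch_flush(struct sketch_queue *q)
{
	while (q->head) {
		struct sketch_event *ev = q->head;

		q->head = ev->next;
		ev->free_count++;
	}
}
```

Without the unlink step, an event released by the handler is still reachable
from the queue and the flush frees it a second time.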

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fsnotify: rename event handling functions

Rename fsnotify_add_notify_event() to fsnotify_add_event() since the
"notify" part is redundant. Rename fsnotify_remove_notify_event() and
fsnotify_peek_notify_event() to fsnotify_remove_first_event() and
fsnotify_peek_first_event() respectively, since the "notify" part is
redundant and they really look at the first event in the queue.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/fscache: make ctl_table static

fscache_sysctls and fscache_sysctls_root are only used in main.c

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

kernel/auditfilter.c: replace count*size kmalloc by kcalloc

kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge git://git./linux/kernel/git/davem/ide

Pull IDE cleanup from David Miller:
"Just one minor cleanup"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
ide: use module_platform_driver()

Merge git://git./linux/kernel/git/davem/sparc-next

Pull sparc updates from David Miller:

1) Add sparc RAM output to /proc/iomem, from Bob Picco.

2) Allow seeks on /dev/mdesc, from Khalid Aziz.

3) Cleanup sparc64 I/O accessors, from Sam Ravnborg.

4) If update_mmu_cache{,_pmd}() is called with a non-valid mapping, do
    not insert it into the TLB miss hash tables otherwise we'll
    livelock.  Based upon work by Christopher Alexander Tobias Schulze.

5) Fix BREAK detection in sunsab driver when no actual characters are
    pending, from Christopher Alexander Tobias Schulze.

6) Because we have modules --> openfirmware --> vmalloc ordering of
    virtual memory, the lazy VMAP TLB flusher can cons up an invocation
    of flush_tlb_kernel_range() that covers the openfirmware address
    range.  Unfortunately this will flush out the firmware's locked TLB
    mapping which causes all kinds of trouble.  Just split up the flush
    request if this happens, but in the long term the lazy VMAP flusher
    should probably be made a little bit smarter.

    Based upon work by Christopher Alexander Tobias Schulze.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
  sparc64: Fix up merge thinko.
  sparc: Add "install" target
  arch/sparc/math-emu/math_32.c: drop stray break operator
  sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
  sparc64: Guard against flushing openfirmware mappings.
  sunsab: Fix detection of BREAK on sunsab serial console
  bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
  sparc64: Do not insert non-valid PTEs into the TSB hash table.
  sparc64: avoid code duplication in io_64.h
  sparc64: reorder functions in io_64.h
  sparc64: drop unused SLOW_DOWN_IO definitions
  sparc64: remove macro indirection in io_64.h
  sparc64: update IO access functions in PeeCeeI
  sparcspkr: use sbus_*() primitives for IO
  sparc: Add support for seek and shorter read to /dev/mdesc
  sparc: use %s for unaligned panic
  drivers/sbus/char: Micro-optimization in display7seg.c
  display7seg: Introduce the use of the managed version of kzalloc
  sparc64 - add mem to iomem resource

Merge git://git./linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

   1) Steady transitioning of the BPF infrastructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with initial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock deadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...

Merge tag 'random_for_linus' of git://git./linux/kernel/git/tytso/random

Pull randomness updates from Ted Ts'o:
"Cleanups and bug fixes to /dev/random, add a new getrandom(2) system
  call, which is a superset of OpenBSD's getentropy(2) call, for use
  with userspace crypto libraries such as LibreSSL.

  Also add the ability to have a kernel thread to pull entropy from
  hardware rng devices into /dev/random"
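
For readers who want to try the new system call from userspace, here is a
minimal sketch. It assumes a C library that provides the getrandom() wrapper
in <sys/random.h> (glibc gained one well after this merge); the helper name
sketch_fill_random() is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Fill buf with len random bytes; returns 0 on success, -1 on error. */
static int sketch_fill_random(unsigned char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		/* flags == 0: read from the urandom source, blocking until it is initialized */
		ssize_t n = getrandom(buf + done, len - done, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal: retry */
			return -1;
		}
		done += (size_t)n;
	}
	return 0;
}
```

Unlike reading /dev/urandom, this needs no file descriptor, which is part of
its appeal to userspace crypto libraries.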

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  hwrng: Pass entropy to add_hwgenerator_randomness() in bits, not bytes
  random: limit the contribution of the hw rng to at most half
  random: introduce getrandom(2) system call
  hw_random: fix sparse warning (NULL vs 0 for pointer)
  random: use registers from interrupted code for CPU's w/o a cycle counter
  hwrng: add per-device entropy derating
  hwrng: create filler thread
  random: add_hwgenerator_randomness() for feeding entropy from devices
  random: use an improved fast_mix() function
  random: clean up interrupt entropy accounting for archs w/o cycle counters
  random: only update the last_pulled time if we actually transferred entropy
  random: remove unneeded hash of a portion of the entropy pool
  random: always update the entropy pool under the spinlock

Merge branch 'next' of git://git./linux/kernel/git/jmorris/linux-security

Pull security subsystem updates from James Morris:
"In this release:

   - PKCS#7 parser for the key management subsystem from David Howells
   - appoint Kees Cook as seccomp maintainer
   - bugfixes and general maintenance across the subsystem"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
  X.509: Need to export x509_request_asymmetric_key()
  netlabel: shorter names for the NetLabel catmap funcs/structs
  netlabel: fix the catmap walking functions
  netlabel: fix the horribly broken catmap functions
  netlabel: fix a problem when setting bits below the previously lowest bit
  PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
  tpm: simplify code by using %*phN specifier
  tpm: Provide a generic means to override the chip returned timeouts
  tpm: missing tpm_chip_put in tpm_get_random()
  tpm: Properly clean sysfs entries in error path
  tpm: Add missing tpm_do_selftest to ST33 I2C driver
  PKCS#7: Use x509_request_asymmetric_key()
  Revert "selinux: fix the default socket labeling in sock_graft()"
  X.509: x509_request_asymmetric_keys() doesn't need string length arguments
  PKCS#7: fix sparse non static symbol warning
  KEYS: revert encrypted key change
  ima: add support for measuring and appraising firmware
  firmware_class: perform new LSM checks
  security: introduce kernel_fw_from_file hook
  PKCS#7: Missing inclusion of linux/err.h
  ...

ide: use module_platform_driver()

Eliminate boilerplate code by using module_platform_driver().

Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sparc64: Fix up merge thinko.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/sparc

Conflicts:
arch/sparc/mm/init_64.c

Conflict was simple non-overlapping additions.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Conflicts:
drivers/net/Makefile
net/ipv6/sysctl_net_ipv6.c

Two ipv6_table_template[] additions overlap, so the index
of the ipv6_table[x] assignments needed to be adjusted.

In the drivers/net/Makefile case, we've gotten rid of the
garbage whereby we had to list every single USB networking
driver in the top-level Makefile, there is just one
"USB_NETWORKING" that guards everything.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'timers-core-for-linus' of git://git./linux/kernel/git/tip/tip

Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co

   - Core timekeeping code is year-2038 safe now for 32bit machines.
     Now we just need to fix all in kernel users and the gazillion of
     user space interfaces which rely on timespec/timeval :)

   - Better cache layout for the timekeeping internal data structures.

   - Proper nanosecond based interfaces for in kernel users.

   - Tree wide cleanup of code which wants nanoseconds but does hoops
     and loops to convert back and forth from timespecs.  Some of it
     definitely belongs into the ugly code museum.

   - Consolidation of the timekeeping interface zoo.

   - A fast NMI safe accessor to clock monotonic for tracing.  This is a
     long standing request to support correlated user/kernel space
     traces.  With proper NTP frequency correction it's also suitable
     for correlation of traces across separate machines.

   - Checkpoint/restart support for timerfd.

   - A few NOHZ[_FULL] improvements in the [hr]timer code.

   - Code move from kernel to kernel/time of all time* related code.

   - New clocksource/event drivers from the ARM universe.  I'm really
     impressed that despite an architected timer in the newer chips SoC
     manufacturers insist on inventing new and differently broken SoC
     specific timers.

[ Ed. "Impressed"? I don't think that word means what you think it means ]

   - Another round of code move from arch to drivers.  Looks like most
     of the legacy mess in ARM regarding timers is sorted out except for
     a few obnoxious strongholds.

   - The usual updates and fixlets all over the place"
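
The year-2038 item above comes down to counter width: a signed 32-bit count
of seconds since 1970 reaches its maximum at 03:14:07 UTC on 19 January 2038.
The sketch below shows the wrap with hypothetical helpers; it is not the
timekeeping code.

```c
#include <stdint.h>

/*
 * Advance a 32-bit seconds counter with explicit wraparound. The final
 * conversion back to int32_t is implementation-defined in C; common
 * compilers such as GCC and Clang wrap modulo 2^32.
 */
static int32_t sketch_add_secs32(int32_t t, uint32_t delta)
{
	return (int32_t)((uint32_t)t + delta);
}

/* The same step with a 64-bit counter has ample headroom. */
static int64_t sketch_add_secs64(int64_t t, uint32_t delta)
{
	return t + (int64_t)delta;
}
```

One second past INT32_MAX, the 32-bit counter lands on INT32_MIN (a date in
1901), while the 64-bit counter simply moves on to 2147483648.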

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  timekeeping: Fixup typo in update_vsyscall_old definition
  clocksource: document some basic timekeeping concepts
  timekeeping: Use cached ntp_tick_length when accumulating error
  timekeeping: Rework frequency adjustments to work better w/ nohz
  timekeeping: Minor fixup for timespec64->timespec assignment
  ftrace: Provide trace clocks monotonic
  timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
  seqcount: Add raw_write_seqcount_latch()
  seqcount: Provide raw_read_seqcount()
  timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
  timekeeping: Create struct tk_read_base and use it in struct timekeeper
  timekeeping: Restructure the timekeeper some more
  clocksource: Get rid of cycle_last
  clocksource: Move cycle_last validation to core code
  clocksource: Make delta calculation a function
  wireless: ath9k: Get rid of timespec conversions
  drm: vmwgfx: Use nsec based interfaces
  drm: i915: Use nsec based interfaces
  timekeeping: Provide ktime_get_raw()
  hangcheck-timer: Use ktime_get_ns()
  ...

Merge branch 'irq-core-for-linus' of git://git./linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
"Nothing spectacular from the irq department this time:
   - overhaul of the crossbar chip driver
   - overhaul of the spear shirq chip driver
   - support for the atmel-aic chip
   - code move from arch to drivers
   - the usual tiny fixlets
   - two reverts worth mentioning, which undo the too simple attempt at
     supporting wakeup interrupts on shared interrupt lines"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  Revert "irq: Warn when shared interrupts do not match on NO_SUSPEND"
  Revert "PM / sleep / irq: Do not suspend wakeup interrupts"
  irq: Warn when shared interrupts do not match on NO_SUSPEND
  irqchip: atmel-aic: Define irq fixups for atmel SoCs
  irqchip: atmel-aic: Implement RTC irq fixup
  irqchip: atmel-aic: Add irq fixup infrastructure
  irqchip: atmel-aic: Add atmel AIC/AIC5 drivers
  irqchip: atmel-aic: Move binding doc to interrupt-controller directory
  genirq: generic chip: Export irq_map_generic_chip function
  PM / sleep / irq: Do not suspend wakeup interrupts
  irqchip: or1k-pic: Migrate from arch/openrisc/
  irqchip: crossbar: Allow for quirky hardware with direct hardwiring of GIC
  documentation: dt: omap: crossbar: Add description for interrupt consumer
  irqchip: crossbar: Introduce centralized check for crossbar write
  irqchip: crossbar: Introduce ti, max-crossbar-sources to identify valid crossbar mapping
  irqchip: crossbar: Add kerneldoc for crossbar_domain_unmap callback
  irqchip: crossbar: Set cb pointer to null in case of error
  irqchip: crossbar: Change the goto naming
  irqchip: crossbar: Return proper error value
  irqchip: crossbar: Fix kerneldoc warning
  ...

x86: MCE: Add raw_lock conversion again

Commit ea431643d6c3 ("x86/mce: Fix CMCI preemption bugs") breaks RT by
the completely unrelated conversion of the cmci_discover_lock to a
regular (non raw) spinlock.  This lock was annotated in commit
59d958d2c7de ("locking, x86: mce: Annotate cmci_discover_lock as raw")
with a proper explanation why.

The argument for converting the lock back to a regular spinlock was:

- it does percpu ops without disabling preemption. Preemption is not
   disabled due to the mistaken use of a raw spinlock.

Which is complete nonsense.  The raw_spinlock is disabling preemption in
the same way as a regular spinlock.  In mainline spinlock maps to
raw_spinlock, in RT spinlock becomes a "sleeping" lock.

raw_spinlock has on RT exactly the same semantics as in mainline.  And
because this lock is taken in non preemptible context it must be raw on
RT.

Undo the locking brainfart.

Reported-by: Clark Williams <williams@redhat.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cxgb4 : Disable recursive mailbox commands when enabling vi

Enabling a Virtual Interface can result in an interrupt during the processing
of the VI Enable command and, in some paths, result in an attempt to issue
another command in the interrupt context, eventually crashing the system. Thus,
we disable interrupts during the course of the VI Enable command and ensure
enable doesn't sleep.

Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: reduce USB network driver config options.

USB network drivers are already handled in drivers/net/usb/Kconfig.
Let's save the maintenance burden of dependencies in drivers/net/Makefile.

The newly introduced USB_NET_DRIVERS umbrella config option defaults
to 'y' so as to minimize the changes of behavior.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: Modify tg3_tso_bug() to handle multiple TX rings

tg3_tso_bug() was originally designed to handle only HW TX ring 0. Commit
d3f6f3a1d818410c17445bce4f4caab52eb102f1 ("tg3: Prevent page allocation failure
during TSO workaround") changed the driver logic to use tg3_tso_bug() for all
HW TX rings that are enabled. This patch fixes the regression by modifying
tg3_tso_bug() to handle multiple HW TX rings.

Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'amd-xgbe'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver update 2014-08-05

The following series of patches includes fixes/updates to the driver.

- Use dma_set_mask_and_coherent to set the DMA mask
- Move the phy connect/disconnect logic to allow for module unloading

Changes in V2:
- Check the return value of the dma_set_mask_and_coherent call
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Perform phy connect/disconnect at dev open/stop

A change added to the mdiobus/phy api added a module_get/module_put
during phy connect/disconnect processing. Currently, the driver
performs a phy connect during module probe and a phy disconnect during
module remove. With the addition of the module_get during phy connect
the amd-xgbe module use count is incremented and can no longer be
unloaded.

Move the phy connect/disconnect from the driver probe/remove functions
to the net_device_ops ndo_open/ndo_stop functions. This allows the
module use count to be decremented when the device(s) are brought down
and allows the module to be unloaded.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask

Use the dma_set_mask_and_coherent function to set the DMA mask rather
than setting the DMA mask fields directly. This was originally done
to work around a bug in the arm64 DMA support when RAM started above
the 4GB boundary which has since been fixed.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sun4i-emac: fix memory leak on bad packet

Upon reception of a new frame, the emac driver checks for a number
of error conditions, and flags the packet as "bad" if any of these
are present. It then allocates a skb unconditionally, but only uses
it if the packet is "good". On the error path, the skb is just forgotten,
and the system leaks memory.

The piece of junk I have on my desk seems to encounter such errors
frequently enough that the box goes OOM after a couple of days,
which makes me grumpy.

Fix this by moving the allocation on the "good_packet" path (and
convert it to netdev_alloc_skb while we're at it).
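
The shape of the leak and of the fix can be sketched in userspace. The
allocator below only counts outstanding buffers and hands out a static
scratch area; every name is hypothetical and this is not the sun4i-emac
receive path.

```c
static unsigned char sketch_scratch[2048];
static int sketch_outstanding;	/* buffers allocated but not yet released */

static void *sketch_alloc(void)
{
	sketch_outstanding++;
	return sketch_scratch;
}

static void sketch_release(void *buf)
{
	(void)buf;
	sketch_outstanding--;
}

/* Leaky shape: allocate unconditionally, use the buffer only if good. */
static void sketch_rx_leaky(int good_packet)
{
	void *buf = sketch_alloc();

	if (!good_packet)
		return;		/* buf is forgotten on the error path */
	sketch_release(buf);	/* stands in for handing the buffer on */
}

/* Fixed shape: allocate only on the good-packet path. */
static void sketch_rx_fixed(int good_packet)
{
	void *buf;

	if (!good_packet)
		return;
	buf = sketch_alloc();
	sketch_release(buf);
}
```

Each bad packet leaves one buffer outstanding in the leaky variant, which is
how a steady trickle of bad frames can exhaust memory over a few days.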

Tested on a random Allwinner A20 board.

Cc: Stefan Roese <sr@denx.de>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: <stable@vger.kernel.org> # 3.11+
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: fix possible seqlock deadlock in sctp_packet_transmit()

Dave reported the following splat, caused by improper use of
IP_INC_STATS_BH() in process context.

BUG: using __this_cpu_add() in preemptible [00000000] code: trinity-c117/14551
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 3 PID: 14551 Comm: trinity-c117 Not tainted 3.16.0+ #33
ffffffff9ec898f0 0000000047ea7e23 ffff88022d32f7f0 ffffffff9e7ee207
0000000000000003 ffff88022d32f818 ffffffff9e397eaa ffff88023ee70b40
ffff88022d32f970 ffff8801c026d580 ffff88022d32f828 ffffffff9e397ee3
Call Trace:
[<ffffffff9e7ee207>] dump_stack+0x4e/0x7a
[<ffffffff9e397eaa>] check_preemption_disabled+0xfa/0x100
[<ffffffff9e397ee3>] __this_cpu_preempt_check+0x13/0x20
[<ffffffffc0839872>] sctp_packet_transmit+0x692/0x710 [sctp]
[<ffffffffc082a7f2>] sctp_outq_flush+0x2a2/0xc30 [sctp]
[<ffffffff9e0d985c>] ? mark_held_locks+0x7c/0xb0
[<ffffffff9e7f8c6d>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[<ffffffffc082b99a>] sctp_outq_uncork+0x1a/0x20 [sctp]
[<ffffffffc081e112>] sctp_cmd_interpreter.isra.23+0x1142/0x13f0 [sctp]
[<ffffffffc081c86b>] sctp_do_sm+0xdb/0x330 [sctp]
[<ffffffff9e0b8f1b>] ? preempt_count_sub+0xab/0x100
[<ffffffffc083b350>] ? sctp_cname+0x70/0x70 [sctp]
[<ffffffffc08389ca>] sctp_primitive_ASSOCIATE+0x3a/0x50 [sctp]
[<ffffffffc083358f>] sctp_sendmsg+0x88f/0xe30 [sctp]
[<ffffffff9e0d673a>] ? lock_release_holdtime.part.28+0x9a/0x160
[<ffffffff9e0d62ce>] ? put_lock_stats.isra.27+0xe/0x30
[<ffffffff9e73b624>] inet_sendmsg+0x104/0x220
[<ffffffff9e73b525>] ? inet_sendmsg+0x5/0x220
[<ffffffff9e68ac4e>] sock_sendmsg+0x9e/0xe0
[<ffffffff9e1c0c09>] ? might_fault+0xb9/0xc0
[<ffffffff9e1c0bae>] ? might_fault+0x5e/0xc0
[<ffffffff9e68b234>] SYSC_sendto+0x124/0x1c0
[<ffffffff9e0136b0>] ? syscall_trace_enter+0x250/0x330
[<ffffffff9e68c3ce>] SyS_sendto+0xe/0x10
[<ffffffff9e7f9be4>] tracesys+0xdd/0xe2

This is a followup of commits f1d8cba61c3c4b ("inet: fix possible
seqlock deadlocks") and 7f88c6b23afbd315 ("ipv6: fix possible seqlock
deadlock in ip6_finish_output2")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "net: phy: Set the driver when registering an MDIO bus device"

Commit a71e3c37960ce5f9 ("net: phy: Set the driver when registering an MDIO bus
device") caused the following regression on the fec driver:

root@imx6qsabresd:~# echo mem > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.003 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
Unable to handle kernel NULL pointer dereference at virtual address 0000002c
pgd = bcd14000
[0000002c] *pgd=4d9e0831, *pte=00000000, *ppte=00000000
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 617 Comm: sh Not tainted 3.16.0 #17
task: bc0c4e00 ti: bceb6000 task.ti: bceb6000
PC is at fec_suspend+0x10/0x70
LR is at dpm_run_callback.isra.7+0x34/0x6c
pc : [<803f8a98>]    lr : [<80361f44>]    psr: 600f0013
sp : bceb7d70  ip : bceb7d88  fp : bceb7d84
r10: 8091523c  r9 : 00000000  r8 : bd88f478
r7 : 803f8a88  r6 : 81165988  r5 : 00000000  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : bd88f478  r0 : bd88f478
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: 4cd1404a  DAC: 00000015
Process sh (pid: 617, stack limit = 0xbceb6240)
Stack: (0xbceb7d70 to 0xbceb8000)
....

The problem with the original commit is explained by Russell King:

"It has the effect (as can be seen from the oops) of attaching the MDIO bus
device (itself is a bus-less device) to the platform driver, which means
that if the platform driver supports power management, it will be called
to power manage the MDIO bus device.

Moreover, drivers do not expect to be called for power management
operations for devices which they haven't probed, and certainly not for
devices which aren't part of the same bus that the driver is registered
against."

This reverts commit a71e3c37960ce5f9c6a519bc1215e3ba9fa83e75.

Cc: <stable@vger.kernel.org> #3.16
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine

Need to turn off SGE RX/TX Callback Timers & interrupt in cxgb4vf PCI Shutdown
routine in order to prevent crashes during reboot/poweroff when traffic is
running.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
pull request: batman-adv 2014-08-05

this is a pull request intended for net-next/linux-3.17 (yeah..it's really
late).

Patches 1, 2 and 4 are really minor changes:
- kmalloc_array is substituted for kmalloc when possible (as suggested by
checkpatch);
- net_ratelimited() is now used properly and the "suppressed" message is not
printed anymore if not needed;
- the internal version number has been increased to reflect our current version.

Patch 3 instead is introducing a change in the metric computation function
by changing the penalty applied at each mesh hop from 15/255 (~6%) to
30/255 (~11%). This change is introduced by Simon Wunderlich after having
observed a performance improvement in several networks when using the new value.
====================
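
To get a feel for the penalty change described in the pull request above,
the sketch below scales a link-quality value by (255 - penalty) / 255 at
each hop using integer arithmetic. That formula is an assumption about how
the penalty is applied, and the function names are hypothetical.

```c
#define SKETCH_TQ_MAX 255u

/* Apply one hop's penalty to a quality value in the range 0..255. */
static unsigned int sketch_apply_hop_penalty(unsigned int tq,
					     unsigned int penalty)
{
	return tq * (SKETCH_TQ_MAX - penalty) / SKETCH_TQ_MAX;
}

/* Apply the same penalty across several hops. */
static unsigned int sketch_tq_after_hops(unsigned int tq, unsigned int penalty,
					 unsigned int hops)
{
	while (hops--)
		tq = sketch_apply_hop_penalty(tq, penalty);
	return tq;
}
```

Starting from a perfect value of 255, two hops leave 225 with a penalty of 15
and 198 with a penalty of 30, so longer paths are discounted noticeably
faster under the new value.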

Signed-off-by: David S. Miller <davem@davemloft.net>

team: Simplify return path of team_newlink

The variable "err" is not necessary.
Return register_netdevice() directly.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
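
The shape of the team_newlink cleanup can be sketched as follows. This is not
the driver code: struct net_device_stub and register_stub() are hypothetical
stand-ins for the real device structure and register_netdevice(), used only
to show that the two return paths behave identically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device; result is what registration returns. */
struct net_device_stub {
	int result;
};

/* Hypothetical stand-in for register_netdevice(): 0 or a negative errno. */
int register_stub(struct net_device_stub *dev)
{
	assert(dev != NULL);
	return dev->result;
}

/* Before: a temporary that only carries the return value. */
int newlink_before(struct net_device_stub *dev)
{
	int err;

	err = register_stub(dev);
	if (err)
		return err;
	return 0;
}

/* After: return the callee's result directly. */
int newlink_after(struct net_device_stub *dev)
{
	return register_stub(dev);
}
```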

bridge: Update outdated comment on promiscuous mode

Now that bridge ports can be non-promiscuous, vlan_vid_add() is no longer an
unnecessary operation.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'v4l_for_linus' of git://git./linux/kernel/git/mchehab/linux-media

Pull media updates from Mauro Carvalho Chehab:
- removal of sn9c102.  This device driver was replaced a long time ago
   by gspca
- solo6x10 and go7007 webcam drivers moved from staging into
   mainstream.  They were waiting for an API to allow setting the image
   detection matrix
- SDR drivers moved from staging into mainstream: sdr-msi3101 (renamed
   as msi2500) and rtl2832
- added SDR driver for airspy
- added demod driver: si2165
- rework of several parts of the RC subsystem, moving the code for the
   RC-5 SZ variant into the standard RC5 decoder
- added decoder for the XMP IR protocol
- tuner driver moved from staging into mainstream: msi3101 (renamed as
   msi001)
- added documentation for some additional SDR pixfmt
- some device tree bindings documented
- added support for exynos3250 at s5p-jpeg
- remove the obsolete, unmaintained and broken mx1_camera driver
- added support for remote controllers at au0828 driver
- added a RC driver: sunxi-cir
- several driver fixes, enhancements and cleanups.

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (455 commits)
  [media] cx23885: fix UNSET/TUNER_ABSENT confusion
  [media] coda: fix build error by making reset control optional
  [media] radio-miropcm20: fix sparse NULL pointer warning
  [media] MAINTAINERS: Update go7007 pattern
  [media] MAINTAINERS: Update solo6x10 patterns
  [media] media: atmel-isi: add primary DT support
  [media] media: atmel-isi: convert the pdata from pointer to structure
  [media] media: atmel-isi: add v4l2 async probe support
  [media] rcar_vin: add devicetree support
  [media] media: pxa_camera device-tree support
  [media] media: mt9m111: add device-tree suppport
  [media] soc_camera: add support for dt binding soc_camera drivers
  [media] media: soc_camera: pxa_camera documentation device-tree support
  [media] media: mt9m111: add device-tree documentation
  [media] s5p-mfc: remove unnecessary calling to function video_devdata()
  [media] s5p-jpeg: add chroma subsampling adjustment for Exynos3250
  [media] s5p-jpeg: Prevent erroneous downscaling for Exynos3250 SoC
  [media] s5p-jpeg: Assure proper crop rectangle initialization
  [media] s5p-jpeg: fix g_selection op
  [media] s5p-jpeg: Adjust jpeg_bound_align_image to Exynos3250 needs
  ...

Merge branch 'net-timestamp-next'

Willem de Bruijn says:

====================
net-timestamp: new tx tstamps and tcp

Extend socket tx timestamping:
- allow multiple types of software timestamps aside from send (1)
- add software timestamp on enter packet scheduling (4)
- add software timestamp for TCP (5)
- add software timestamp for TCP on ACK (6)

The sk_flags option space is nearly exhausted. Also move the
many timestamp options to a new sk->sk_tsflags (2).

To disambiguate data when tstamps may arrive out of order,
optionally return a sequential ID assigned at send (3).

Extend Linux tx timestamping to monitoring of latency
incurred within the kernel stack and to protocols embedded in TCP.
Complex kernel setups may have multiple layers of queueing, including
multiple instances of packet scheduling, and many classes per layer.
Many applications embed discrete payloads into TCP bytestreams for
reliability, flow control, etcetera. Detecting application tail
latency in such scenarios relies on identifying the exact queue
responsible if on the host, or the network latency if otherwise.

Changelog:
v4->v5
  - define SCM_TSTAMP_SND == 0, for legacy behavior
  - add TCP tstamps without changing the generated byte stream
    - modify GSO and ACK to find offset: slightly more complex
      than previous invariant that it is the last byte
  - consistent naming of packet scheduling
    - rename SCM_TSTAMP_ENQ to SCM_TSTAMP_SCHED
  - add unique key in ee_data
  - add id field in ee_info to disambiguate tstamps
    - optional, only on new flag SOF_TIMESTAMPING_OPT_ID
    - for bytestream, in bytes

v3->v4
  - (v3 review comment) removed skb->mark packet identification (*A)
  - (v3 review comment) fixed indentation
  - tcp: fixed poll() to return POLLERR on non-zero queue
  - rebased to work without syststamp
  - comments: removed all traces of MSG_TSTAMP_.. (*B)

v2->v3
  - extend the SO_TIMESTAMPING API, instead of defining a new one.
  - add protocol independent support to correlate tstamps with data,
    based on returning skb->mark.
  - removed no-payload optimization and documentation (for now):

    I have a follow-on patch that reintroduces MSG_TSTAMP along with a
    new socket option SOF_TIMESTAMPING_OPT_ONFLAG. This is equivalent
    to sequence setsockopt(<enable>); send(..); setsockopt(<disable>),
    but avoids the need to define a MSG_TSTAMP_<TYPE> for each type.

    I will leave these three patches as follow-on, as this patchset is
    large enough as is.

v1->v2
  - expand timestamping (existing and new) to SOCK_RAW and ping sockets
  - rename sock_errqueue_timestamping to scm_timestamping
  - change timestamp data format: do not add fields to scm_timestamping.
      Doing so could break legacy applications. Instead, communicate
      through an existing, but unused, field in the error message.
  - rename SOF_.._OPT_TX_NO_PAYLOAD to shorter SOF_.._OPT_TSONLY
  - move msg_tstamp test app out of patchset and to github
      git://github.com/wdebruij/kerneltools.git
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: ACK timestamp for bytestreams

Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
in the send() call is acknowledged. It implements the feature for TCP.

The timestamp is generated when the TCP socket cumulative ACK is moved
beyond the tracked seqno for the first time. The feature ignores SACK
and FACK, because those acknowledge the specific byte, but not
necessarily the entire contents of the buffer up to that byte.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
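
The rule described for SOF_TIMESTAMPING_TX_ACK (fire once, when the
cumulative ACK first moves beyond the tracked seqno, ignoring SACK) can be
sketched in plain C. The struct ack_tracker and both function names are
hypothetical; the sketch only models the comparison, with wraparound-safe
32-bit sequence arithmetic, and is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe test: is sequence number a strictly after b? */
bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical per-request state: the seqno of the last byte of one send(). */
struct ack_tracker {
	uint32_t tracked_seq;
	bool armed;
};

/*
 * Called with the cumulative ACK point only; SACK blocks are deliberately
 * not an input. Returns true exactly once, the first time the cumulative
 * ACK has moved beyond the tracked byte.
 */
bool on_cumulative_ack(struct ack_tracker *t, uint32_t snd_una)
{
	assert(t != NULL);
	if (!t->armed || !seq_after(snd_una, t->tracked_seq))
		return false;
	t->armed = false;
	return true;
}
```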

net-timestamp: TCP timestamping

TCP timestamping extends SO_TIMESTAMPING to bytestreams.

Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call on a bytestream as
a request for a timestamp for the last byte in that send() buffer.

The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.

This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.

This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.

If ID generation in ee_data is enabled, bytestream timestamps return a
byte offset, instead of the packet counter for datagrams.

The implementation supports a single timestamp per packet. It silently
replaces requests for previous timestamps. To avoid missing tstamps,
flush the tcp queue by disabling Nagle, cork and autocork. Missing
tstamps can be detected by offset when the ee_data ID is enabled.

Implementation details:

- On GSO, the timestamping code can be included in the main loop. I
moved it into its own loop to reduce the impact on the common case
to a single branch.

- To avoid leaking the absolute seqno to userspace, the offset
returned in ee_data must always be relative. It is an offset between
an skb and sk field. The first is always set (also for GSO & ACK).
The second must also never be uninitialized. Only allow the ID
option on sockets in the ESTABLISHED state, for which the seqno
is available. Never reset it to zero (instead, move it to the
current seqno when reenabling the option).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
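
The relative-offset rule from the implementation details above can be
illustrated with a small sketch. It assumes the reported value is simply the
difference between a per-packet seqno and a per-socket base seqno captured
when the ID option is enabled; struct stream_sock and the function names are
hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-socket state: seqno captured when the option was enabled. */
struct stream_sock {
	uint32_t base_seq;
};

/* (Re)enabling the option moves the base to the current seqno, never to 0. */
void enable_id_option(struct stream_sock *sk, uint32_t current_seq)
{
	assert(sk != NULL);
	sk->base_seq = current_seq;
}

/*
 * Value reported to userspace: a byte offset relative to the base, so the
 * absolute seqno is never exposed. Unsigned subtraction handles wraparound.
 */
uint32_t relative_offset(const struct stream_sock *sk, uint32_t skb_seq)
{
	return skb_seq - sk->base_seq;
}
```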

net-timestamp: SCHED timestamp on entering packet scheduler

Kernel transmit latency is often incurred in the packet scheduler.
Introduce a new timestamp on transmission just before entering the
scheduler. When data travels through multiple devices (bonding,
tunneling, ...) each device will export an individual timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: add key to disambiguate concurrent datagrams

Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to uniquely
correlate looped data with the original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it to be 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
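
A minimal model of the ID mechanism described above, assuming the counter
starts at zero and is reset only on a disabled-to-enabled transition of the
flag. struct ts_sock, set_opt_id() and on_send() are hypothetical names for
illustration; the return value of on_send() stands for what would be passed
in ee_data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-socket state for the ID option. */
struct ts_sock {
	uint32_t next_id;
	bool opt_id;
};

/* Toggle the option; the counter restarts each time it is re-enabled. */
void set_opt_id(struct ts_sock *sk, bool enable)
{
	assert(sk != NULL);
	if (enable && !sk->opt_id)
		sk->next_id = 0;
	sk->opt_id = enable;
}

/*
 * Called once per timestamped send(). Without the option the value stays 0,
 * which is what existing applications expect to find in ee_data.
 */
uint32_t on_send(struct ts_sock *sk)
{
	assert(sk != NULL);
	if (!sk->opt_id)
		return 0;
	return sk->next_id++;
}
```

Because the ID is assigned at send time, timestamps for sends 0, 1 and 2 may
still be read from the error queue in a different order, which is why the
application matches on the ID rather than on arrival order.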

net-timestamp: move timestamp flags out of sk_flags

sk_flags is reaching its limit. New timestamping options will not fit.
Move all of them into a new field sk->sk_tsflags.

Added benefit is that this removes boilerplate code to convert between
SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.

SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
timestamp logic (netstamp_needed). That can be simplified and this
last key removed, but will leave that for a separate patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
----

The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
though that scatters tstamp fields throughout the struct.
Signed-off-by: David S. Miller <davem@davemloft.net>

net-timestamp: extend SCM_TIMESTAMPING ancillary data struct

Applications that request kernel tx timestamps with SO_TIMESTAMPING
read timestamps as recvmsg() ancillary data. The response is defined
implicitly as timespec[3].

1) define struct scm_timestamping explicitly and

2) add support for new tstamp types. On tx, scm_timestamping always
   accompanies a sock_extended_err. Define previously unused field
   ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
   define the existing behavior.

The reception path is not modified. On rx, no struct similar to
sock_extended_err is passed along with SCM_TIMESTAMPING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
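
A sketch of how a reader of the error queue might interpret the two pieces
described above: the explicit timespec[3] layout and the timestamp type
carried in ee_info. To stay self-contained it uses local mirrors rather than
the kernel headers; the _sketch names are hypothetical, and the numeric
values for SCHED and ACK are assumptions (only SND == 0 is stated in the
series changelog).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Local mirror of the explicit layout: three timespecs, software one first. */
struct scm_timestamping_sketch {
	struct timespec ts[3];
};

/* Local mirror of the type signalled in ee_info (SCHED/ACK values assumed). */
enum tstamp_type_sketch {
	TSTAMP_SND = 0,
	TSTAMP_SCHED = 1,
	TSTAMP_ACK = 2
};

/* Map the ee_info value to a printable name. */
const char *tstamp_type_name(unsigned int ee_info)
{
	switch (ee_info) {
	case TSTAMP_SND:
		return "SND";
	case TSTAMP_SCHED:
		return "SCHED";
	case TSTAMP_ACK:
		return "ACK";
	default:
		return "UNKNOWN";
	}
}

/* Convert ts[0], whose meaning ee_info describes, to nanoseconds. */
long long ts0_nanoseconds(const struct scm_timestamping_sketch *t)
{
	assert(t != NULL);
	return (long long)t->ts[0].tv_sec * 1000000000LL + t->ts[0].tv_nsec;
}
```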

cxgb4i : Move stray CPL definitions to cxgb4 driver

These belong in the t4 msg header; moving them there ensures there is no
accidental code duplication in the future.

Signed-off-by: Anish Bhatt <anish@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: reduce spurious retransmits due to transient SACK reneging

This commit reduces spurious retransmits due to apparent SACK reneging
by only reacting to SACK reneging that persists for a short delay.

When a sequence space hole at snd_una is filled, some TCP receivers
send a series of ACKs as they apparently scan their out-of-order queue
and cumulatively ACK all the packets that have now been consecutively
received. This is essentially misbehavior B in "Misbehaviors in TCP
SACK generation" ACM SIGCOMM Computer Communication Review, April
2011, so we suspect that this is from several common OSes (Windows
2000, Windows Server 2003, Windows XP). However, this issue has also
been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
into spurious retransmissions by lack of timestamps?" from March 2014,
where the receiver was thought to be a BSD box.

Since snd_una would temporarily be adjacent to a previously SACKed
range in these scenarios, this receiver behavior triggered the Linux
SACK reneging code path in the sender. This led the sender to clear
the SACK scoreboard, enter CA_Loss, and spuriously retransmit
(potentially) every packet from the entire write queue at line rate
just a few milliseconds before the ACK for each packet arrives at the
sender.

To avoid such situations, now when a sender sees apparent reneging it
does not yet retransmit, but rather adjusts the RTO timer to give the
receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
that will restore sanity to the SACK scoreboard. If the reneging
persists until this RTO then, as before, we clear the SACK scoreboard
and enter CA_Loss.

A 10ms delay tolerates a receiver sending such a stream of ACKs at
56Kbit/sec. And to allow for receivers with slower or more congested
paths, we wait for at least RTT/2.

We validated the resulting max(RTT/2, 10ms) delay formula with a mix
of North American and South American Google web server traffic, and
found that for ACKs displaying transient reneging:

(1) 90% of inter-ACK delays were less than 10ms
(2) 99% of inter-ACK delays were less than RTT/2

In tests on Google web servers this commit reduced reneging events by
75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
any measurable impact on latency for user HTTP and SPDY requests.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
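
The delay formula max(RTT/2, 10ms) from the commit above can be written as a
small helper. The function names and the millisecond units are illustrative,
not the kernel's. The second helper estimates how long one ACK takes to
serialize on a slow link, as a rough check on the 56Kbit/sec remark; the ACK
sizes used with it are assumptions, since the commit does not state one.

```c
#include <assert.h>

#define RENEG_MIN_DELAY_MS 10u  /* floor from the commit message */

/* Delay before acting on apparent reneging: max(RTT/2, 10ms). */
unsigned int reneging_delay_ms(unsigned int rtt_ms)
{
	unsigned int half_rtt = rtt_ms / 2;

	return half_rtt > RENEG_MIN_DELAY_MS ? half_rtt : RENEG_MIN_DELAY_MS;
}

/* Microseconds needed to put one ACK of ack_bytes on a link of the given rate. */
unsigned long long ack_serialization_us(unsigned int ack_bytes,
					unsigned int bits_per_sec)
{
	assert(bits_per_sec > 0);
	return (unsigned long long)ack_bytes * 8u * 1000000u / bits_per_sec;
}
```

With an assumed 40-byte ACK, one ACK occupies a 56 kbit/s link for about
5.7 ms, and even an assumed 64-byte ACK with options stays under the 10 ms
floor, so consecutive ACKs can arrive within the grace period.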

Merge branch 'qlcnic'

Rajesh Borundia says:

====================
qlcnic: Bug fixes

The patch series contains following bug fixes.

* Aggregating tx stats in adapter variable was resulting
  in increase of stats when user runs ifconfig command
  and no traffic is running. Instead aggregate tx stats
  in local variable and then assign it to adapter struct
  variable.
* Set_driver_version was called after registering netdev
  which was resulting in a race between FLR in open
  handler and set_driver_version command as open handler
  can be called simultaneously on another cpu even if probe
  is not complete. So call this command before registering
  netdev.
* dcbnl_ops should be initialized before registering netdev
  as they are referenced in open handler.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Initialize dcbnl_ops before register_netdev

o Initialization of dcbnl_ops after register netdev may result in
dcbnl_ops not getting set before it is being accessed from open.
So, moving it before register_netdev.

Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Set driver version before registering netdev

o Earlier, set_drv_version was getting called after register_netdev.
  This was resulting in a race between set_drv_version and FLR called
  from open(). Moving set_drv_version before register_netdev avoids
  the race.

o Log response code in error message on CDRP failure.

Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlcnic: Fix update of ethtool stats.

o Aggregating tx stats in adapter variable was resulting in
  an increase in stats even after no traffic was run and
  user runs ifconfig/ethtool command.
o qlcnic_update_stats used to accumulate stats in adapter
  struct at each function call, instead accumulate tx stats
  in local variable and then assign it to adapter structure.

Reported-by: Holger Kiehl <holger.kiehl@dwd.de>
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
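
The stats bug and its fix can be modeled with a short sketch. The structures
and function names below are hypothetical and much simpler than the driver's;
they only contrast accumulating into the long-lived adapter field on every
query with summing into a local variable and assigning the result.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TX_RINGS 4  /* illustrative ring count */

/* Hypothetical per-ring and per-adapter counters. */
struct tx_ring_sketch {
	unsigned long long xmit_called;
};

struct adapter_sketch {
	struct tx_ring_sketch rings[NUM_TX_RINGS];
	unsigned long long xmit_called_total;
};

/* Buggy pattern: each query adds the ring counters to the stored total again. */
void update_stats_accumulating(struct adapter_sketch *a)
{
	assert(a != NULL);
	for (int i = 0; i < NUM_TX_RINGS; i++)
		a->xmit_called_total += a->rings[i].xmit_called;
}

/* Fixed pattern: sum into a local, then assign, so repeated queries agree. */
void update_stats_local(struct adapter_sketch *a)
{
	unsigned long long sum = 0;

	assert(a != NULL);
	for (int i = 0; i < NUM_TX_RINGS; i++)
		sum += a->rings[i].xmit_called;
	a->xmit_called_total = sum;
}
```

In this model, calling the accumulating variant twice with no new traffic
doubles the reported total, which matches the symptom of stats rising on
each ifconfig/ethtool run.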

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
pull request net: batman-adv 2014-08-04

this is a pull request intended for net.

It just contains a patch by Sven Eckelmann that fixes the
reception of out-of-order fragments. As explained in the
commit message, the issue was due to a wrong assumption
about hlist_for_each_entry() in batadv_frag_insert_packet().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'regulator-v3.17' of git://git./linux/kernel/git/broonie/regulator

Pull regulator updates from Mark Brown:
"A couple of nice new features this month, the ability to map
  regulators in order to allow voltage control by external coprocessors
  is something people have been asking for for a long time.

   - improved support for switch only "regulators", allowing current
     state to be read from the parent regulator but no setting.

   - support for obtaining the register access method used to set
     voltages, for use in systems which can offload control of this to a
     coprocessor (typically for DVFS).

   - support for Active-Semi ACT8846, Dialog DA9211 and Texas Instruments
     TPS65917"

* tag 'regulator-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (58 commits)
  regulator: act8865: fix build when OF is not enabled
  regulator: act8865: add act8846 to DT binding documentation
  regulator: act8865: add support for act8846
  regulator: act8865: prepare support for other act88xx devices
  regulator: act8865: set correct number of regulators in pdata
  regulator: act8865: Remove error variable in act8865_pmic_probe
  regulator: act8865: fix parsing of platform data
  regulator: tps65090: Set voltage for fixed regulators
  regulator: core: Allow to get voltage count and list from parent
  regulator: core: Get voltage from parent if not available
  regulator: Add missing statics and inlines for stub functions
  regulator: lp872x: Don't set constraints within the regulator driver
  regmap: Fix return code for stub regmap_get_device()
  regulator: s2mps11: Update module description and Kconfig to add S2MPU02 support
  regulator: Add helpers for low-level register access
  regmap: Allow regmap_get_device() to be used by modules
  regmap: Add regmap_get_device
  regulator: da9211: Remove unnecessary devm_regulator_unregister() calls
  regulator: Add DT bindings for tps65218 PMIC regulators.
  regulator: da9211: new regulator driver
  ...

Merge tag 'spi-v3.17' of git://git./linux/kernel/git/broonie/spi

Pull spi updates from Mark Brown:
"A quiet release, more bug fixes than anything else.  A few things do
  stand out though:

   - updates to several drivers to move towards the standard GPIO chip
     select handling in the core.
   - DMA support for the SH MSIOF driver.
   - support for Rockchip SPI controllers (their first mainline
     submission)"

* tag 'spi-v3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (64 commits)
  spi: davinci: use spi_device.cs_gpio to store gpio cs per spi device
  spi: davinci: add support to configure gpio cs through dt
  spi/pl022: Explicitly truncate large bitmask
  spi/atmel: Fix pointer to int conversion warnings on 64 bit builds
  spi: davinci: fix to support more than 2 chip selects
  spi: topcliff-pch: don't hardcode PCI slot to get DMA device
  spi: orion: fix incorrect handling of cell-index DT property
  spi: orion: Fix error return code in orion_spi_probe()
  spi/rockchip: fix error return code in rockchip_spi_probe()
  spi/rockchip: remove redundant dev_err call in rockchip_spi_probe()
  spi/rockchip: remove duplicated include from spi-rockchip.c
  ARM: dts: fix the chip select gpios definition in the SPI nodes
  spi: s3c64xx: Update binding documentation
  spi: s3c64xx: use the generic SPI "cs-gpios" property
  spi: s3c64xx: Revert "spi: s3c64xx: Added provision for dedicated cs pin"
  spi: atmel: Use dmaengine_prep_slave_sg() API
  spi: topcliff-pch: Update error messages for dmaengine_prep_slave_sg() API
  spi: sh-msiof: Use correct device for DMA mapping with IOMMU
  spi: sh-msiof: Handle dmaengine_prep_slave_single() failures gracefully
  spi: rspi: Handle dmaengine_prep_slave_sg() failures gracefully
  ...

Merge branch 'xen-netback-next'

Zoltan Kiss says:

====================
xen-netback: Changes around carrier handling

This series starts using carrier off as a way to purge packets when the guest is
not able (or willing) to receive them. It is a much faster way to get rid of
packets waiting for an overwhelmed guest.
The first patch changes the places where the current netback code relies on
netif_carrier_ok.
The second turns off the carrier if the guest times out on a queue, and only
turns it on again once that queue (or queues) recovers.
====================

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>