Mike Kravetz [Wed, 22 Feb 2017 23:42:49 +0000 (15:42 -0800)]
userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support
userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
fault time. The data is copied from user space to a newly allocated
huge page. The new routine copy_huge_page_from_user performs this copy.
Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:46 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: wake userfaults after UFFDIO_UNREGISTER
Userfaults may still happen after the userfaultfd monitor thread
received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run.
Wake any pending userfault within UFFDIO_UNREGISTER protected by the
mmap_sem for writing, so they will not be reported to userland leading
to UFFDIO_COPY returning -EINVAL (as the range was already unregistered)
and they will not hang permanently either.
Link: http://lkml.kernel.org/r/20161216144821.5183-16-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:43 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: avoid MADV_DONTNEED race condition
MADV_DONTNEED must be notified to userland before the pages are zapped.
This allows userland to immediately stop adding pages to the userfaultfd
ranges before the pages are actually zapped or there could be
non-zeropage leftovers as result of concurrent UFFDIO_COPY run in
between zap_page_range and madvise_userfault_dontneed (both
MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so
they can run concurrently).
Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:40 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request
If the page is punched out of the address space the uffd reader should
know this and zeromap the respective area in case of the #PF event.
Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:37 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete()
Optimize the mremap_userfaultfd_complete() interface to pass only the
vm_userfaultfd_ctx pointer through the stack as a microoptimization.
Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:34 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add mremap() event
The event denotes that an area [start:end] moves to different location.
Length change isn't reported as "new" addresses, if they appear on the
uffd reader side they will not contain any data and the latter can just
zeromap them.
Waiting for the event ACK is also done outside of mmap sem, as for fork
event.
Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Wed, 22 Feb 2017 23:42:30 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users
Since commit
d2005e3f41d4 ("userfaultfd: don't pin the user memory in
userfaultfd_file_create()") userfaultfd uses mm_count rather than
mm_users to pin mm_struct.
Make dup_userfaultfd consistent with this behaviour
Link: http://lkml.kernel.org/r/20161216144821.5183-11-aarcange@redhat.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:27 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: Add fork() event
When the mm with uffd-ed vmas fork()-s the respective vmas notify their
uffds with the event which contains a descriptor with new uffd. This
new descriptor can then be used to get events from the child and
populate its mm with data. Note, that there can be different uffd-s
controlling different vmas within one mm, so first we should collect all
those uffds (and ctx-s) in a list and then notify them all one by one
but only once per fork().
The context is created at fork() time but the descriptor, file struct
and anon inode object is created at event read time. So some trickery
is added to the userfaultfd_ctx_read() to handle the ctx queues' locking
vs file creation.
Another thing worth noticing is that the task that fork()-s waits for
the uffd event to get processed WITHOUT the mmap sem.
[aarcange@redhat.com: build warning fix]
Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:24 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: report all available features to userland
This will allow userland to probe all features available in the kernel.
It will however only enable the requested features in the open userfaultfd
context.
Link: http://lkml.kernel.org/r/20161216144821.5183-8-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:21 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor
The custom events are queued in ctx->event_wqh not to disturb the
fast-path-ed PF queue-wait-wakeup functions.
The events to be generated (other than PF-s) are requested in UFFD_API
ioctl with the uffd_api.features bits. Those, known by the kernel, are
then turned on and reported back to the user-space.
Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Wed, 22 Feb 2017 23:42:18 +0000 (15:42 -0800)]
userfaultfd: non-cooperative: Split the find_userfault() routine
I will need one to lookup for userfaultfd_wait_queue-s in different
wait queue
Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:15 +0000 (15:42 -0800)]
userfaultfd: use vma_is_anonymous
Cleanup the vma->vm_ops usage.
Side note: it would be more robust if vma_is_anonymous() would also
check that vm_flags hasn't VM_PFNMAP set.
Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:12 +0000 (15:42 -0800)]
userfaultfd: convert BUG() to WARN_ON_ONCE()
Avoid BUG_ON()s and only WARN instead. This is just a cleanup, it can't
make any runtime difference. This BUG_ON has never triggered and cannot
trigger.
Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:09 +0000 (15:42 -0800)]
userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP
Minor comment correction.
Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Wed, 22 Feb 2017 23:42:06 +0000 (15:42 -0800)]
userfaultfd: document _IOR/_IOW
Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2
These userfaultfd features are finished and are ready for larger
exposure in -mm and upstream merging.
1) tmpfs non present userfault
2) hugetlbfs non present userfault
3) non cooperative userfault for fork/madvise/mremap
qemu development code is already exercising 2) and container postcopy
live migration needs 3).
1) is not currently used but there's a self test and we know some qemu
user for various reasons uses tmpfs as backing for KVM so it'll need it
too to use postcopy live migration with tmpfs memory.
All review feedback from the previous submit has been handled and the
fixes are included. There's no outstanding issue AFIK.
Upstream code just did a s/fe/vmf/ conversion in the page faults and
this has been converted as well incrementally.
In addition to the previous submits, this also wakes up stuck userfaults
during UFFDIO_UNREGISTER. The non cooperative testcase actually
reproduced this problem by getting stuck instead of quitting clean in
some rare case as it could call UFFDIO_UNREGISTER while some userfault
could be still in flight. The other option would have been to keep
leaving it up to userland to serialize itself and to patch the testcase
instead but the wakeup during unregister I think is preferable.
David also asked the UFFD_FEATURE_MISSING_HUGETLBFS and
UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid
to probe if the hugetlbfs/shmem missing support is available by calling
UFFDIO_REGISTER. QEMU already checks HUGETLBFS_MAGIC with fstatfs so if
UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows UFFDIO_REGISTER
will succeed (or if it fails, it's for some other more concerning
reason). There's no reason to worry about adding too many feature
flags. There are 64 available and worst case we've to bump the API if
someday we're really going to run out of them.
The round-trip network latency of hugetlbfs userfaults during postcopy
live migration is still of the order of dozen milliseconds on 10GBit if
at 2MB hugepage granularity so it's working perfectly and it should
provide for higher bandwidth or lower CPU usage (which makes it
interesting to add an option in the future to support THP granularity
too for anonymous memory, UFFDIO_COPY would then have to create THP if
alignment/len allows for it). 1GB hugetlbfs granularity will require
big changes in hugetlbfs to work so it's deferred for later.
This patch (of 42):
This adds proper documentation (inline) to avoid the risk of further
misunderstandings about the semantics of _IOW/_IOR and it also reminds
whoever will bump the UFFDIO_API in the future, to change the two ioctl
to _IOW.
This was found while implementing strace support for those ioctl,
otherwise we could have never found it by just reviewing kernel code and
testing it.
_IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
it's only worth fixing if the UFFDIO_API is bumped someday.
Link: http://lkml.kernel.org/r/20161216144821.5183-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:42:03 +0000 (15:42 -0800)]
oom, trace: add compaction retry tracepoint
Higher order requests oom debugging is currently quite hard. We do have
some compaction points which can tell us how the compaction is operating
but there is no trace point to tell us about compaction retry logic.
This patch adds a one which will have the following format
bash-3126 [001] .... 1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0
we can see that the order 9 request is not retried even though we are in
the highest compaction priority mode becase the last compaction attempt
was withdrawn. This means that compaction_zonelist_suitable must have
returned false and there is no suitable zone to compact for this request
and so no need to retry further.
another example would be
<...>-3137 [001] .... 81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
in this case the order-9 compaction failed to find any suitable block.
We do not retry anymore because this is a costly request and those do
not go below COMPACT_PRIO_SYNC_LIGHT priority.
Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:42:00 +0000 (15:42 -0800)]
oom, trace: add oom detection tracepoints
should_reclaim_retry is the central decision point for declaring the
OOM. It might be really useful to expose data used for this decision
making when debugging an unexpected oom situations.
Say we have an OOM report:
[ 52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
[ 52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G W
4.8.0-oomtrace3-00006-gb21338b386d2 #1024
Now we can check the tracepoint data to see how we have ended up in this
situation:
mem_eater-3148 [003] .... 52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
mem_eater-3148 [003] .... 52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
mem_eater-3148 [003] .... 52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
mem_eater-3148 [003] .... 52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
mem_eater-3148 [003] .... 52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
mem_eater-3148 [003] .... 52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
mem_eater-3148 [003] .... 52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
mem_eater-3148 [003] .... 52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
mem_eater-3148 [003] .... 52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0
The above shows that we can quickly deduce that the reclaim stopped
making any progress (see no_progress_loops increased in each round) and
while there were still some 51 reclaimable pages they couldn't be
dropped for some reason (vmscan trace points would tell us more about
that part). available will represent reclaimable + free_pages scaled
down per no_progress_loops factor. This is essentially an optimistic
estimate of how much memory we would have when reclaiming everything.
This can be compared to min_wmark to get a rought idea but the
wmark_check tells the result of the watermark check which is more
precise (includes lowmem reserves, considers the order etc.). As we can
see no zone is eligible in the end and that is why we have triggered the
oom in this situation.
Please note that higher order requests might fail on the wmark_check
even when there is much more memory available than min_wmark - e.g.
when the memory is fragmented. A follow up tracepoint will help to
debug those situations.
Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:41:57 +0000 (15:41 -0800)]
mm, trace: extract COMPACTION_STATUS and ZONE_TYPE to a common header
COMPACTION_STATUS resp. ZONE_TYPE are currently used to translate enum
compact_result resp. struct zone index into their symbolic names for an
easier post processing. The follow up patch would like to reuse this as
well. The code involves some preprocessor black magic which is better not
duplicated elsewhere so move it to a common mm tracing relate header.
Link: http://lkml.kernel.org/r/20161220130135.15719-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geliang Tang [Wed, 22 Feb 2017 23:41:54 +0000 (15:41 -0800)]
mm/vmalloc.c: use rb_entry_safe
Use rb_entry_safe() instead of open-coding it.
Link: http://lkml.kernel.org/r/81bb9820e5b9e4a1c596b3e76f88abf8c4a76cb0.1482221947.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:51 +0000 (15:41 -0800)]
mm, page_alloc: avoid page_to_pfn() when merging buddies
On architectures that allow memory holes, page_is_buddy() has to perform
page_to_pfn() to check for the memory hole. After the previous patch,
we have the pfn already available in __free_one_page(), which is the
only caller of page_is_buddy(), so move the check there and avoid
page_to_pfn().
Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:48 +0000 (15:41 -0800)]
mm, page_alloc: don't convert pfn to idx when merging
In __free_one_page() we do the buddy merging arithmetics on "page/buddy
index", which is just the lower MAX_ORDER bits of pfn. The operations
we do that affect the higher bits are bitwise AND and subtraction (in
that order), where the final result will be the same with the higher
bits left unmasked, as long as these bits are equal for both buddies -
which must be true by the definition of a buddy.
We can therefore use pfn's directly instead of "index" and skip the
zeroing of >MAX_ORDER bits. This can help a bit by itself, although
compiler might be smart enough already. It also helps the next patch to
avoid page_to_pfn() for memory hole checks.
Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 22 Feb 2017 23:41:45 +0000 (15:41 -0800)]
mm: throttle show_mem() from warn_alloc()
Tetsuo has been stressing OOM killer path with many parallel allocation
requests when he has noticed that it is not all that hard to swamp
kernel logs with warn_alloc messages caused by allocation stalls. Even
though the allocation stall message is triggered only once in 10s there
might be many different tasks hitting it roughly around the same time.
A big part of the output is show_mem() which can generate a lot of
output even on a small machines. There is no reason to show the state
of memory counter for each allocation stall, especially when multiple of
them are reported in a short time period. Chances are that not much has
changed since the last report. This patch simply rate limits show_mem
called from warn_alloc to only dump something once per second. This
should be enough to give us a clue why an allocation might be stalling
while burst of warnings will not swamp log with too much data.
While we are at it, extract all the show_mem related handling (filters)
into a separate function warn_alloc_show_mem. This will make the code
cleaner and as a bonus point we can distinguish which part of warn_alloc
got throttled due to rate limiting as ___ratelimit dumps the caller.
[akpm@linux-foundation.org: reduce scope of the ratelimit_states]
Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Wed, 22 Feb 2017 23:41:42 +0000 (15:41 -0800)]
tmpfs: change shmem_mapping() to test shmem_aops
Callers of shmem_mapping() are interested in whether the mapping is swap
backed - except for uprobes, which is interested in whether it should
use shmem_read_mapping_page(). All these callers are better served by a
shmem_mapping() which checks for shmem_aops, than the current version
which goes through several indirections to find where the inode lives -
and has the surprising effect that a private mmap of /dev/zero satisfies
both vma_is_anonymous() and shmem_mapping(), when that device node is on
devtmpfs. I don't think anything in the tree suffers from that
surprise, but it caught me out, and is better fixed.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052148530.13021@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:39 +0000 (15:41 -0800)]
slub: make sysfs directories for memcg sub-caches optional
SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
bunch of debug files. Usually, there aren't that many caches on a
system and this doesn't really matter; however, if memcg is in use, each
cache can have per-cgroup sub-caches. SLUB creates the same directories
for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
Unfortunately, because there can be a lot of cgroups, active or
draining, the product of the numbers of caches, cgroups and files in
each directory can reach a very high number - hundreds of thousands is
commonplace. Millions and beyond aren't difficult to reach either.
What's under /sys/kernel/slab is primarily for debugging and the
information and control on the a root cache already cover its
sub-caches. While having a separate directory for each sub-cache can be
helpful for development, it doesn't make much sense to pay this amount
of overhead by default.
This patch introduces a boot parameter slub_memcg_sysfs which determines
whether to create sysfs directories for per-memcg sub-caches. It also
adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
default value and defaults to 0.
[akpm@linux-foundation.org: kset_unregister(NULL) is legal]
Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:36 +0000 (15:41 -0800)]
slab: use memcg_kmem_cache_wq for slab destruction operations
If there's contention on slab_mutex, queueing the per-cache destruction
work item on the system_wq can unnecessarily create and tie up a lot of
kworkers.
Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
global and use that workqueue for the destruction work items too. While
at it, convert the workqueue from an unbound workqueue to a per-cpu one
with concurrency limited to 1. It's generally preferable to use per-cpu
workqueues and concurrency limit of 1 is safe enough.
This is suggested by Joonsoo Kim.
Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:33 +0000 (15:41 -0800)]
slab: remove slub sysfs interface files early for empty memcg caches
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
Each cache has a number of sysfs interface files under /sys/kernel/slab.
On a system with a lot of memory and transient memcgs, the number of
interface files which have to be removed once memory reclaim kicks in
can reach millions.
Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:30 +0000 (15:41 -0800)]
slab: remove synchronous synchronize_sched() from memcg cache deactivation path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't scale
when a huge number of caches are destroyed back-to-back. While there
used to be a simple batching mechanism, the batching was too restricted
to be helpful.
This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex. While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.
Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:27 +0000 (15:41 -0800)]
slab: introduce __kmemcg_cache_deactivate()
__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches. Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.
This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().
This is pure reorganization and doesn't introduce any observable
behavior changes.
v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:24 +0000 (15:41 -0800)]
slab: implement slab_root_caches list
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches. As there can be a huge number of memcg
caches, this can become very expensive.
This also can make /proc/slabinfo behave very badly. seq_file processes
reads in 4k chunks and seeks to the previous Nth position on slab_caches
list to resume after each chunk. With a lot of memcg cache churns on
the list, reading /proc/slabinfo can become very slow and its content
often ends up with duplicate and/or missing entries.
This patch adds a new list slab_root_caches which lists only the root
caches. When memcg is not enabled, it becomes just an alias of
slab_caches. memcg specific list operations are collected into
memcg_[un]link_cache().
Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:21 +0000 (15:41 -0800)]
slab: link memcg kmem_caches on their associated memory cgroup
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
While a memcg kmem_cache is listed on its root cache's ->children list,
there is no direct way to iterate all kmem_caches which are assocaited
with a memory cgroup. The only way to iterate them is walking all
caches while filtering out caches which don't match, which would be most
of them.
This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge. This combined with the
synchronous RCU operations can tie up a CPU and affect the whole machine
for many hours when memory reclaim triggers offlining and destruction of
the stale memcgs.
This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg. All memcg specific iterations, including
stat file access, are updated to use the new list instead.
Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:17 +0000 (15:41 -0800)]
slab: reorganize memcg_cache_params
We're going to change how memcg caches are iterated. In preparation,
clean up and reorganize memcg_cache_params.
* The shared ->list is replaced by ->children in root and
->children_node in children.
* ->is_root_cache is removed. Instead ->root_cache is moved out of
the child union and now used by both root and children. NULL
indicates root cache. Non-NULL a memcg one.
This patch doesn't cause any observable behavior changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:14 +0000 (15:41 -0800)]
slab: remove synchronous rcu_barrier() call in memcg cache release path
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
SLAB_DESTORY_BY_RCU caches need to flush all RCU operations before
destruction because slab pages are freed through RCU and they need to be
able to dereference the associated kmem_cache. Currently, it's done
synchronously with rcu_barrier(). As rcu_barrier() is expensive
time-wise, slab implements a batching mechanism so that rcu_barrier()
can be done for multiple caches at the same time.
Unfortunately, the rcu_barrier() is in synchronous path which is called
while holding cgroup_mutex and the batching is too limited to be
actually helpful.
This patch updates the cache release path so that the batching is
asynchronous and global. All SLAB_DESTORY_BY_RCU caches are queued
globally and a work item consumes the list. The work item calls
rcu_barrier() only once for all caches that are currently queued.
* release_caches() is removed and shutdown_cache() now either directly
release the cache or schedules a RCU callback to do that. This
makes the cache inaccessible once shutdown_cache() is called and
makes it impossible for shutdown_memcg_caches() to do memcg-specific
cleanups afterwards. Move memcg-specific part into a helper,
unlink_memcg_cache(), and make shutdown_cache() call it directly.
Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:11 +0000 (15:41 -0800)]
slub: separate out sysfs_slab_release() from sysfs_slab_remove()
Separate out slub sysfs removal and release, and call the former earlier
from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
through RCU and this will later allow us to remove sysfs files way
earlier during memory cgroup offline instead of release.
Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun Heo [Wed, 22 Feb 2017 23:41:08 +0000 (15:41 -0800)]
Revert "slub: move synchronize_sched out of slab_mutex on shrink"
Patch series "slab: make memcg slab destruction scalable", v3.
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.
I've seen machines which end up with hundred thousands of caches and
many millions of kernfs_nodes. The current code is O(N^2) on the total
number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex. Combined, this leads to very expensive and
slow cache destruction operations which can easily keep running for half
a day.
This also messes up /proc/slabinfo along with other cache iterating
operations. seq_file operates on 4k chunks and on each 4k boundary
tries to seek to the last position in the list. With a huge number of
caches on the list, this becomes very slow and very prone to the list
content changing underneath it leading to a lot of missing and/or
duplicate entries.
This patchset addresses the scalability problem.
* Add root and per-memcg lists. Update each user to use the
appropriate list.
* Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
and asynchronous.
* For dying empty slub caches, remove the sysfs files after
deactivation so that we don't end up with millions of sysfs files
without any useful information on them.
This patchset contains the following nine patches.
0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
0004-slab-reorganize-memcg_cache_params.patch
0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
0006-slab-implement-slab_root_caches-list.patch
0007-slab-introduce-__kmemcg_cache_deactivate.patch
0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
0001 reverts an existing optimization to prepare for the following
changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
path batched and asynchronous. 0004-0006 separate out the lists.
0007-0008 replace synchronize_sched() in slub destruction path with
call_rcu_sched(). 0009 removes sysfs files early for empty dying
caches. 0010 makes destruction work items use a workqueue with limited
concurrency.
This patch (of 10):
Revert
89e364db71fb5e ("slub: move synchronize_sched out of slab_mutex on
shrink").
With kmem cgroup support enabled, kmem_caches can be created and destroyed
frequently and a great number of near empty kmem_caches can accumulate if
there are a lot of transient cgroups and the system is not under memory
pressure. When memory reclaim starts under such conditions, it can lead
to consecutive deactivation and destruction of many kmem_caches, easily
hundreds of thousands on moderately large systems, exposing scalability
issues in the current slab management code. This is one of the patches to
address the issue.
Moving synchronize_sched() out of slab_mutex isn't enough as it's still
inside cgroup_mutex. The whole deactivation / release path will be
updated to avoid all synchronous RCU operations. Revert this insufficient
optimization in preparation to ease future changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Wed, 22 Feb 2017 23:41:05 +0000 (15:41 -0800)]
mm, slab: rename kmalloc-node cache to kmalloc-<size>
SLAB as part of its bootstrap pre-creates one kmalloc cache that can fit
the kmem_cache_node management structure, and puts it into the generic
kmalloc cache array (e.g. for 128b objects). The name of this cache is
"kmalloc-node", which is confusing for readers of /proc/slabinfo as the
cache is used for generic allocations (and not just the kmem_cache_node
struct) and it appears as the kmalloc-128 cache is missing.
An easy solution is to use the kmalloc-<size> name when pre-creating the
cache, which we can get from the kmalloc_info array.
Example /proc/slabinfo before the patch:
...
kmalloc-256 1647 1984 256 16 1 : tunables 120 60 8 : slabdata 124 124 828
kmalloc-192 1974 1974 192 21 1 : tunables 120 60 8 : slabdata 94 94 133
kmalloc-96 1332 1344 128 32 1 : tunables 120 60 8 : slabdata 42 42 219
kmalloc-64 2505 5952 64 64 1 : tunables 120 60 8 : slabdata 93 93 715
kmalloc-32 4278 4464 32 124 1 : tunables 120 60 8 : slabdata 36 36 346
kmalloc-node 1352 1376 128 32 1 : tunables 120 60 8 : slabdata 43 43 53
kmem_cache 132 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0
After the patch:
...
kmalloc-256 1672 2160 256 16 1 : tunables 120 60 8 : slabdata 135 135 807
kmalloc-192 1992 2016 192 21 1 : tunables 120 60 8 : slabdata 96 96 203
kmalloc-96 1159 1184 128 32 1 : tunables 120 60 8 : slabdata 37 37 116
kmalloc-64 2561 4864 64 64 1 : tunables 120 60 8 : slabdata 76 76 785
kmalloc-32 4253 4340 32 124 1 : tunables 120 60 8 : slabdata 35 35 270
kmalloc-128 1256 1280 128 32 1 : tunables 120 60 8 : slabdata 40 40 39
kmem_cache 125 147 192 21 1 : tunables 120 60 8 : slabdata 7 7 0
[vbabka@suse.cz: export the whole kmalloc_info structure instead of just a name accessor, per Christoph Lameter]
Link: http://lkml.kernel.org/r/54e80303-b814-4232-66d4-95b34d3eb9d0@suse.cz
Link: http://lkml.kernel.org/r/20170203181008.24898-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Borislav Petkov [Wed, 22 Feb 2017 23:41:02 +0000 (15:41 -0800)]
mm/slub: add a dump_stack() to the unexpected GFP check
We wish to know who is doing such a thing. slab.c does this.
Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Grygorii Maistrenko [Wed, 22 Feb 2017 23:40:59 +0000 (15:40 -0800)]
slub: do not merge cache if slub_debug contains a never-merge flag
In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
commandline but never checks if there are features from the
SLAB_NEVER_MERGE set.
As a result selected by slub_debug caches are always mergeable if they
have been created without a custom constructor set or without one of the
SLAB_* debug features on.
This moves the SLAB_NEVER_MERGE check below the flags update from
commandline to make sure it won't merge the slab cache if one of the debug
features is on.
Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
Signed-off-by: Grygorii Maistrenko <grygoriimkd@gmail.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prarit Bhargava [Wed, 22 Feb 2017 23:40:56 +0000 (15:40 -0800)]
kernel/watchdog.c: do not hardcode CPU 0 as the initial thread
When CONFIG_BOOTPARAM_HOTPLUG_CPU0 is enabled, the socket containing the
boot cpu can be replaced. During the hot add event, the message
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
is output implying that the NMI watchdog was disabled at some point. This
is not the case and the message has caused confusion for users of systems
that support the removal of the boot cpu socket.
The watchdog code is coded to assume that cpu 0 is always the first cpu to
initialize the watchdog, and the last to stop its watchdog thread. That
is not the case for initializing if cpu 0 has been removed and added. The
removal case has never been correct because the smpboot code will remove
the watchdog threads starting with the lowest cpu number.
This patch adds watchdog_cpus to track the number of cpus with active NMI
watchdog threads so that the first and last thread can be used to set and
clear the value of firstcpu_err. firstcpu_err is set when the first
watchdog thread is enabled, and cleared when the last watchdog thread is
disabled.
Link: http://lkml.kernel.org/r/1480425321-32296-1-git-send-email-prarit@redhat.com
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joshua Hunt <johunt@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cong Wang [Wed, 22 Feb 2017 23:40:53 +0000 (15:40 -0800)]
9p: fix a potential acl leak
posix_acl_update_mode() could possibly clear 'acl', if so we leak the
memory pointed by 'acl'. Save this pointer before calling
posix_acl_update_mode() and release the memory if 'acl' really gets
cleared.
Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Mark Salyzyn <salyzyn@android.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Wed, 22 Feb 2017 23:40:50 +0000 (15:40 -0800)]
block: use for_each_thread() in sys_ioprio_set()/sys_ioprio_get()
IOPRIO_WHO_USER case in sys_ioprio_set()/sys_ioprio_get() are using
while_each_thread(), which is unsafe under RCU lock according to commit
0c740d0afc3bff0a ("introduce for_each_thread() to replace the buggy
while_each_thread()"). Use for_each_thread() (via
for_each_process_thread()) which is safe under RCU lock.
Link: http://lkml.kernel.org/r/201702011947.DBD56740.OMVHOLOtSJFFFQ@I-love.SAKURA.ne.jp
Link: http://lkml.kernel.org/r/1486041779-4401-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Wed, 22 Feb 2017 23:40:47 +0000 (15:40 -0800)]
parisc: use generic current.h
Given that the arch does not add its own implementations, simply use the
asm-generic/current.h (generic-y) header instead of duplicating code.
Link: http://lkml.kernel.org/r/1485992878-4780-4-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Ren [Wed, 22 Feb 2017 23:40:44 +0000 (15:40 -0800)]
ocfs2: fix deadlock issue when taking inode lock at vfs entry points
Commit
743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
results in a deadlock, as the author "Tariq Saeed" realized shortly
after the patch was merged. The discussion happened here
https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html
The reason why taking cluster inode lock at vfs entry points opens up a
self deadlock window, is explained in the previous patch of this series.
So far, we have seen two different code paths that have this issue.
1. do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== take PR
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== take PR
2. fchmod|fchmodat
chmod_common
notify_change
ocfs2_setattr <=== take EX
posix_acl_chmod
get_acl
ocfs2_iop_get_acl <=== take PR
ocfs2_iop_set_acl <=== take EX
Fixes them by adding the tracking logic (in the previous patch) for these
funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
ocfs2_setattr().
Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Ren [Wed, 22 Feb 2017 23:40:41 +0000 (15:40 -0800)]
ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
We are in the situation that we have to avoid recursive cluster locking,
but there is no way to check if a cluster lock has been taken by a precess
already.
Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code. For instance:
const struct inode_operations ocfs2_file_iops = {
.permission = ocfs2_permission,
.get_acl = ocfs2_iop_get_acl,
.set_acl = ocfs2_iop_set_acl,
};
Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
do_sys_open
may_open
inode_permission
ocfs2_permission
ocfs2_inode_lock() <=== first time
generic_permission
get_acl
ocfs2_iop_get_acl
ocfs2_inode_lock() <=== recursive one
A deadlock will occur if a remote EX request comes in between two of
ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
the remote EX lock request. Another hand, the recursive cluster lock
(the second one) will be blocked in in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because
there is no chance for the first cluster lock on this node to be
unlocked - we block ourselves in the code path.
The idea to fix this issue is mostly taken from gfs2 code.
1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
of the processes' pid who has taken the cluster lock of this lock
resource;
2. introduce a new flag for ocfs2_inode_lock_full:
OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
us if we've got cluster lock.
3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
got the cluster lock in the upper code path.
The tracking logic should be used by some of the ocfs2 vfs's callbacks,
to solve the recursive locking issue cuased by the fact that vfs
routines can call into each other.
The performance penalty of processing the holder list should only be
seen at a few cases where the tracking logic is used, such as get/set
acl.
You may ask what if the first time we got a PR lock, and the second time
we want a EX lock? fortunately, this case never happens in the real
world, as far as I can see, including permission check,
(get|set)_(acl|attr), and the gfs2 code also do so.
[sfr@canb.auug.org.au remove some inlines]
Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Wed, 22 Feb 2017 23:40:38 +0000 (15:40 -0800)]
score: remove asm/current.h
... it's already using the generic version anyways, so just drop the file
as do the other archs that do not implement their own version of the
current macro.
Link: http://lkml.kernel.org/r/1485992878-4780-5-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Wed, 22 Feb 2017 23:40:35 +0000 (15:40 -0800)]
m32r: fix build warning
Some m32r builds were having a warning:
arch/m32r/include/asm/cmpxchg.h:191:3: warning: value computed is not used
arch/m32r/include/asm/cmpxchg.h:68:3: warning: value computed is not used
Taking the idea from commit
e001bbae7147 ("ARM: cmpxchg: avoid warnings
from macro-ized cmpxchg() implementations") the m32r implementation is
changed to use a similar construct with a compound expression instead of
a typecast, which causes the compiler to not complain about an unused
result.
Link: http://lkml.kernel.org/r/1484432664-7015-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Wed, 22 Feb 2017 23:40:32 +0000 (15:40 -0800)]
m32r: use generic current.h
Given that the arch does not add its own implementations, simply
use the asm-generic/current.h (generic-y) header instead of
duplicating code.
Link: http://lkml.kernel.org/r/1482896994-25863-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hou Tao [Wed, 22 Feb 2017 23:40:29 +0000 (15:40 -0800)]
scripts/tags.sh: include arch/Kconfig* for tags generation
Kconfig files under arch/ directory are ignored by all_kconfigs(),
so include them for tags generation.
Link: http://lkml.kernel.org/r/1486206053-38223-1-git-send-email-houtao1@huawei.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: Michal Marek <mmarek@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Mathieu Maret <mathieu.maret@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cheah Kok Cheong [Wed, 22 Feb 2017 23:40:26 +0000 (15:40 -0800)]
scripts/checkincludes.pl: add exit message for no duplicates found
If no duplicates found, inform user.
Link: http://lkml.kernel.org/r/1486391275-2843-1-git-send-email-thrust73@gmail.com
Signed-off-by: Cheah Kok Cheong <thrust73@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobias Klauser [Wed, 22 Feb 2017 23:40:23 +0000 (15:40 -0800)]
scripts/checkstack.pl: add support for nios2
Adjust checkstack.pl for the nios2 architecture.
Link: http://lkml.kernel.org/r/20170116113052.15034-1-tklauser@distanz.ch
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Ley Foon Tan <lftan@altera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jean Delvare [Wed, 22 Feb 2017 23:40:20 +0000 (15:40 -0800)]
scripts/Lindent: clean up and optimize
* Add a few blank lines to improve readability.
* Don't call cut 3 times when once is enough.
* Drop a useless semicolon.
Link: http://lkml.kernel.org/r/20170104140356.162abab2@endymion
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:40:17 +0000 (15:40 -0800)]
scripts/spelling.txt: fix incorrect typo-words
Fix up some incorrect typo-words.
[akpm@linux-foundation.org: "licencing" is valid British spelling and should be kept, per Joe]
Link: http://lkml.kernel.org/r/1486409689-23335-1-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Wed, 22 Feb 2017 23:40:15 +0000 (15:40 -0800)]
scripts/spelling.txt: add several more common spelling mistakes
Lately I've been cleaning up spelling mistakes in kernel error messages
and here are some of the more common spelling mistakes that I've found
which probably should be added to this list so we don't keep on seeing
them appearing again.
Link: http://lkml.kernel.org/r/20161209173326.17662-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Thompson [Wed, 22 Feb 2017 23:40:12 +0000 (15:40 -0800)]
tools/vm: add missing Makefile rules
Currently the tools/vm Makefile has a rather arbitrary implicit build
rule; page-types is the first value in TARGETS so lets just build that
one! Additionally there is no install rule and this is needed for make -C
tools vm_install to work properly.
Provide a more sensible implicit build rule and a new install rule.
Note that the variables names used by the install rule (DESTDIR and
sbindir) are copied from prior-art in tools/power/cpupower.
Link: http://lkml.kernel.org/r/20170113165630.27541-1-daniel.thompson@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Chen [Wed, 22 Feb 2017 23:40:09 +0000 (15:40 -0800)]
dma-debug: add comment for failed to check map error
Add comment for failure to check a map error to help driver developers.
Link: http://lkml.kernel.org/r/1484622289-22085-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Jiang [Wed, 22 Feb 2017 23:40:06 +0000 (15:40 -0800)]
mm, dax: change pmd_fault() to take only vmf parameter
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.
Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Jiang [Wed, 22 Feb 2017 23:40:03 +0000 (15:40 -0800)]
mm, dax: make pmd_fault() and friends be the same as fault()
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:40:00 +0000 (15:40 -0800)]
dax: add tracepoints to dax_pmd_insert_mapping()
Add tracepoints to dax_pmd_insert_mapping(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().
Here is an example PMD fault showing the new tracepoints:
big-1504 [001] .... 326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
big-1504 [001] .... 326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
big-1504 [001] .... 326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV|MAP radix_entry 0xc000e
big-1504 [001] .... 326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/1484085142-2297-6-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:39:57 +0000 (15:39 -0800)]
dax: add tracepoints to dax_pmd_load_hole()
Add tracepoints to dax_pmd_load_hole(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().
Here is an example PMD fault showing the new tracepoints:
read_big-1478 [004] .... 238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
read_big-1478 [004] .... 238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400
read_big-1478 [004] .... 238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page
ffffea0002c20000 radix_entry 0x1e
read_big-1478 [004] .... 238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/1484085142-2297-5-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:39:54 +0000 (15:39 -0800)]
dax: update MAINTAINERS entries for FS DAX
Add the new include/trace/events/fs_dax.h tracepoint header, the existing
include/linux/dax.h header, update Matthew's email address and add myself
as a maintainer for filesystem DAX.
Link: http://lkml.kernel.org/r/1484085142-2297-4-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:39:50 +0000 (15:39 -0800)]
dax: add tracepoint infrastructure, PMD tracing
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems. Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler. This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.
I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages
to be wrapped by the parent function tracepoints so the code flow is more
easily understood. Having entry and exit tracepoints for faults also
allows us to easily see what filesystems functions were called during the
fault. These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.
For PMD faults we primarily want to understand the type of mapping, the
fault flags, the faulting address and whether it fell back to 4k faults.
If it fell back to 4k faults the tracepoints should let us understand why.
I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.
Here is an example output for these events from a successful PMD fault:
big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ross Zwisler [Wed, 22 Feb 2017 23:39:47 +0000 (15:39 -0800)]
tracing: add __print_flags_u64()
Patch series "DAX tracepoints, mm argument simplification", v4.
This contains both my DAX tracepoint code and Dave Jiang's MM argument
simplifications. Dave's code was written with my tracepoint code as a
baseline, so it seemed simplest to keep them together in a single series.
This patch (of 7):
Add __print_flags_u64() and the helper trace_print_flags_seq_u64() in the
same spirit as __print_symbolic_u64() and trace_print_symbols_seq_u64().
These functions allow us to print symbols associated with flags that are
64 bits wide even on 32 bit machines.
These will be used by the DAX code so that we can print the flags set in a
pfn_t such as PFN_SG_CHAIN, PFN_SG_LAST, PFN_DEV and PFN_MAP.
Without this new function I was getting errors like the following when
compiling for i386:
include/linux/pfn_t.h:13:22: warning: large integer implicitly truncated to unsigned type [-Woverflow]
#define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
^
Link: http://lkml.kernel.org/r/1484085142-2297-2-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 22 Feb 2017 20:17:25 +0000 (12:17 -0800)]
Merge tag 'tty-4.11-rc1' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big tty/serial driver patchset for 4.11-rc1.
Not much here, but a lot of little fixes and individual serial driver
updates all over the subsystem. Majority are for the sh-sci driver and
platform (the arch-specific changes have acks from the maintainer).
The start of the "serial bus" code is here as well, but nothing is
converted to use it yet. That work is still ongoing, hopefully will
start to show up across different subsystems for 4.12 (bluetooth is
one major place that will be used.)
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (109 commits)
tty: pl011: Work around QDF2400 E44 stuck BUSY bit
atmel_serial: Use the fractional divider when possible
tty: Remove extra include in HVC console tty framework
serial: exar: Enable MSI support
serial: exar: Move register defines from uapi header to consumer site
serial: pci: Remove unused pci_boards entries
serial: exar: Move Commtech adapters to 8250_exar as well
serial: exar: Fix feature control register constants
serial: exar: Fix initialization of EXAR registers for ports > 0
serial: exar: Fix mapping of port I/O resources
serial: sh-sci: fix hardware RX trigger level setting
tty/serial: atmel: ensure state is restored after suspending
serial: 8250_dw: Avoid "too much work" from bogus rx timeout interrupt
serdev: ttyport: check whether tty_init_dev() fails
serial: 8250_pci: make pciserial_detach_ports() static
ARM: dts: STiH410-b2260: Enable HW flow-control
ARM: dts: STiH407-family: Use new Pinctrl groups
ARM: dts: STiH407-pinctrl: Add Pinctrl group for HW flow-control
ARM: dts: STiH410-b2260: Identify the UART RTS line
dt-bindings: serial: Update 'uart-has-rtscts' description
...
Linus Torvalds [Wed, 22 Feb 2017 20:14:01 +0000 (12:14 -0800)]
Merge tag 'staging-4.11-rc1' of git://git./linux/kernel/git/gregkh/staging
Pull staging/iio driver updates from Greg KH:
"Here is the big staging and iio driver patchsets for 4.11-rc1.
We almost broke even this time around, with only a few thousand lines
added overall, as we removed the old and obsolete i4l code, but added
some new drivers for the RPi platform, as well as adding some new IIO
drivers.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (669 commits)
Staging: vc04_services: Fix the "space prohibited" code style errors
Staging: vc04_services: Fix the "wrong indent" code style errors
staging: octeon: Use net_device_stats from struct net_device
Staging: rtl8192u: ieee80211: ieee80211.h - style fix
Staging: rtl8192u: ieee80211: ieee80211_tx.c - style fix
Staging: rtl8192u: ieee80211: rtl819x_BAProc.c - style fix
Staging: rtl8192u: ieee80211: ieee80211_module.c - style fix
Staging: rtl8192u: ieee80211: rtl819x_TSProc.c - style fix
Staging: rtl8192u: r8192U.h - style fix
Staging: rtl8192u: r8192U_core.c - style fix
Staging: rtl8192u: r819xU_cmdpkt.c - style fix
staging: rtl8192u: blank lines aren't necessary before a close brace '}'
staging: rtl8192u: Adding space after enum and struct definition
staging: rtl8192u: Adding space after struct definition
Staging: ks7010: Add required and preferred spaces around operators
Staging: ks7010: ks*: Remove redundant blank lines
Staging: ks7010: ks*: Add missing blank lines after declarations
staging: visorbus, replace init_timer with setup_timer
staging: vt6656: rxtx.c Removed multiple dereferencing
staging: vt6656: Alignment match open parenthesis
...
Linus Torvalds [Wed, 22 Feb 2017 19:44:32 +0000 (11:44 -0800)]
Merge tag 'driver-core-4.11-rc1' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "small" driver core patches for 4.11-rc1.
Not much here, some firmware documentation and self-test updates, a
debugfs code formatting issue, and a new feature for call_usermodehelper
to make it more robust on systems that want to lock it down in a more
secure way.
All of these have been linux-next for a while now with no reported
issues"
* tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: handle null pointers while printing node name and path
Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()
Make static usermode helper binaries constant
kmod: make usermodehelper path a const string
firmware: revamp firmware documentation
selftests: firmware: send expected errors to /dev/null
selftests: firmware: only modprobe if driver is missing
platform: Print the resource range if device failed to claim
kref: prefer atomic_inc_not_zero to atomic_add_unless
debugfs: improve formatting of debugfs_real_fops()
Linus Torvalds [Wed, 22 Feb 2017 19:38:22 +0000 (11:38 -0800)]
Merge tag 'char-misc-4.11-rc1' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big char/misc driver patchset for 4.11-rc1.
Lots of different driver subsystems updated here: rework for the
hyperv subsystem to handle new platforms better, mei and w1 and extcon
driver updates, as well as a number of other "minor" driver updates.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (169 commits)
goldfish: Sanitize the broken interrupt handler
x86/platform/goldfish: Prevent unconditional loading
vmbus: replace modulus operation with subtraction
vmbus: constify parameters where possible
vmbus: expose hv_begin/end_read
vmbus: remove conditional locking of vmbus_write
vmbus: add direct isr callback mode
vmbus: change to per channel tasklet
vmbus: put related per-cpu variable together
vmbus: callback is in softirq not workqueue
binder: Add support for file-descriptor arrays
binder: Add support for scatter-gather
binder: Add extra size to allocator
binder: Refactor binder_transact()
binder: Support multiple /dev instances
binder: Deal with contexts in debugfs
binder: Support multiple context managers
binder: Split flat_binder_object
auxdisplay: ht16k33: remove private workqueue
auxdisplay: ht16k33: rework input device initialization
...
Linus Torvalds [Wed, 22 Feb 2017 19:15:59 +0000 (11:15 -0800)]
Merge tag 'usb-4.11-rc1' of git://git./linux/kernel/git/gregkh/usb
Pull USB/PHY updates from Greg KH:
"Here is the big USB and PHY driver updates for 4.11-rc1.
Nothing major, just the normal amount of churn in the usb gadget and
dwc and xhci controllers, new device ids, new phy drivers, a new
usb-serial driver, and a few other minor changes in different USB
drivers.
All have been in linux-next for a long time with no reported issues"
* tag 'usb-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (265 commits)
usb: cdc-wdm: remove logically dead code
USB: serial: keyspan: drop header file
USB: serial: io_edgeport: drop io-tables header file
usb: musb: add code comment for clarification
usb: misc: add USB251xB/xBi Hi-Speed Hub Controller Driver
usb: misc: usbtest: remove redundant check on retval < 0
USB: serial: upd78f0730: sort device ids
USB: serial: upd78f0730: add ID for EVAL-ADXL362Z
ohci-hub: fix typo in dbg_port macro
usb: musb: dsps: Manage CPPI 4.1 DMA interrupt in DSPS
usb: musb: tusb6010: Clean up tusb_omap_dma structure
usb: musb: cppi_dma: Clean up cppi41_dma_controller structure
usb: musb: cppi_dma: Clean up cppi structure
usb: musb: cppi41: Detect aborted transfers in cppi41_dma_callback()
usb: musb: dma: Add a DMA completion platform callback
drivers: usb: usbip: Add missing break statement to switch
usb: mtu3: remove redundant dev_err call in get_ssusb_rscs()
USB: serial: mos7840: fix another NULL-deref at open
USB: serial: console: clean up sanity checks
USB: serial: console: fix uninitialised spinlock
...
Linus Torvalds [Wed, 22 Feb 2017 18:46:44 +0000 (10:46 -0800)]
Merge tag 'arm64-upstream' of git://git./linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
- Errata workarounds for Qualcomm's Falkor CPU
- Qualcomm L2 Cache PMU driver
- Qualcomm SMCCC firmware quirk
- Support for DEBUG_VIRTUAL
- CPU feature detection for userspace via MRS emulation
- Preliminary work for the Statistical Profiling Extension
- Misc cleanups and non-critical fixes
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits)
arm64/kprobes: consistently handle MRS/MSR with XZR
arm64: cpufeature: correctly handle MRS to XZR
arm64: traps: correctly handle MRS/MSR with XZR
arm64: ptrace: add XZR-safe regs accessors
arm64: include asm/assembler.h in entry-ftrace.S
arm64: fix warning about swapper_pg_dir overflow
arm64: Work around Falkor erratum 1003
arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2
arm64: arch_timer: document Hisilicon erratum
161010101
arm64: use is_vmalloc_addr
arm64: use linux/sizes.h for constants
arm64: uaccess: consistently check object sizes
perf: add qcom l2 cache perf events driver
arm64: remove wrong CONFIG_PROC_SYSCTL ifdef
ARM: smccc: Update HVC comment to describe new quirk parameter
arm64: do not trace atomic operations
ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device()
ACPI/IORT: Fix iort_node_get_id() mapping entries indexing
arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA
perf: xgene: Include module.h
...
Linus Torvalds [Wed, 22 Feb 2017 18:33:53 +0000 (10:33 -0800)]
Merge tag 'arc-4.11-rc1' of git://git./linux/kernel/git/vgupta/arc
Pull ARC updates from Vineet Gupta:
- Intc imporvements [Yuriy]
- VDK platform updates [Alexey]
* tag 'arc-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-*] ARC_HAS_COH_CACHES no longer relevant
ARCv2: intc: Delete useless comments in Device Trees
ARCv2: IDU-intc: Delete deprecated parameters in Device Trees
ARCv2: IDU-intc: mask all common interrupts by default
ARCv2: IDU-intc: Use build registers for getting numbers of interrupts
ARCv2: intc: Set default priority for all core interrupts
ARCv2: intc: Use runtime value of irq count for setting up intc
ARCv2: intc: Rework the build time irq count information
ARC: [intc-*]: confine NR_CPU_IRQS to intc code
ARCv2: intc: Use ARC_REG_STATUS32 for addressing STATUS32 reg
arc: vdk: Add support of UIO
arc: vdk: Add support of MMC controller
arc: vdk: Disable halt on reset
Linus Torvalds [Wed, 22 Feb 2017 18:30:38 +0000 (10:30 -0800)]
Merge tag 'powerpc-4.11-1' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for direct mapped LPC on POWER9, giving Linux direct access
to devices that may be on there such as a UART.
- Memory hotplug support for the Power9 Radix MMU.
- Add new AUX vectors describing the processor's cache geometry, to
be used by glibc.
- The ability for a guest to ask the hypervisor to resize the guest's
hash table, and in addition support for doing so automatically when
memory is hotplugged into/out-of the guest. This allows the hash
table to be sized based on the current memory usage of the guest,
rather than the maximum possible memory usage.
- Implementation of optprobes (kprobe optimisation) for powerpc.
In addition there's the topic branch shared with the KVM tree, which
includes support for guests to use the Radix MMU on Power9.
Thanks to:
Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton
Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens,
Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin
Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan,
Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza
Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun"
* tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits)
powerpc/mm/radix: Skip ptesync in pte update helpers
powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm
powerpc/mm/radix: Update pte update sequence for pte clear case
powerpc/mm: Update PROTFAULT handling in the page fault path
powerpc/xmon: Fix data-breakpoint
powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y
powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n
powerpc/pseries: Fix typo in parameter description
powerpc/kprobes: Remove kprobe_exceptions_notify()
kprobes: Introduce weak variant of kprobe_exceptions_notify()
powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL
powerpc/powernv: Fix opal_exit tracepoint opcode
powerpc: Add a prototype for mcount() so it can be versioned
powerpc: Drop GPL from of_node_to_nid() export to match other arches
powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()
powerpc/kprobes: Implement Optprobes
powerpc/kprobes: Fixes for kprobe_lookup_name() on BE
powerpc: Add helper to check if offset is within relative branch range
powerpc/bpf: Introduce __PPC_SH64()
...
Linus Torvalds [Wed, 22 Feb 2017 18:20:04 +0000 (10:20 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
- New entropy generation for the pseudo random number generator.
- Early boot printk output via sclp to help debug crashes on boot. This
needs to be enabled with a kernel parameter.
- Add proper no-execute support with a bit in the page table entry.
- Bug fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (65 commits)
s390/syscall: fix single stepped system calls
s390/zcrypt: make ap_bus explicitly non-modular
s390/zcrypt: Removed unneeded debug feature directory creation.
s390: add missing "do {} while (0)" loop constructs to multiline macros
s390/mm: add cond_resched call to kernel page table dumper
s390: get rid of MACHINE_HAS_PFMF and MACHINE_HAS_HPAGE
s390/mm: make memory_block_size_bytes available for !MEMORY_HOTPLUG
s390: replace ACCESS_ONCE with READ_ONCE
s390: Audit and remove any remaining unnecessary uses of module.h
s390: mm: Audit and remove any unnecessary uses of module.h
s390: kernel: Audit and remove any unnecessary uses of module.h
s390/kdump: Use "LINUX" ELF note name instead of "CORE"
s390: add no-execute support
s390: report new vector facilities
s390: use correct input data address for setup_randomness
s390/sclp: get rid of common response code handling
s390/sclp: don't add new lines to each printed string
s390/sclp: make early sclp code readable
s390/sclp: disable early sclp code as soon as the base sclp driver is active
s390/sclp: move early printk code to drivers
...
Linus Torvalds [Wed, 22 Feb 2017 18:15:09 +0000 (10:15 -0800)]
Merge git://git./linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
"Highlights:
1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
Varadhan.
2) Simplify classifier state on sk_buff in order to shrink it a bit.
From Willem de Bruijn.
3) Introduce SIPHASH and it's usage for secure sequence numbers and
syncookies. From Jason A. Donenfeld.
4) Reduce CPU usage for ICMP replies we are going to limit or
suppress, from Jesper Dangaard Brouer.
5) Introduce Shared Memory Communications socket layer, from Ursula
Braun.
6) Add RACK loss detection and allow it to actually trigger fast
recovery instead of just assisting after other algorithms have
triggered it. From Yuchung Cheng.
7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.
8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.
9) Export MPLS packet stats via netlink, from Robert Shearman.
10) Significantly improve inet port bind conflict handling, especially
when an application is restarted and changes it's setting of
reuseport. From Josef Bacik.
11) Implement TX batching in vhost_net, from Jason Wang.
12) Extend the dummy device so that VF (virtual function) features,
such as configuration, can be more easily tested. From Phil
Sutter.
13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
Dumazet.
14) Add new bpf MAP, implementing a longest prefix match trie. From
Daniel Mack.
15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.
16) Add new aquantia driver, from David VomLehn.
17) Add bpf tracepoints, from Daniel Borkmann.
18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
Florian Fainelli.
19) Remove custom busy polling in many drivers, it is done in the core
networking since 4.5 times. From Eric Dumazet.
20) Support XDP adjust_head in virtio_net, from John Fastabend.
21) Fix several major holes in neighbour entry confirmation, from
Julian Anastasov.
22) Add XDP support to bnxt_en driver, from Michael Chan.
23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.
24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.
25) Support GRO in IPSEC protocols, from Steffen Klassert"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
Revert "ath10k: Search SMBIOS for OEM board file extension"
net: socket: fix recvmmsg not returning error from sock_error
bnxt_en: use eth_hw_addr_random()
bpf: fix unlocking of jited image when module ronx not set
arch: add ARCH_HAS_SET_MEMORY config
net: napi_watchdog() can use napi_schedule_irqoff()
tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
net/hsr: use eth_hw_addr_random()
net: mvpp2: enable building on 64-bit platforms
net: mvpp2: switch to build_skb() in the RX path
net: mvpp2: simplify MVPP2_PRS_RI_* definitions
net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
net: mvpp2: remove unused register definitions
net: mvpp2: simplify mvpp2_bm_bufs_add()
net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
net: mvpp2: release reference to txq_cpu[] entry after unmapping
net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
...
Linus Torvalds [Wed, 22 Feb 2017 17:05:47 +0000 (09:05 -0800)]
Merge tag 'gcc-plugins-v4.11-rc1' of git://git./linux/kernel/git/kees/linux
Pull gcc-plugins updates from Kees Cook:
"This includes infrastructure updates and the structleak plugin, which
performs forced initialization of certain structures to avoid possible
information exposures to userspace.
Summary:
- infrastructure updates (gcc-common.h)
- introduce structleak plugin for forced initialization of some
structures"
* tag 'gcc-plugins-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: Add structleak for more stack initialization
gcc-plugins: consolidate on PASS_INFO macro
gcc-plugins: add PASS_INFO and build_const_char_string()
Kees Cook [Wed, 22 Feb 2017 05:12:57 +0000 (21:12 -0800)]
Merge branch 'for-next/gcc-plugin/structleak' into for-linus/gcc-plugins
Kees Cook [Wed, 22 Feb 2017 05:12:38 +0000 (21:12 -0800)]
Merge branch 'for-next/gcc-plugin-infrastructure' into for-linus/gcc-plugins
Linus Torvalds [Wed, 22 Feb 2017 01:56:45 +0000 (17:56 -0800)]
Merge tag 'rodata-v4.11-rc1' of git://git./linux/kernel/git/kees/linux
Pull rodata updates from Kees Cook:
"This renames the (now inaccurate) DEBUG_RODATA and related
SET_MODULE_RONX configs to the more sensible STRICT_KERNEL_RWX and
STRICT_MODULE_RWX"
* tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX
arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common
Linus Torvalds [Wed, 22 Feb 2017 01:53:23 +0000 (17:53 -0800)]
Merge tag 'usercopy-v4.11-rc1' of git://git./linux/kernel/git/kees/linux
Pull usercopy test updates from Kees Cook:
"This improves the usercopy tests:
- check zeroing on failed copy_from_user()/get_user() (caught bug on
ARM)
- adjust tests for SMAP/PAN (can't zero userspace memory on failure)"
* tag 'usercopy-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usercopy: Add tests for all get_user() sizes
usercopy: Adjust tests to deal with SMAP/PAN
usercopy: add testcases to check zeroing on failure
Linus Torvalds [Wed, 22 Feb 2017 01:51:37 +0000 (17:51 -0800)]
Merge tag 'pstore-v4.11-rc1' of git://git./linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:
"Minor changes to pstore tree:
- update MAINTAINERS with current git repo, add more files.
- move prz allocation checks into the walker
- initialize flags correctly (by accident spinlock was technically
ok)"
* tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
MAINTAINERS: Adjust pstore git repo URI, add files
pstore: Check for prz allocation in walker
pstore: Correctly initialize spinlock and flags
Linus Torvalds [Wed, 22 Feb 2017 01:28:25 +0000 (17:28 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID fix from Jiri Kosina:
"Regression fix for HID_RMI-driven synaptics touchpads in
!CONFIG_HID_RMI cases"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: rmi: fallback to generic/multitouch if hid-rmi is not built
Linus Torvalds [Wed, 22 Feb 2017 01:21:32 +0000 (17:21 -0800)]
Merge branch 'for-4.11' of git://git./linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
- Bartlomiej added pata_falcon
- Christoph is trying to remove use of static 4k buf. It's still WIP
- config cleanup around HAS_DMA
- other fixes and driver-specific changes
* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (29 commits)
ata: pata_of_platform: using of_property_read_u32() helper
pata_atiixp: Don't use unconnected secondary port on SB600/SB700
libata-sff: Don't scan disabled ports when checking for legacy mode.
pata_octeon_cf: remove unused local variables from octeon_cf_set_piomode()
ahci: qoriq: added ls2088a platforms support
ahci: qoriq: report error when ecc register address is missing in dts
ahci: qoriq: added a condition to enable dma coherence
Revert "libata: switch to dynamic allocation instead of ata_scsi_rbuf"
ahci: imx: fix building without hwmon or thermal
ata: add Atari Falcon PATA controller driver
ata: pass queued command to ->sff_data_xfer method
ata: allow subsystem to be used on m68k arch
libata: switch to dynamic allocation instead of ata_scsi_rbuf
libata: don't call ata_scsi_rbuf_fill for command without a response buffer
libata: call ->scsi_done from ata_scsi_simulate
libata: remove the done callback from ata_scsi_args
libata: move struct ata_scsi_args to libata-scsi.c
libata: avoid global response buffer in atapi_qc_complete
libata-eh: Use switch() instead of sparse array for protocol strings
ata: sata_mv: Convert to devm_ioremap_resource()
...
Linus Torvalds [Wed, 22 Feb 2017 01:06:22 +0000 (17:06 -0800)]
Merge tag 'dmaengine-4.11-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
"This time we fairly boring and bit small update.
- Support for Intel iDMA 32-bit hardware
- deprecate broken support for channel switching in async_tx
- bunch of updates on stm32-dma
- Cyclic support for zx dma and making in generic zx dma driver
- Small updates to bunch of other drivers"
* tag 'dmaengine-4.11-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (29 commits)
async_tx: deprecate broken support for channel switching
dmaengine: rcar-dmac: Widen DMA mask to 40 bits
dmaengine: sun6i: allow build on ARM64 platforms (sun50i)
dmaengine: Provide a wrapper for memcpy operations
dmaengine: zx: fix build warning
dmaengine: dw: we do support Merrifield SoC in PCI mode
dmaengine: dw: add support of iDMA 32-bit hardware
dmaengine: dw: introduce register mappings for iDMA 32-bit
dmaengine: dw: introduce block2bytes() and bytes2block()
dmaengine: dw: extract dwc_chan_pause() for future use
dmaengine: dw: replace convert_burst() with one liner
dmaengine: dw: register IRQ and DMA pool with instance ID
dmaengine: dw: Fix data corruption in large device to memory transfers
dmaengine: ste_dma40: indicate granularity on channels
dmaengine: ste_dma40: indicate directions on channels
dmaengine: stm32-dma: Add error messages if xlate fails
dmaengine: dw: pci: remove LPE Audio DMA ID
dmaengine: stm32-dma: Add max_burst support
dmaengine: stm32-dma: Add synchronization support
dmaengine: stm32-dma: Fix residue computation issue in cyclic mode
...
Linus Torvalds [Wed, 22 Feb 2017 00:58:32 +0000 (16:58 -0800)]
Merge tag 'media/v4.11-1' of git://git./linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- new drivers:
- i.MX6 Video Data Order Adapter's (VDOA)
- Toshiba et8ek8 5MP sensor
- STM DELTA multi-format video decoder V4L2 driver
- SPI connected IR LED
- Mediatek IR remote receiver
- ZyDAS ZD1301 DVB USB interface driver
- new RC keymaps
- some very old LIRC drivers got removed from staging
- RC core gained support encoding IR scan codes
- DVB si2168 gained support for DVBv5 statistics
- lirc_sir driver ported to rc-core and promoted from staging
- other bug fixes, board additions and driver improvements
* tag 'media/v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (230 commits)
[media] mtk-vcodec: fix build warnings without DEBUG
[media] zd1301: fix building interface driver without demodulator
[media] usbtv: add sharpness control
[media] cxusb: Use a dma capable buffer also for reading
[media] ttpci: address stringop overflow warning
[media] dvb-usb-v2: avoid use-after-free
[media] add Hama Hybrid DVB-T Stick support
[media] et8ek8: Fix compiler / Coccinelle warnings
[media] media: fix semicolon.cocci warnings
[media] media: exynos4-is: add flags to dummy Exynos IS i2c adapter
[media] v4l: of: check for unique lanes in data-lanes and clock-lanes
[media] coda/imx-vdoa: constify structs
[media] st-delta: debug: trace stream/frame information & summary
[media] st-delta: add mjpeg support
[media] st-delta: EOS (End Of Stream) support
[media] st-delta: rpmsg ipc support
[media] st-delta: add memory allocator helper functions
[media] st-delta: STiH4xx multi-format video decoder v4l2 driver
[media] MAINTAINERS: add st-delta driver
[media] ARM: multi_v7_defconfig: enable STMicroelectronics DELTA Support
...
Linus Torvalds [Wed, 22 Feb 2017 00:34:22 +0000 (16:34 -0800)]
Merge tag 'pinctrl-v4.11-1' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"Pin control bulk changes for the v4.11 kernel cycle.
Core changes:
- Switch the generic pin config argument from 16 to 24 bits, only use
8 bits for the configuration type. We might need to encode more
information about a certain setting than we need to encode
different generic settings.
- Add a cross-talk API to the pin control GPIO back-end, utilizing
pinctrl_gpio_set_config() from GPIO drivers that want to set up a
certain pin configuration in the back-end.
This also includes the .set_config() refactoring of the GPIO chips,
so that they pass a generic configuration for things like
debouncing and single ended (typically open drain). This change has
also been merged in an immutable branch to the GPIO tree.
- Take hogs with a delayed work, so that we finalize probing a pin
controller before trying to get any hogs.
- For pin controllers putting all group and function definitions into
the device tree, we now have generic code to deal with this and it
is used in two drivers so far.
- Simplifications of the pin request conflict check.
- Make dt_free_map() optional.
Updates to drivers:
- pinctrl-single now use the generic helpers to generate dynamic
group and function tables from the device tree.
- Texas Instruments IOdelay configuration driver add-on to
pinctrl-single.
- i.MX: use radix trees to store groups and functions, use the new
generic group and function helpers to manage them.
- Intel: add support for hardware debouncing and 1K pull-down. New
subdriver for the Gemini Lake SoC.
- Renesas SH-PFC: drive strength and bias support, CAN bus muxing,
MSIOF, SDHI, HSCIF for r8a7796. Gyro-ADC supporton r8a7791.
- Aspeed: use syscon cross-dependencies to set up related bits in the
LPC host controller and display controller.
- Aspeed: finalize G4 and G5 support. Fix mux configuration on GPIOs.
Add banks Y, Z, AA, AB and AC.
- AMD: support additional GPIO.
- STM32: set this controller to strict muxing mode. STM32H743 MCU
support.
- Allwinner sunxi: deep simplifications on how to support subvariants
of SoCs without adding to much SoC-specific data for each
subvariant, especially for sun5i variants. New driver for V3s SoCs.
New driver for the H5 SoC. Support A31/A31s variants with the new
variant framework.
- Mvebu: simplifications to use a MMIO and regmap abstraction. New
subdrivers for the 98DX3236, 98DX5241 SoCs.
- Samsung Exynos: delete Exynos4415 support. Add crosstalk to the SoC
driver to access regmaps. Add infrastructure for pin-bank retention
control. Clean out the pin retention control from
arch/arm/mach-exynos and arch/arm/mach-s5p and put it properly in
the Samsung pin control driver(s).
- Meson: add HDMI HPD/DDC pins. Add pwm_ao_b pin.
- Qualcomm: use raw spinlock variants: this makes the qualcomm driver
realtime-safe"
* tag 'pinctrl-v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (111 commits)
pinctrl: samsung: Fix return value check in samsung_pinctrl_get_soc_data()
pinctrl: intel: unlock on error in intel_config_set_pull()
pinctrl: berlin: make bool drivers explicitly non-modular
pinctrl: spear: make bool drivers explicitly non-modular
pinctrl: mvebu: make bool drivers explicitly non-modular
pinctrl: sunxi: make sun5i explicitly non-modular
pinctrl: sunxi: Remove stray printk call in sun5i driver's probe function
pinctrl: samsung: mark PM functions as __maybe_unused
pinctrl: sunxi: Remove redundant A31s pinctrl driver
pinctrl: sunxi: Support A31/A31s with pinctrl variants
pinctrl: Amend bindings for STM32 pinctrl
pinctrl: Add STM32 pinctrl driver DT bindings
pinctrl: stm32: Add STM32H743 MCU support
include: dt-bindings: Add STM32H7 pinctrl DT defines
gpio: aspeed: Remove dependence on GPIOF_* macros
pinctrl: stm32: fix bad location of gpiochip_lock_as_irq
drivers: pinctrl: add driver for Allwinner H5 SoC
pinctrl: intel: Add Intel Gemini Lake pin controller support
pinctrl: intel: Add support for 1k additional pull-down
pinctrl: intel: Add support for hardware debouncer
...
Jiri Kosina [Wed, 22 Feb 2017 00:13:52 +0000 (01:13 +0100)]
HID: rmi: fallback to generic/multitouch if hid-rmi is not built
Commit
279967a65b32 ("HID: rmi: Handle all Synaptics touchpads using hid-rmi")
unconditionally switches over handling of all Synaptics touchpads to hid-rmi
(to make use of extended features of the HW); in case CONFIG_HID_RMI is
disabled though this renders the touchpad unusable, as the
HID_DEVICE(HID_BUS_ANY, HID_GROUP_RMI, HID_ANY_ID, HID_ANY_ID)
match doesn't exist and generic/multitouch doesn't bind to it either (due
to hid group mismatch).
Fix this by switching over to hid-rmi only if it has been actually built.
Fixes: 279967a65b32 ("HID: rmi: Handle all Synaptics touchpads using hid-rmi")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Tested-by: Andrew Duggan <aduggan@synaptics.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Linus Torvalds [Tue, 21 Feb 2017 22:28:55 +0000 (14:28 -0800)]
Merge tag 'extable-for-linus' of git://git./linux/kernel/git/paulg/linux
Pull exception table module split from Paul Gortmaker:
"Final extable.h related changes.
This completes the separation of exception table content from the
module.h header file. This is achieved with the final commit that
removes the one line back compatible change that sourced extable.h
into the module.h file.
The commits are unchanged since January, with the exception of a
couple Acks that came in for the last two commits a bit later. The
changes have been in linux-next for quite some time[1] and have got
widespread arch coverage via toolchains I have and also from
additional ones the kbuild bot has.
Maintaners of the various arch were Cc'd during the postings to
lkml[2] and informed that the intention was to take the remaining arch
specific changes and lump them together with the final two non-arch
specific changes and submit for this merge window.
The ia64 diffstat stands out and probably warrants a mention. In an
earlier review, Al Viro made a valid comment that the original header
separation of content left something to be desired, and that it get
fixed as a part of this change, hence the larger diffstat"
* tag 'extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (21 commits)
module.h: remove extable.h include now users have migrated
core: migrate exception table users off module.h and onto extable.h
cris: migrate exception table users off module.h and onto extable.h
hexagon: migrate exception table users off module.h and onto extable.h
microblaze: migrate exception table users off module.h and onto extable.h
unicore32: migrate exception table users off module.h and onto extable.h
score: migrate exception table users off module.h and onto extable.h
metag: migrate exception table users off module.h and onto extable.h
arc: migrate exception table users off module.h and onto extable.h
nios2: migrate exception table users off module.h and onto extable.h
sparc: migrate exception table users onto extable.h
openrisc: migrate exception table users off module.h and onto extable.h
frv: migrate exception table users off module.h and onto extable.h
sh: migrate exception table users off module.h and onto extable.h
xtensa: migrate exception table users off module.h and onto extable.h
mn10300: migrate exception table users off module.h and onto extable.h
alpha: migrate exception table users off module.h and onto extable.h
arm: migrate exception table users off module.h and onto extable.h
m32r: migrate exception table users off module.h and onto extable.h
ia64: ensure exception table search users include extable.h
...
Linus Torvalds [Tue, 21 Feb 2017 22:21:11 +0000 (14:21 -0800)]
Merge tag 'mips_4.11' of git://git./linux/kernel/git/jhogan/mips
Pull MIPS updates from James Hogan:
"Here's the main MIPS pull request for 4.11.
It contains a few new features such as IRQ stacks, cacheinfo support,
and KASLR for Octeon CPUs, and a variety of smaller improvements and
fixes including devicetree additions, kexec cleanups, microMIPS stack
unwinding fixes, and a bunch of build fixes to clean up continuous
integration builds.
Its all been in linux-next for at least a couple of days, most of it
far longer.
Miscellaneous:
- Add IRQ stacks
- Add cacheinfo support
- Add "uzImage.bin" zboot target
- Unify performance counter definitions
- Export various (mainly assembly) symbols alongside their
definitions
- Audit and remove unnecessary uses of module.h
kexec & kdump:
- Lots of improvements and fixes
- Add correct copy_regs implementations
- Add debug logging of new kernel information
Security:
- Use Makefile.postlink to insert relocations into vmlinux
- Provide plat_post_relocation hook (used for Octeon KASLR)
- Add support for tuning mmap randomisation
- Relocate DTB
microMIPS:
- A load of unwind fixes
- Add some missing .insn to fix link errors
MIPSr6:
- Fix MULTU/MADDU/MSUBU sign extension in r2 emulation
- Remove r2_emul_return and use ERETNC unconditionally on MIPSr6
- Allow pre-r6 emulation on SMP MIPSr6 kernels
Cache management:
- Treat physically indexed dcache as non-aliasing
- Add return errors to protected cache ops for KVM
- CM3: Ensure L1 & L2 cache ECC checking matches
- CM3: Indicate inclusive caches
- I6400: Treat dcache as physically indexed
Memory management:
- Ensure bootmem doesn't corrupt reserved memory
- Export some TLB exception generation functions for KVM
OF:
- NULL check initial_boot_params before use in of_scan_flat_dt()
- Fix unaligned access in of_alias_scan()
SMP:
- CPS: Don't BUG if a CPU fails to start
Other fixes:
- Fix longstanding 64-bit IP checksum carry bug
- Fix KERN_CONT fallout in cpu-bugs64.c and sync-r4k.c
- Update defconfigs for NF_CT_PROTO_DCCP, DPLITE,
CPU_FREQ_STAT,SCSI_DH changes
- Disable certain builtin compiler options, stack-check (whole
kernel), asynchronous-unwind-tables (VDSO).
- A bunch of build fixes from kernelci.org testing
- Various other minor cleanups & corrections
BMIPS:
- Migrate interrupts during bmips_cpu_disable
- BCM47xx: Add Luxul devices
- BCM47xx: Fix Asus WL-500W button inversion
- BCM7xxx: Add SPI device nodes
Generic (multiplatform):
- Add kexec DTB passing
- Fix big endian
- Add cpp_its_S in ksym_dep_filter to silence build warning
IP22:
- Reformat inline assembler code to modern standards
- Fix binutils 2.25 build error
IP27:
- Fix duplicate CAC_BASE definition build error
- Disable qlge driver to workaround broken compiler
Lantiq:
- Refresh defconfig and activate more drivers
- Lock DMA register access
- Fix cascading IRQ setup
- Fix build of VPE loader
- xway: Fix ethernet packet header corruption over reboot
Loongson1
- Add watchdog support
- 1B: Reduce DEFAULT_MEMSIZE to 64MB
- 1B: Change OSC clock name to match rest of kernel
- 1C: Remove ARCH_WANT_OPTIONAL_GPIOLIB
Octeon:
- Add KASLR support
- Support Octeon III USB controller
- Fix large copy_from_user corner case
- Enable devtmpfs in defconfig
Netlogic:
- Fix non-default XLR build error due to netlogic,xlp-pic code
- Fix assembler warning from smpboot.S
pic32mzda:
- Fix linker error when early printk is disabled
Pistachio:
- Add base device tree
- Add Ci40 "Marduk" device tree
Ralink:
- Support raw appended DTB
- Add missing I2C & I2S clocks
- Add missing pinmux and fix pinmux function name typo
- Add missing clk_round_rate()
- Clean up prom_init()
- MT7621: Set SoC type
- MT7621: Support highmem
TXx9:
- Modernize printing of kernel messages and resolve KERN_CONT fallout
- 7segled: use permission-specific DEVICE_ATTR variants
XilFPGA:
- Add IRQ controller and UART IRQ
- Add AXI I2C and emaclite to DT & defconfig"
* tag 'mips_4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (148 commits)
MIPS: VDSO: Explicitly use -fno-asynchronous-unwind-tables
MIPS: BCM47XX: Fix button inversion for Asus WL-500W
MIPS: DTS: Add img directory to Makefile
MIPS: ip27: Disable qlge driver in defconfig
MIPS: pic32mzda: Fix linker error for pic32_get_pbclk()
MIPS: Lantiq: Keep ethernet enabled during boot
MIPS: OCTEON: Fix copy_from_user fault handling for large buffers
MIPS: Fix special case in 64 bit IP checksumming.
MIPS: OCTEON: Enable DEVTMPFS
MIPS: lantiq: Set physical_memsize
MIPS: sysmips: Remove duplicated include from syscall.c
Kbuild: Add cpp_its_S in ksym_dep_filter
MIPS: Audit and remove any unnecessary uses of module.h
MIPS: Unify perf counter register definitions
MIPS: Disable stack checks on MIPS kernels
MIPS: OCTEON: Platform support for OCTEON III USB controller
MIPS: Lantiq: Fix cascaded IRQ setup
MIPS: sync-r4k: Fix KERN_CONT fallout
MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch
MIPS: Fix distclean with Makefile.postlink
...
Linus Torvalds [Tue, 21 Feb 2017 21:53:41 +0000 (13:53 -0800)]
Merge tag 'for-linus-4.11-rc0-tag' of git://git./linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes:
- a series from Boris Ostrovsky adding support for booting Linux as
Xen PVH guest
- a series from Juergen Gross streamlining the xenbus driver
- a series from Paul Durrant adding support for the new device model
hypercall
- several small corrections"
* tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add IOCTL_PRIVCMD_RESTRICT
xen/privcmd: Add IOCTL_PRIVCMD_DM_OP
xen/privcmd: return -ENOTTY for unimplemented IOCTLs
xen: optimize xenbus driver for multiple concurrent xenstore accesses
xen: modify xenstore watch event interface
xen: clean up xenbus internal headers
xenbus: Neaten xenbus_va_dev_error
xen/pvh: Use Xen's emergency_restart op for PVH guests
xen/pvh: Enable CPU hotplug
xen/pvh: PVH guests always have PV devices
xen/pvh: Initialize grant table for PVH guests
xen/pvh: Make sure we don't use ACPI_IRQ_MODEL_PIC for SCI
xen/pvh: Bootstrap PVH guest
xen/pvh: Import PVH-related Xen public interfaces
xen/x86: Remove PVH support
x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C
xen/manage: correct return value check on xenbus_scanf()
x86/xen: Fix APIC id mismatch warning on Intel
xen/netback: set default upper limit of tx/rx queues to 8
xen/netfront: set default upper limit of tx/rx queues to 8
Linus Torvalds [Tue, 21 Feb 2017 21:25:50 +0000 (13:25 -0800)]
Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
"The audit changes for v4.11 are relatively small compared to what we
did for v4.10, both in terms of size and impact.
- two patches from Steve tweak the formatting for some of the audit
records to make them more consistent with other audit records.
- three patches from Richard record the name of a module on module
load, fix the logging of sockaddr information when using
socketcall() on 32-bit systems, and add the ability to reset
audit's lost record counter.
- my lone patch just fixes an annoying style nit that I was reminded
about by one of Richard's patches.
All these patches pass our test suite"
* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
audit: remove unnecessary curly braces from switch/case statements
audit: log module name on init_module
audit: log 32-bit socketcalls
audit: add feature audit_lost reset
audit: Make AUDIT_ANOM_ABEND event normalized
audit: Make AUDIT_KERNEL event conform to the specification
Kalle Valo [Tue, 21 Feb 2017 19:47:06 +0000 (21:47 +0200)]
Revert "ath10k: Search SMBIOS for OEM board file extension"
This reverts commit
f2593cb1b29185d38db706cbcbe22ed538720ae1.
Paul reported that this patch with older board-2.bin ath10k initialisation
fails on Dell XPS 13:
ath10k_pci 0000:3a:00.0: failed to fetch board data for bus=pci,vendor=168c,
device=003e,subsystem-vendor=1a56,subsystem-device=1535,variant=RV_0520 from
ath10k/QCA6174/hw3.0/board-2.bin
The reason is that the older board-2.bin does not have the variant version of
the image name and ath10k does not fallback to the older naming scheme.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=185621#c9
Fixes: f2593cb1b291 ("ath10k: Search SMBIOS for OEM board file extension")
Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 21 Feb 2017 20:49:56 +0000 (12:49 -0800)]
Merge branch 'next' of git://git./linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
"Highlights:
- major AppArmor update: policy namespaces & lots of fixes
- add /sys/kernel/security/lsm node for easy detection of loaded LSMs
- SELinux cgroupfs labeling support
- SELinux context mounts on tmpfs, ramfs, devpts within user
namespaces
- improved TPM 2.0 support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits)
tpm: declare tpm2_get_pcr_allocation() as static
tpm: Fix expected number of response bytes of TPM1.2 PCR Extend
tpm xen: drop unneeded chip variable
tpm: fix misspelled "facilitate" in module parameter description
tpm_tis: fix the error handling of init_tis()
KEYS: Use memzero_explicit() for secret data
KEYS: Fix an error code in request_master_key()
sign-file: fix build error in sign-file.c with libressl
selinux: allow changing labels for cgroupfs
selinux: fix off-by-one in setprocattr
tpm: silence an array overflow warning
tpm: fix the type of owned field in cap_t
tpm: add securityfs support for TPM 2.0 firmware event log
tpm: enhance read_log_of() to support Physical TPM event log
tpm: enhance TPM 2.0 PCR extend to support multiple banks
tpm: implement TPM 2.0 capability to get active PCR banks
tpm: fix RC value check in tpm2_seal_trusted
tpm_tis: fix iTPM probe via probe_itpm() function
tpm: Begin the process to deprecate user_read_timer
tpm: remove tpm_read_index and tpm_write_index from tpm.h
...
Linus Torvalds [Tue, 21 Feb 2017 20:11:41 +0000 (12:11 -0800)]
Merge tag 'dm-4.11-changes' of git://git./linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Fix dm-raid transient device failure processing and other smaller
tweaks.
- Add journal support to the DM raid target to close the 'write hole'
on raid 4/5/6.
- Fix dm-cache corruption, due to rounding bug, when cache exceeds 2TB.
- Add 'metadata2' feature to dm-cache to separate the dirty bitset out
from other cache metadata. This improves speed of shutting down a
large cache device (which implies writing out dirty bits).
- Fix a memory leak during dm-stats data structure destruction.
- Fix a DM multipath round-robin path selector performance regression
that was caused by less precise balancing across all paths.
- Lastly, introduce a DM core fix for a long-standing DM snapshot
deadlock that is rooted in the complexity of the device stack used in
conjunction with block core maintaining bios on current->bio_list to
manage recursion in generic_make_request(). A more comprehensive fix
to block core (and its hook in the cpu scheduler) would be wonderful
but this DM-specific fix is pragmatic considering how difficult it
has been to make progress on a generic fix.
* tag 'dm-4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
dm: flush queued bios when process blocks to avoid deadlock
dm round robin: revert "use percpu 'repeat_count' and 'current_path'"
dm stats: fix a leaked s->histogram_boundaries array
dm space map metadata: constify dm_space_map structures
dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()
dm persistent data: add cursor skip functions to the cursor APIs
dm cache metadata: use dm_bitset_new() to create the dirty bitset in format 2
dm bitset: add dm_bitset_new()
dm cache metadata: name the cache block that couldn't be loaded
dm cache metadata: add "metadata2" feature
dm cache metadata: use bitset cursor api to load discard bitset
dm bitset: introduce cursor api
dm btree: use GFP_NOFS in dm_btree_del()
dm space map common: memcpy the disk root to ensure it's arch aligned
dm block manager: add unlikely() annotations on dm_bufio error paths
dm cache: fix corruption seen when using cache > 2TB
dm raid: cleanup awkward branching in raid_message() option processing
dm raid: use mddev rather than rdev->mddev
dm raid: use read_disk_sb() throughout
dm raid: add raid4/5/6 journaling support
...
Linus Torvalds [Tue, 21 Feb 2017 20:04:54 +0000 (12:04 -0800)]
Merge tag 'mmc-v4.11' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Add support for Marvell SD8787 Wifi/BT chip
- Improve UHS support for SDIO
- Invent MMC_CAP_3_3V_DDR and a DT binding for eMMC DDR 3.3V mode
- Detect Auto BKOPS enable bit
- Export eMMC device lifetime information through sysfs
- First take to slim down the public mmc headers to avoid abuse
- Re-factoring of the mmc block device driver to prepare for blkmq
- Cleanup code for the mmc block device driver
- Clarify and cleanup code dealing with data requests
- Cleanup some code by converting to ida_simple_ functions
- Cleanup code dealing with card quirks
- Cleanup private and public mmc header files
MMC host:
- Don't rely on public mmc headers to include non-mmc related headers
- meson: Add support for eMMC HS400 mode
- meson: Various cleanups and improvements
- omap_hsmmc: Use the proper provided busy timeout from the core
- sunxi: Enable new timings for the A64 MMC controllers
- sunxi: Improvements for clock management
- tmio: Improvements for SDIO interrupts
- mxs-mmc: Add CMD23 support
- sdhci-msm: Enable HS400 enhanced strobe mode support
- sdhci-msm: Correct HS400 tuning sequence
- sdhci-acpi: Support deferred probe
- sdhci-pci: Add support for eMMC HS200 tuning mode on AMD
- mediatek: Correct the implementation of card busy detection
- dw_mmc: Initial support for ZX mmc controller
- sh_mobile_sdhi: Enable support for eMMC HS200 mode
- sh_mmcif: Various cleanups and improvements"
* tag 'mmc-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (145 commits)
mmc: core: add mmc prefix for blk_fixups
mmc: core: move all quirks together into quirks.h
mmc: core: improve the quirks for sdio devices
mmc: core: move some sdio IDs out of quirks file
mmc: core: change quirks.c to be a header file
mmc: sdhci-cadence: fix bit shift of read data from PHY port
mmc: Adding AUTO_BKOPS_EN bit set for Auto BKOPS support
mmc: MAN_BKOPS_EN inverse debug message logic
mmc: meson-gx: add support for HS400 mode
mmc: meson-gx: remove unneeded checks in remove
mmc: meson-gx: reduce bounce buffer size
mmc: meson-gx: set max block count and request size
mmc: meson-gx: improve interrupt handling
mmc: meson-gx: improve meson_mmc_irq_thread
mmc: meson-gx: improve meson_mmc_clk_set
mmc: meson-gx: minor improvements in meson_mmc_set_ios
mmc: meson: Assign the minimum clk rate as close to 400KHz as possible
mmc: core: start to break apart mmc_start_areq()
mmc: block: respect bool returned from blk_end_request()
mmc: block: return errorcode from mmc_sd_num_wr_blocks()
...
Kees Cook [Tue, 14 Feb 2017 20:38:07 +0000 (12:38 -0800)]
usercopy: Add tests for all get_user() sizes
The existing test was only exercising native unsigned long size
get_user(). For completeness, we should check all sizes. But we
must skip some 32-bit architectures that don't implement a 64-bit
get_user().
These new tests actually uncovered a bug in ARM's 64-bit get_user()
zeroing.
Signed-off-by: Kees Cook <keescook@chromium.org>
Linus Torvalds [Tue, 21 Feb 2017 19:51:42 +0000 (11:51 -0800)]
Merge tag 'scsi-misc' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
ufs, lpfc, be2iscsi, hisi_sas, storvsc, cxlflash, aacraid,
megaraid_sas, ...).
There's also an assortment of minor fixes and the major update of
switching a bunch of drivers to pci_alloc_irq_vectors from Christoph"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (188 commits)
scsi: megaraid_sas: handle dma_addr_t right on 32-bit
scsi: megaraid_sas: array overflow in megasas_dump_frame()
scsi: snic: switch to pci_irq_alloc_vectors
scsi: megaraid_sas: driver version upgrade
scsi: megaraid_sas: Change RAID_1_10_RMW_CMDS to RAID_1_PEER_CMDS and set value to 2
scsi: megaraid_sas: Indentation and smatch warning fixes
scsi: megaraid_sas: Cleanup VD_EXT_DEBUG and SPAN_DEBUG related debug prints
scsi: megaraid_sas: Increase internal command pool
scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete
scsi: megaraid_sas: Bail out the driver load if ld_list_query fails
scsi: megaraid_sas: Change build_mpt_mfi_pass_thru to return void
scsi: megaraid_sas: During OCR, if get_ctrl_info fails do not continue with OCR
scsi: megaraid_sas: Do not set fp_possible if TM capable for non-RW syspdIO, change fp_possible to bool
scsi: megaraid_sas: Remove unused pd_index from megasas_build_ld_nonrw_fusion
scsi: megaraid_sas: megasas_return_cmd does not memset IO frame to zero
scsi: megaraid_sas: max_fw_cmds are decremented twice, remove duplicate
scsi: megaraid_sas: update can_queue only if the new value is less
scsi: megaraid_sas: Change max_cmd from u32 to u16 in all functions
scsi: megaraid_sas: set pd_after_lb from MR_BuildRaidContext and initialize pDevHandle to MR_DEVHANDLE_INVALID
scsi: megaraid_sas: latest controller OCR capability from FW before sending shutdown DCMD
...
Linus Torvalds [Tue, 21 Feb 2017 18:57:33 +0000 (10:57 -0800)]
Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
Mark Brown [Tue, 21 Feb 2017 17:29:01 +0000 (09:29 -0800)]
sched/core: Fix build paravirt build on arm and arm64
Commit
004172bdad64 ("sched/core: Remove unnecessary #include headers")
removed the inclusion of asm/paravirt.h which is used to get
declarations of paravirt_steal_rq_enabled and paravirt_steal_clock.
It is implicitly included on x86 but not on arm and arm64 breaking the
build if paravirtualization is used. Since things from that header are
used directly fix the build by putting the direct inclusion back.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Maxime Jayat [Tue, 21 Feb 2017 17:35:51 +0000 (18:35 +0100)]
net: socket: fix recvmmsg not returning error from sock_error
Commit
34b88a68f26a ("net: Fix use after free in the recvmmsg exit path"),
changed the exit path of recvmmsg to always return the datagrams
variable and modified the error paths to set the variable to the error
code returned by recvmsg if necessary.
However in the case sock_error returned an error, the error code was
then ignored, and recvmmsg returned 0.
Change the error path of recvmmsg to correctly return the error code
of sock_error.
The bug was triggered by using recvmmsg on a CAN interface which was
not up. Linux 4.6 and later return 0 in this case while earlier
releases returned -ENETDOWN.
Fixes: 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path")
Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Tue, 21 Feb 2017 14:27:28 +0000 (15:27 +0100)]
bnxt_en: use eth_hw_addr_random()
Use eth_hw_addr_random() to set a random MAC address in order to make
sure bp->dev->addr_assign_type will be properly set to NET_ADDR_RANDOM.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 21 Feb 2017 18:30:15 +0000 (13:30 -0500)]
Merge branch 'bpf-unlocking-fix'
Daniel Borkmann says:
====================
BPF fix with regards to unlocking
This set fixes the issue Eric was reporting recently [1].
First patch is a prerequisite discussed with Laura that is
needed for the later fix in the second one. I've tested this
extensively and it does not reproduce anymore on my side
after the fix. Thanks & sorry about that!
[1] https://www.spinics.net/lists/netdev/msg421877.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 21 Feb 2017 15:09:34 +0000 (16:09 +0100)]
bpf: fix unlocking of jited image when module ronx not set
Eric and Willem reported that they recently saw random crashes when
JIT was in use and bisected this to
74451e66d516 ("bpf: make jited
programs visible in traces"). Issue was that the consolidation part
added bpf_jit_binary_unlock_ro() that would unlock previously made
read-only memory back to read-write. However, DEBUG_SET_MODULE_RONX
cannot be used for this to test for presence of set_memory_*()
functions. We need to use ARCH_HAS_SET_MEMORY instead to fix this;
also add the corresponding bpf_jit_binary_lock_ro() to filter.h.
Fixes: 74451e66d516 ("bpf: make jited programs visible in traces")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Bisected-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 21 Feb 2017 15:09:33 +0000 (16:09 +0100)]
arch: add ARCH_HAS_SET_MEMORY config
Currently, there's no good way to test for the presence of
set_memory_ro/rw/x/nx() helpers implemented by archs such as
x86, arm, arm64 and s390.
There's DEBUG_SET_MODULE_RONX and DEBUG_RODATA, however both
don't really reflect that: set_memory_*() are also available
even when DEBUG_SET_MODULE_RONX is turned off, and DEBUG_RODATA
is set by parisc, but doesn't implement above functions. Thus,
add ARCH_HAS_SET_MEMORY that is selected by mentioned archs,
where generic code can test against this.
This also allows later on to move DEBUG_SET_MODULE_RONX out of
the arch specific Kconfig to define it only once depending on
ARCH_HAS_SET_MEMORY.
Suggested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>