Christophe Leroy [Fri, 26 Apr 2019 15:58:03 +0000 (15:58 +0000)]
powerpc/mm: get rid of nohash/32/mmu.h and nohash/64/mmu.h
Those files have no real added values, especially the 64 bit
which only includes the common book3e mmu.h which is also
included from 32 bits side.
So lets do the final inclusion directly from nohash/mmu.h
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 15:58:02 +0000 (15:58 +0000)]
powerpc/mm: move pgtable_t in asm/mmu.h
pgtable_t is now identical for all subarches, move it to the
top level asm/mmu.h
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 15:58:01 +0000 (15:58 +0000)]
powerpc/mm: convert Book3E 64 to pte_fragment
Book3E 64 is the only subarch not using pte_fragment. In order
to allow refactorisation, this patch converts it to pte_fragment.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 15:57:59 +0000 (15:57 +0000)]
powerpc/mm: drop __bad_pte()
This has never been called (since Kernel has been in git at least),
drop it.
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:53 +0000 (05:59 +0000)]
powerpc/mm: flatten function __find_linux_pte() step 3
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.
Previous patches left a { } block. This patch removes it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:52 +0000 (05:59 +0000)]
powerpc/mm: flatten function __find_linux_pte() step 2
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.
Previous patch left { } blocks. This patch removes the first one
by shifting its content to the left.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:51 +0000 (05:59 +0000)]
powerpc/mm: flatten function __find_linux_pte() step 1
__find_linux_pte() is full of if/else which is hard to
follow allthough the handling is pretty simple.
This patch flattens the function by getting rid of as much if/else
as possible. In order to ease the review, this is done in three steps.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:49 +0000 (05:59 +0000)]
powerpc/mm: cleanup remaining ifdef mess in hugetlbpage.c
Only 3 subarches support huge pages. So when it is either 2 of them,
it is not the third one.
And mmu_has_feature() is known by all subarches so IS_ENABLED() can
be used instead of #ifdef
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:48 +0000 (05:59 +0000)]
powerpc/mm: cleanup HPAGE_SHIFT setup
Only book3s/64 may select default among several HPAGE_SHIFT at runtime.
8xx always defines 512K pages as default
FSL_BOOK3E always defines 4M pages as default
This patch limits HUGETLB_PAGE_SIZE_VARIABLE to book3s/64
moves the definitions in subarches files.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:47 +0000 (05:59 +0000)]
powerpc/mm: move hugetlb_disabled into asm/hugetlb.h
No need to have this in asm/page.h, move it into asm/hugetlb.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:46 +0000 (05:59 +0000)]
powerpc/mm: cleanup ifdef mess in add_huge_page_size()
Introduce a subarch specific helper check_and_get_huge_psize()
to check the huge page sizes and cleanup the ifdef mess in
add_huge_page_size()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:45 +0000 (05:59 +0000)]
powerpc/mm: add a helper to populate hugepd
This patchs adds a subarch helper to populate hugepd.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:44 +0000 (05:59 +0000)]
powerpc/mm: split asm/hugetlb.h into dedicated subarch files
Three subarches support hugepages:
- fsl book3e
- book3s/64
- 8xx
This patch splits asm/hugetlb.h to reduce the #ifdef mess.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:43 +0000 (05:59 +0000)]
powerpc/mm: make gup_hugepte() static
gup_huge_pd() is the only user of gup_hugepte() and it is
located in the same file. This patch moves gup_huge_pd()
after gup_hugepte() and makes gup_hugepte() static.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:42 +0000 (05:59 +0000)]
powerpc/mm: make hugetlbpage.c depend on CONFIG_HUGETLB_PAGE
The only function in hugetlbpage.c which doesn't depend on
CONFIG_HUGETLB_PAGE is gup_hugepte(), and this function is
only called from gup_huge_pd() which depends on
CONFIG_HUGETLB_PAGE so all the content of hugetlbpage.c
depends on CONFIG_HUGETLB_PAGE.
This patch modifies Makefile to only compile hugetlbpage.c
when CONFIG_HUGETLB_PAGE is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:41 +0000 (05:59 +0000)]
powerpc/mm: move __find_linux_pte() out of hugetlbpage.c
__find_linux_pte() is the only function in hugetlbpage.c
which is compiled in regardless on CONFIG_HUGETLBPAGE
This patch moves it in pgtable.c.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:40 +0000 (05:59 +0000)]
powerpc/book3e: hugetlbpage is only for CONFIG_PPC_FSL_BOOK3E
As per Kconfig.cputype, only CONFIG_PPC_FSL_BOOK3E gets to
select SYS_SUPPORTS_HUGETLBFS so simplify accordingly.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:39 +0000 (05:59 +0000)]
powerpc/64: only book3s/64 supports CONFIG_PPC_64K_PAGES
CONFIG_PPC_64K_PAGES cannot be selected by nohash/64.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 26 Apr 2019 05:59:38 +0000 (05:59 +0000)]
powerpc/book3e: drop mmu_get_tsize()
This function is not used anymore, drop it.
Fixes: b42279f0165c ("powerpc/mm/nohash: MM_SLICE is only used by book3s 64")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:36 +0000 (14:29 +0000)]
powerpc/mm: define subarch SLB_ADDR_LIMIT_DEFAULT
This patch defines a subarch specific SLB_ADDR_LIMIT_DEFAULT
to remove the #ifdefs around the setup of mm->context.slb_addr_limit
It also generalises the use of mm_ctx_set_slb_addr_limit() helper.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:35 +0000 (14:29 +0000)]
powerpc/mm: define get_slice_psize() all the time
get_slice_psize() can be defined regardless of CONFIG_PPC_MM_SLICES
to avoid ifdefs
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:34 +0000 (14:29 +0000)]
powerpc/8xx: get rid of #ifdef CONFIG_HUGETLB_PAGE for slices
The 8xx only selects CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE
is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:33 +0000 (14:29 +0000)]
powerpc/mm: remove a couple of #ifdef CONFIG_PPC_64K_PAGES in mm/slice.c
This patch replaces a couple of #ifdef CONFIG_PPC_64K_PAGES
by IS_ENABLED(CONFIG_PPC_64K_PAGES) to improve code maintainability.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:32 +0000 (14:29 +0000)]
powerpc/mm: remove unnecessary #ifdef CONFIG_PPC64
For PPC32 that's a noop, gcc should be smart enough to ignore it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:31 +0000 (14:29 +0000)]
powerpc/mm: get rid of mm_ctx_slice_mask_xxx()
Now that slice_mask_for_size() is in mmu.h, the mm_ctx_slice_mask_xxx()
are not needed anymore, so drop them. Note that the 8xx ones where
not used anyway.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:30 +0000 (14:29 +0000)]
powerpc/mm: move slice_mask_for_size() into mmu.h
Move slice_mask_for_size() into subarch mmu.h
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Retain the BUG_ON()s, rather than converting to VM_BUG_ON()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:29 +0000 (14:29 +0000)]
powerpc/mm: hand a context_t over to slice_mask_for_size() instead of mm_struct
slice_mask_for_size() only uses mm->context, so hand directly a
pointer to the context. This will help moving the function in
subarch mmu.h in the next patch by avoiding having to include
the definition of struct mm_struct
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:28 +0000 (14:29 +0000)]
powerpc/mm: no slice for nohash/64
Only nohash/32 and book3s/64 support mm slices.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 25 Apr 2019 14:29:27 +0000 (14:29 +0000)]
powerpc/mm: fix erroneous duplicate slb_addr_limit init
Commit
67fda38f0d68 ("powerpc/mm: Move slb_addr_linit to
early_init_mmu") moved slb_addr_limit init out of setup_arch().
Commit
701101865f5d ("powerpc/mm: Reduce memory usage for mm_context_t
for radix") brought it back into setup_arch() by error.
This patch reverts that erroneous regress.
Fixes: 701101865f5d ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 29 Mar 2019 10:00:02 +0000 (10:00 +0000)]
powerpc/mm: Move nohash specifics in subdirectory mm/nohash
Many files in arch/powerpc/mm are only for nohash. This patch
creates a subdirectory for them.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 29 Mar 2019 10:00:01 +0000 (10:00 +0000)]
powerpc/mm: Move book3s32 specifics in subdirectory mm/book3s64
Several files in arch/powerpc/mm are only for book3S32. This patch
creates a subdirectory for them.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Shorten new filenames]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 29 Mar 2019 10:00:00 +0000 (10:00 +0000)]
powerpc/mm: Move book3s64 specifics in subdirectory mm/book3s64
Many files in arch/powerpc/mm are only for book3S64. This patch
creates a subdirectory for them.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Update the selftest sym links, shorten new filenames, cleanup some
whitespace and formatting in the new files.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Fri, 29 Mar 2019 09:59:59 +0000 (09:59 +0000)]
powerpc/mm: change #include "mmu_decl.h" to <mm/mmu_decl.h>
This patch make inclusion of mmu_decl.h independant of the location
of the file including it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 28 Mar 2019 13:19:47 +0000 (13:19 +0000)]
powerpc/nohash64: clean pgtable.h
TRANSPARENT_HUGEPAGE is only supported by book3s
VMEMMAP_REGION_ID is never used
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 28 Mar 2019 13:03:45 +0000 (13:03 +0000)]
powerpc/book3e: drop BUG_ON() in map_kernel_page()
early_alloc_pgtable() never returns NULL as it panics on failure.
This patch drops the three BUG_ON() which check the non nullity
of early_alloc_pgtable() returned value.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Breno Leitao [Wed, 16 Jan 2019 16:47:44 +0000 (14:47 -0200)]
powerpc/tm: Avoid machine crash on rt_sigreturn()
There is a kernel crash that happens if rt_sigreturn() is called inside
a transactional block.
This crash happens if the kernel hits an in-kernel page fault when
accessing userspace memory, usually through copy_ckvsx_to_user(). A
major page fault calls might_sleep() function, which can cause a task
reschedule. A task reschedule (switch_to()) reclaim and recheckpoint
the TM states, but, in the signal return path, the checkpointed memory
was already reclaimed, thus the exception stack has MSR that points to
MSR[TS]=0.
When the code returns from might_sleep() and a task reschedule
happened, then this task is returned with the memory recheckpointed,
and CPU MSR[TS] = suspended.
This means that there is a side effect at might_sleep() if it is
called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0.
This side effect can cause a TM bad thing, since at the exception
entrance, the stack saves MSR[TS]=0, and this is what will be used at
RFID, but, the processor has MSR[TS] = Suspended, and this transition
will be invalid and a TM Bad thing will be raised, causing the
following crash:
Unexpected TM Bad Thing exception at
c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=
800000010280b033
cpu 0xc: Vector: 700 (Program Check) at [
c00000003ff1fd70]
pc:
c00000000000e9ec: fast_exception_return+0x100/0x1bc
lr:
c000000000032948: handle_rt_signal64+0xb8/0xaf0
sp:
c0000004263ebc40
msr:
8000000302a03031
current = 0xc000000415050300
paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01
pid = 25006, comm = sigfuz
Linux version
5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019
WARNING: exception is not recoverable, can't continue
enter ? for help
[
c0000004263ebc40]
c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable)
[
c0000004263ebd30]
c000000000022780 do_notify_resume+0x2f0/0x430
[
c0000004263ebe20]
c00000000000e844 ret_from_except_lite+0x70/0x74
--- Exception: c00 (System Call) at
00007fffbaac400c
SP (
7fffeca90f40) is in userspace
The solution for this problem is running the sigreturn code with
regs->msr[TS] disabled, thus, avoiding hitting the side effect above.
This does not seem to be a problem since regs->msr will be replaced by
the ucontext value, so, it is being flushed already. In this case, it
is flushed earlier.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Tue, 30 Apr 2019 07:59:07 +0000 (13:29 +0530)]
powerpc/mm/radix: Fix kernel crash when running subpage protect test
This patch fixes the below crash by making sure we touch the subpage
protection related structures only if we know they are allocated on
the platform. With radix translation we don't allocate hash context at
all and trying to access subpage_prot_table results in:
Faulting instruction address: 0xc00000000008bdb4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
....
NIP [
c00000000008bdb4] sys_subpage_prot+0x74/0x590
LR [
c00000000000b688] system_call+0x5c/0x70
Call Trace:
[
c00020002c6b7d30] [
c00020002c6b7d90] 0xc00020002c6b7d90 (unreliable)
[
c00020002c6b7e20] [
c00000000000b688] system_call+0x5c/0x70
Instruction dump:
fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f821ff11 e92d1178 f9210068
39200000 e92d0968 ebe90630 e93f03e8 <
eb891038>
60000000 3860fffe e9410068
We also move the subpage_prot_table with mmp_sem held to avoid race
between two parallel subpage_prot syscall.
Fixes: 701101865f5d ("powerpc/mm: Reduce memory usage for mm_context_t for radix")
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Mon, 29 Apr 2019 18:16:02 +0000 (23:46 +0530)]
powerpc/powernv/mce: Print additional information about MCE error.
Print more information about MCE error whether it is an hardware or
software error.
Some of the MCE errors can be easily categorized as hardware or
software errors e.g. UEs are due to hardware error, where as error
triggered due to invalid usage of tlbie is a pure software bug. But
not all the MCE errors can be easily categorize into either software
or hardware. There are errors like multihit errors which are usually
result of a software bug, but in some rare cases a hardware failure
can cause a multihit error. In past, we have seen case where after
replacing faulty chip, multihit errors stopped occurring. Same with
parity errors, which are usually due to faulty hardware but there are
chances where multihit can also cause an parity error. Such errors are
difficult to determine what really caused it. Hence this patch
classifies MCE errors into following four categorize:
1. Hardware error:
UE and Link timeout failure errors.
2. Probable hardware error (some chance of software cause)
SLB/ERAT/TLB Parity errors.
3. Software error
Invalid tlbie form.
4. Probable software error (some chance of hardware cause)
SLB/ERAT/TLB Multihit errors.
Sample output:
MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR:
000001001b6e0320 [Recovered]
MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [
00007fffa309dc60]
MCE: CPU80: Probable Software error (some chance of hardware cause)
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Mon, 29 Apr 2019 18:15:55 +0000 (23:45 +0530)]
powerpc/powernv/mce: Print correct severity for MCE error.
Currently all machine check errors are printed as severe errors which
isn't correct. Print soft errors as warning instead of severe errors.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mahesh Salgaonkar [Mon, 29 Apr 2019 18:15:48 +0000 (23:45 +0530)]
powerpc/powernv/mce: Reduce MCE console logs to lesser lines.
Also add cpu number while displaying MCE log. This will help cleaner
logs when MCE hits on multiple cpus simultaneously.
Before the changes the MCE output was:
Severe Machine check interrupt [Recovered]
NIP [
d00000000ba80280]: insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
Initiator: CPU
Error type: SLB [Multihit]
Effective address:
d00000000ba80280
After this patch series changes the MCE output will be:
MCE: CPU80: machine check (Warning) Host SLB Multihit [Recovered]
MCE: CPU80: NIP: [
d00000000b550280] insert_slb_entry.constprop.0+0x278/0x2c0 [mcetest_slb]
MCE: CPU80: Probable software error (some chance of hardware cause)
UE in host application:
MCE: CPU48: machine check (Severe) Host UE Load/Store DAR:
00007fffc6079a80 paddr:
0000000f8e260000 [Not recovered]
MCE: CPU48: PID: 4584 Comm: find NIP: [
0000000010023368]
MCE: CPU48: Hardware error
and for MCE in Guest:
MCE: CPU80: machine check (Warning) Guest SLB Multihit DAR:
000001001b6e0320 [Recovered]
MCE: CPU80: PID: 24765 Comm: qemu-system-ppc Guest NIP: [
00007fffa309dc60]
MCE: CPU80: Probable software error (some chance of hardware cause)
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anton Blanchard [Thu, 4 Oct 2018 06:23:37 +0000 (16:23 +1000)]
powerpc: Add doorbell tracepoints
When analysing sources of OS jitter, I noticed that doorbells cannot be
traced.
Signed-off-by: Anton Blanchard <anton@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
YueHaibing [Fri, 29 Mar 2019 15:44:56 +0000 (23:44 +0800)]
ocxl: remove set but not used variables 'tid' and 'lpid'
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/misc/ocxl/link.c: In function 'xsl_fault_handler':
drivers/misc/ocxl/link.c:187:17: warning: variable 'tid' set but not used
drivers/misc/ocxl/link.c:187:6: warning: variable 'lpid' set but not used
They are never used and can be removed.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Mathieu Malaterre [Wed, 13 Mar 2019 20:00:30 +0000 (21:00 +0100)]
powerpc/64s: Remove 'dummy_copy_buffer'
In commit
2bf1071a8d50 ("powerpc/64s: Remove POWER9 DD1 support") the
function __switch_to remove usage for 'dummy_copy_buffer'. Since it is
not used anywhere else, remove it completely.
This remove the following warning:
arch/powerpc/kernel/process.c:1156:17: error: 'dummy_copy_buffer' defined but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tobin C. Harding [Tue, 30 Apr 2019 01:09:23 +0000 (11:09 +1000)]
powerpc/cacheinfo: Fix kobject memleak
Currently error return from kobject_init_and_add() is not followed by
a call to kobject_put(). This means there is a memory leak.
Add call to kobject_put() in error path of kobject_init_and_add().
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nick Desaulniers [Tue, 23 Apr 2019 21:11:14 +0000 (14:11 -0700)]
powerpc/vdso: Drop unnecessary cc-ldoption
Towards the goal of removing cc-ldoption, it seems that --hash-style=
was added to binutils 2.17.50.0.2 in 2006. The minimal required
version of binutils for the kernel according to
Documentation/process/changes.rst is 2.20.
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexey Kardashevskiy [Wed, 10 Apr 2019 06:48:00 +0000 (16:48 +1000)]
powerpc/powernv/ioda: Handle failures correctly in pnv_pci_ioda_iommu_bypass_supported()
When the return value type was changed from int to bool, few places
were left unchanged, this fixes them. We did not hit these failures as
the first one is not happening at all and the second one is little
more likely to happen if the user switches a 33..58bit DMA capable
device between the VFIO and vendor drivers and there are not so many
of these.
Fixes: 2d6ad41b2c21 ("powerpc/powernv: use the generic iommu bypass code")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Tue, 30 Apr 2019 12:52:03 +0000 (22:52 +1000)]
Merge branch 'topic/ppc-kvm' into next
Merge our topic branch shared with KVM. In particular this includes the
rewrite of the idle code into C.
Michael Ellerman [Tue, 30 Apr 2019 04:28:17 +0000 (14:28 +1000)]
powerpc/powernv/idle: Restore AMR/UAMOR/AMOR/IAMR after idle
This is an implementation of commits
53a712bae5dd
("powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle") and
a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR after idle") using
the new C-based idle code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Extract from Nick's patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Fri, 12 Apr 2019 14:30:52 +0000 (00:30 +1000)]
powerpc/64s: Reimplement book3s idle code in C
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code.
Book3S assembly stubs are kept in common code and used only to save
the stack frame and non-volatile GPRs before executing architected
idle instructions, and restoring the stack and reloading GPRs then
returning to C after waking from idle.
The complex logic dealing with threads and subcores, locking, SPRs,
HMIs, timebase resync, etc., is all done in C which makes it more
maintainable.
This is not a strict translation to C code, there are some
significant differences:
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them itself.
- The optimisation where EC=ESL=0 idle modes did not have to save GPRs
or change MSR is restored, because it's now simple to do. ESL=1
sleeps that do not lose GPRs can use this optimization too.
- KVM secondary entry and cede is now more of a call/return style
rather than branchy. nap_state_lost is not required because KVM
always returns via NVGPR restoring path.
- KVM secondary wakeup from offline sequence is moved entirely into
the offline wakeup, which avoids a hwsync in the normal idle wakeup
path.
Performance measured with context switch ping-pong on different
threads or cores, is possibly improved a small amount, 1-3% depending
on stop state and core vs thread test for shallow states. Deep states
it's in the noise compared with other latencies.
KVM improvements:
- Idle sleepers now always return to caller rather than branch out
to KVM first.
- This allows optimisations like very fast return to caller when no
state has been lost.
- KVM no longer requires nap_state_lost because it controls NVGPR
save/restore itself on the way in and out.
- The heavy idle wakeup KVM request check can be moved out of the
normal host idle code and into the not-performance-critical offline
code.
- KVM nap code now returns from where it is called, which makes the
flow a bit easier to follow.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Squash the KVM changes in]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Tue, 9 Apr 2019 04:40:05 +0000 (14:40 +1000)]
powerpc/watchdog: Use hrtimers for per-CPU heartbeat
Using a jiffies timer creates a dependency on the tick_do_timer_cpu
incrementing jiffies. If that CPU has locked up and jiffies is not
incrementing, the watchdog heartbeat timer for all CPUs stops and
creates false positives and confusing warnings on local CPUs, and
also causes the SMP detector to stop, so the root cause is never
detected.
Fix this by using hrtimer based timers for the watchdog heartbeat,
like the generic kernel hardlockup detector.
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reported-by: Ravikumar Bangoria <ravi.bangoria@in.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reported-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Fontenot [Tue, 2 Oct 2018 15:35:59 +0000 (10:35 -0500)]
powerpc/pseries: Track LMB nid instead of using device tree
When removing memory we need to remove the memory from the node
it was added to instead of looking up the node it should be in
in the device tree.
During testing we have seen scenarios where the affinity for a
LMB changes due to a partition migration or PRRN event. In these
cases the node the LMB exists in may not match the node the device
tree indicates it belongs in. This can lead to a system crash
when trying to DLPAR remove the LMB after a migration or PRRN
event. The current code looks up the node in the device tree to
remove the LMB from, the crash occurs when we try to offline this
node and it does not have any data, i.e. node_data[nid] == NULL.
36:mon> e
cpu 0x36: Vector: 300 (Data Access) at [
c0000001828b7810]
pc:
c00000000036d08c: try_offline_node+0x2c/0x1b0
lr:
c0000000003a14ec: remove_memory+0xbc/0x110
sp:
c0000001828b7a90
msr:
800000000280b033
dar: 9a28
dsisr:
40000000
current = 0xc0000006329c4c80
paca = 0xc000000007a55200 softe: 0 irq_happened: 0x01
pid = 76926, comm = kworker/u320:3
36:mon> t
[link register ]
c0000000003a14ec remove_memory+0xbc/0x110
[
c0000001828b7a90]
c00000000006a1cc arch_remove_memory+0x9c/0xd0 (unreliable)
[
c0000001828b7ad0]
c0000000003a14e0 remove_memory+0xb0/0x110
[
c0000001828b7b20]
c0000000000c7db4 dlpar_remove_lmb+0x94/0x160
[
c0000001828b7b60]
c0000000000c8ef8 dlpar_memory+0x7e8/0xd10
[
c0000001828b7bf0]
c0000000000bf828 handle_dlpar_errorlog+0xf8/0x160
[
c0000001828b7c60]
c0000000000bf8cc pseries_hp_work_fn+0x3c/0xa0
[
c0000001828b7c90]
c000000000128cd8 process_one_work+0x298/0x5a0
[
c0000001828b7d20]
c000000000129068 worker_thread+0x88/0x620
[
c0000001828b7dc0]
c00000000013223c kthread+0x1ac/0x1c0
[
c0000001828b7e30]
c00000000000b45c ret_from_kernel_thread+0x5c/0x80
To resolve this we need to track the node a LMB belongs to when
it is added to the system so we can remove it from that node instead
of the node that the device tree indicates it should belong to.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Colin Ian King [Tue, 23 Apr 2019 15:10:17 +0000 (16:10 +0100)]
powerpc/mm: fix spelling mistake "Outisde" -> "Outside"
There are several identical spelling mistakes in warning messages,
fix these.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Sat, 30 Mar 2019 05:43:45 +0000 (11:13 +0530)]
powerpc/mm: Fix section mismatch warning
This patch fix the below section mismatch warnings.
WARNING: vmlinux.o(.text+0x2d1f44): Section mismatch in reference from the function devm_memremap_pages_release() to the function .meminit.text:arch_remove_memory()
WARNING: vmlinux.o(.text+0x2d265c): Section mismatch in reference from the function devm_memremap_pages() to the function .meminit.text:arch_add_memory()
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:19 +0000 (18:29 +0530)]
powerpc/mm/hash: Rename KERNEL_REGION_ID to LINEAR_MAP_REGION_ID
The region actually point to linear map. Rename the #define to
clarify thati.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:18 +0000 (18:29 +0530)]
powerpc/mm: Print kernel map details to dmesg
This helps in debugging. We can look at the dmesg to find out
different kernel mapping details.
On 4K config this shows
kernel vmalloc start = 0xc000100000000000
kernel IO start = 0xc000200000000000
kernel vmemmap start = 0xc000300000000000
On 64K config:
kernel vmalloc start = 0xc008000000000000
kernel IO start = 0xc00a000000000000
kernel vmemmap start = 0xc00c000000000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:17 +0000 (18:29 +0530)]
powerpc/mm/hash: Simplify the region id calculation.
This reduces multiple comparisons in get_region_id to a bit shift operation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:16 +0000 (18:29 +0530)]
powerpc/mm: Drop the unnecessary region check
All the regions are now mapped with top nibble 0xc. Hence the region id
check is not needed for virt_addr_valid()
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:15 +0000 (18:29 +0530)]
powerpc/mm: Validate address values against different region limits
This adds an explicit check in various functions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:14 +0000 (18:29 +0530)]
powerpc/mm/hash64: Map all the kernel regions in the same 0xc range
This patch maps vmalloc, IO and vmemap regions in the 0xc address range
instead of the current 0xd and 0xf range. This brings the mapping closer
to radix translation mode.
With hash 64K page size each of this region is 512TB whereas with 4K config
we are limited by the max page table range of 64TB and hence there regions
are of 16TB size.
The kernel mapping is now:
On 4K hash
kernel_region_map_size = 16TB
kernel vmalloc start = 0xc000100000000000
kernel IO start = 0xc000200000000000
kernel vmemmap start = 0xc000300000000000
64K hash, 64K radix and 4k radix:
kernel_region_map_size = 512TB
kernel vmalloc start = 0xc008000000000000
kernel IO start = 0xc00a000000000000
kernel vmemmap start = 0xc00c000000000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 12:59:13 +0000 (18:29 +0530)]
powerpc/mm/hash64: Add a variable to track the end of IO mapping
This makes it easy to update the region mapping in the later patch
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 13:03:51 +0000 (18:33 +0530)]
powerc/mm/hash: Reduce hash_mm_context size
Allocate subpage protect related variables only if we use the feature.
This helps in reducing the hash related mm context struct by around 4K
Before the patch
sizeof(struct hash_mm_context) = 8288
After the patch
sizeof(struct hash_mm_context) = 4160
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 13:03:50 +0000 (18:33 +0530)]
powerpc/mm: Reduce memory usage for mm_context_t for radix
Currently, our mm_context_t on book3s64 include all hash specific
context details like slice mask and subpage protection details. We
can skip allocating these with radix translation. This will help us to save
8K per mm_context with radix translation.
With the patch applied we have
sizeof(mm_context_t) = 136
sizeof(struct hash_mm_context) = 8288
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 13:03:49 +0000 (18:33 +0530)]
powerpc/mm: Move slb_addr_linit to early_init_mmu
Avoid #ifdef in generic code. Also enables us to do this specific to
MMU translation mode on book3s64
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 13:03:48 +0000 (18:33 +0530)]
powerpc/mm: Add helpers for accessing hash translation related variables
We want to switch to allocating them runtime only when hash translation is
enabled. Add helpers so that both book3s and nohash can be adapted to
upcoming change easily.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 17 Apr 2019 13:03:47 +0000 (18:33 +0530)]
powerpc/mm: Remove PPC_MM_SLICES #ifdef for book3s64
Book3s64 always have PPC_MM_SLICES enabled. So remove the unncessary #ifdef
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Wed, 3 Apr 2019 06:05:14 +0000 (11:35 +0530)]
powerpc/mm: Fix build error with FLATMEM book3s64 config
The current value of MAX_PHYSMEM_BITS cannot work with 32 bit configs.
We used to have MAX_PHYSMEM_BITS not defined without SPARSEMEM and 32
bit configs never expected a value to be set for MAX_PHYSMEM_BITS.
Dependent code such as zsmalloc derived the right values based on other
fields. Instead of finding a value that works with different configs,
use new values only for book3s_64. For 64 bit booke, use the definition
of MAX_PHYSMEM_BITS as per commit
a7df61a0e2b6 ("[PATCH] ppc64: Increase sparsemem defaults")
That change was done in 2005 and hopefully will work with book3e 64.
Fixes: 8bc086899816 ("powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:38 +0000 (08:30 +0000)]
powerpc/32s: Implement Kernel Userspace Access Protection
This patch implements Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.
The previous patch modifies the page protection so that RW user
pages are RW for Key 0 and RO for Key 1, and it sets Key 0 for
both user and kernel.
This patch changes userspace segment registers are set to Ku 0
and Ks 1. When kernel needs to write to RW pages, the associated
segment register is then changed to Ks 0 in order to allow write
access to the kernel.
In order to avoid having the read all segment registers when
locking/unlocking the access, some data is kept in the thread_struct
and saved on stack on exceptions. The field identifies both the
first unlocked segment and the first segment following the last
unlocked one. When no segment is unlocked, it contains value 0.
As the hash_page() function is not able to easily determine if a
protfault is due to a bad kernel access to userspace, protfaults
need to be handled by handle_page_fault when KUAP is set.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h,
and adapt allow_user_access() to do nothing when to == NULL]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:36 +0000 (08:30 +0000)]
powerpc/32s: Prepare Kernel Userspace Access Protection
This patch prepares Kernel Userspace Access Protection for
book3s/32.
Due to limitations of the processor page protection capabilities,
the protection is only against writing. read protection cannot be
achieved using page protection.
book3s/32 provides the following values for PP bits:
PP00 provides RW for Key 0 and NA for Key 1
PP01 provides RW for Key 0 and RO for Key 1
PP10 provides RW for all
PP11 provides RO for all
Today PP10 is used for RW pages and PP11 for RO pages, and user
segment register's Kp and Ks are set to 1. This patch modifies
page protection to use PP01 for RW pages and sets user segment
registers to Kp 0 and Ks 0.
This will allow to setup Userspace write access protection by
settng Ks to 1 in the following patch.
Kernel space segment registers remain unchanged.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:35 +0000 (08:30 +0000)]
powerpc/32s: Implement Kernel Userspace Execution Prevention.
To implement Kernel Userspace Execution Prevention, this patch
sets NX bit on all user segments on kernel entry and clears NX bit
on all user segments on kernel exit.
Note that powerpc 601 doesn't have the NX bit, so KUEP will not
work on it. A warning is displayed at startup.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:34 +0000 (08:30 +0000)]
powerpc/8xx: Add Kernel Userspace Access Protection
This patch adds Kernel Userspace Access Protection on the 8xx.
When a page is RO or RW, it is set RO or RW for Key 0 and NA
for Key 1.
Up to now, the User group is defined with Key 0 for both User and
Supervisor.
By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to access user data.
At exception entry, the kernel saves SPRN_MD_AP in the regs struct,
and reapply the protection. At exception exit it restores SPRN_MD_AP
with the value saved on exception entry.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Drop allow_read/write_to/from_user() as they're now in kup.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:33 +0000 (08:30 +0000)]
powerpc/8xx: Add Kernel Userspace Execution Prevention
This patch adds Kernel Userspace Execution Prevention on the 8xx.
When a page is Executable, it is set Executable for Key 0 and NX
for Key 1.
Up to now, the User group is defined with Key 0 for both User and
Supervisor.
By changing the group to Key 0 for User and Key 1 for Supervisor,
this patch prevents the Kernel from being able to execute user code.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:32 +0000 (08:30 +0000)]
powerpc/8xx: Only define APG0 and APG1
Since the 8xx implements hardware page table walk assistance,
the PGD entries always point to a 4k aligned page, so the 2 upper
bits of the APG are not clobbered anymore and remain 0. Therefore
only APG0 and APG1 are used and need a definition. We set the
other APG to the lowest permission level.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:31 +0000 (08:30 +0000)]
powerpc/32: Prepare for Kernel Userspace Access Protection
This patch adds ASM macros for saving, restoring and checking
the KUAP state, and modifies setup_32 to call them on exceptions
from kernel.
The macros are defined as empty by default for when CONFIG_PPC_KUAP
is not selected and/or for platforms which don't handle (yet) KUAP.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Mon, 11 Mar 2019 08:30:30 +0000 (08:30 +0000)]
powerpc/32: Remove MSR_PR test when returning from syscall
syscalls are from user only, so we can account time without checking
whether returning to kernel or user as it will only be user.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Thu, 18 Apr 2019 06:51:25 +0000 (16:51 +1000)]
powerpc/mm: Detect bad KUAP faults
When KUAP is enabled we have logic to detect page faults that occur
outside of a valid user access region and are blocked by the AMR.
What we don't have at the moment is logic to detect a fault *within* a
valid user access region, that has been incorrectly blocked by AMR.
This is not meant to ever happen, but it can if we incorrectly
save/restore the AMR, or if the AMR was overwritten for some other
reason.
Currently if that happens we assume it's just a regular fault that
will be corrected by handling the fault normally, so we just return.
But there is nothing the fault handling code can do to fix it, so the
fault just happens again and we spin forever, leading to soft lockups.
So add some logic to detect that case and WARN() if we ever see it.
Arguably it should be a BUG(), but it's more polite to fail the access
and let the kernel continue, rather than taking down the box. There
should be no data integrity issue with failing the fault rather than
BUG'ing, as we're just going to disallow an access that should have
been allowed.
To make the code a little easier to follow, unroll the condition at
the end of bad_kernel_fault() and comment each case, before adding the
call to bad_kuap_fault().
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Thu, 18 Apr 2019 06:51:24 +0000 (16:51 +1000)]
powerpc/64s: Implement KUAP for Radix MMU
Kernel Userspace Access Prevention utilises a feature of the Radix MMU
which disallows read and write access to userspace addresses. By
utilising this, the kernel is prevented from accessing user data from
outside of trusted paths that perform proper safety checks, such as
copy_{to/from}_user() and friends.
Userspace access is disabled from early boot and is only enabled when
performing an operation like copy_{to/from}_user(). The register that
controls this (AMR) does not prevent userspace from accessing itself,
so there is no need to save and restore when entering and exiting
userspace.
When entering the kernel from the kernel we save AMR and if it is not
blocking user access (because eg. we faulted doing a user access) we
reblock user access for the duration of the exception (ie. the page
fault) and then restore the AMR when returning back to the kernel.
This feature can be tested by using the lkdtm driver (CONFIG_LKDTM=y)
and performing the following:
# (echo ACCESS_USERSPACE) > [debugfs]/provoke-crash/DIRECT
If enabled, this should send SIGSEGV to the thread.
We also add paranoid checking of AMR in switch and syscall return
under CONFIG_PPC_KUAP_DEBUG.
Co-authored-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Thu, 18 Apr 2019 06:51:23 +0000 (16:51 +1000)]
powerpc/lib: Refactor __patch_instruction() to use __put_user_asm()
__patch_instruction() is called in early boot, and uses
__put_user_size(), which includes the allow/prevent calls to enforce
KUAP, which could either be called too early, or in the Radix case,
forced to use "early_" versions of functions just to safely handle
this one case.
__put_user_asm() does not do this, and thus is safe to use both in
early boot, and later on since in this case it should only ever be
touching kernel memory.
__patch_instruction() was previously refactored to use
__put_user_size() in order to be able to return -EFAULT, which would
allow the kernel to patch instructions in userspace, which should
never happen. This has the functional change of causing faults on
userspace addresses if KUAP is turned on, which should never happen in
practice.
A future enhancement could be to double check the patch address is
definitely allowed to be tampered with by the kernel.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Thu, 18 Apr 2019 06:51:22 +0000 (16:51 +1000)]
powerpc/mm/radix: Use KUEP API for Radix MMU
Execution protection already exists on radix, this just refactors
the radix init to provide the KUEP setup function instead.
Thus, the only functional change is that it can now be disabled.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Thu, 18 Apr 2019 06:51:21 +0000 (16:51 +1000)]
powerpc/64: Setup KUP on secondary CPUs
Some platforms (i.e. Radix MMU) need per-CPU initialisation for KUP.
Any platforms that only want to do KUP initialisation once
globally can just check to see if they're running on the boot CPU, or
check if whatever setup they need has already been performed.
Note that this is only for 64-bit.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 18 Apr 2019 06:51:20 +0000 (16:51 +1000)]
powerpc: Add a framework for Kernel Userspace Access Protection
This patch implements a framework for Kernel Userspace Access
Protection.
Then subarches will have the possibility to provide their own
implementation by providing setup_kuap() and
allow/prevent_user_access().
Some platforms will need to know the area accessed and whether it is
accessed from read, write or both. Therefore source, destination and
size and handed over to the two functions.
mpe: Rename to allow/prevent rather than unlock/lock, and add
read/write wrappers. Drop the 32-bit code for now until we have an
implementation for it. Add kuap to pt_regs for 64-bit as well as
32-bit. Don't split strings, use pr_crit_ratelimited().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 18 Apr 2019 06:51:19 +0000 (16:51 +1000)]
powerpc: Add skeleton for Kernel Userspace Execution Prevention
This patch adds a skeleton for Kernel Userspace Execution Prevention.
Then subarches implementing it have to define CONFIG_PPC_HAVE_KUEP
and provide setup_kuep() function.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Don't split strings, use pr_crit_ratelimited()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Christophe Leroy [Thu, 18 Apr 2019 06:51:18 +0000 (16:51 +1000)]
powerpc: Add framework for Kernel Userspace Protection
This patch adds a skeleton for Kernel Userspace Protection
functionnalities like Kernel Userspace Access Protection and Kernel
Userspace Execution Prevention
The subsequent implementation of KUAP for radix makes use of a MMU
feature in order to patch out assembly when KUAP is disabled or
unsupported. This won't work unless there's an entry point for KUP
support before the feature magic happens, so for PPC64 setup_kup() is
called early in setup.
On PPC32, feature_fixup() is done too early to allow the same.
Suggested-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman [Thu, 18 Apr 2019 06:51:17 +0000 (16:51 +1000)]
powerpc/powernv/idle: Restore AMR/UAMOR/AMOR after idle
In order to implement KUAP (Kernel Userspace Access Protection) on
Power9 we will be using the AMR, and therefore indirectly the
UAMOR/AMOR.
So save/restore these regs in the idle code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Russell Currey [Thu, 18 Apr 2019 06:51:16 +0000 (16:51 +1000)]
powerpc/powernv/idle: Restore IAMR after idle
Without restoring the IAMR after idle, execution prevention on POWER9
with Radix MMU is overwritten and the kernel can freely execute
userspace without faulting.
This is necessary when returning from any stop state that modifies
user state, as well as hypervisor state.
To test how this fails without this patch, load the lkdtm driver and
do the following:
$ echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
which won't fault, then boot the kernel with powersave=off, where it
will fault. Applying this patch will fix this.
Fixes: 3b10d0095a1e ("powerpc/mm/radix: Prevent kernel execution of user space")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Neuling [Mon, 1 Apr 2019 06:03:12 +0000 (17:03 +1100)]
powerpc: Add force enable of DAWR on P9 option
This adds a flag so that the DAWR can be enabled on P9 via:
echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
The DAWR was previously force disabled on POWER9 in:
9654153158 powerpc: Disable DAWR in the base POWER9 CPU features
Also see Documentation/powerpc/DAWR-POWER9.txt
This is a dangerous setting, USE AT YOUR OWN RISK.
Some users may not care about a bad user crashing their box
(ie. single user/desktop systems) and really want the DAWR. This
allows them to force enable DAWR.
This flag can also be used to disable DAWR access. Once this is
cleared, all DAWR access should be cleared immediately and your
machine once again safe from crashing.
Userspace may get confused by toggling this. If DAWR is force
enabled/disabled between getting the number of breakpoints (via
PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
inconsistent view of what's available. Similarly for guests.
For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
enabled in the host AND the guest. For this reason, this won't work on
POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
dawr_enable_dangerous file will fail if the hypervisor doesn't support
writing the DAWR.
To double check the DAWR is working, run this kernel selftest:
tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
Any errors/failures/skips mean something is wrong.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Lynch [Thu, 18 Apr 2019 18:56:58 +0000 (13:56 -0500)]
powerpc/numa: document topology_updates_enabled, disable by default
Changing the NUMA associations for CPUs and memory at runtime is
basically unsupported by the core mm, scheduler etc. We see all manner
of crashes, warnings and instability when the pseries code tries to do
this. Disable this behavior by default, and document the switch a bit.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nathan Lynch [Thu, 18 Apr 2019 18:56:57 +0000 (13:56 -0500)]
powerpc/numa: improve control of topology updates
When booted with "topology_updates=no", or when "off" is written to
/proc/powerpc/topology_updates, NUMA reassignments are inhibited for
PRRN and VPHN events. However, migration and suspend unconditionally
re-enable reassignments via start_topology_update(). This is
incoherent.
Check the topology_updates_enabled flag in
start/stop_topology_update() so that callers of those APIs need not be
aware of whether reassignments are enabled. This allows the
administrative decision on reassignments to remain in force across
migrations and suspensions.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Andrew Donnellan [Thu, 14 Mar 2019 04:27:27 +0000 (15:27 +1100)]
powerpc/powernv: Squash sparse warnings in opal-call.c
sparse complains a lot about opal-call.c:
arch/powerpc/platforms/powernv/opal-call.c:128:1: warning: symbol 'opal_invalid_call' was not declared. Should it be static?
arch/powerpc/platforms/powernv/opal-call.c:129:1: warning: symbol 'opal_console_write' was not declared. Should it be static?
arch/powerpc/platforms/powernv/opal-call.c:130:1: warning: symbol 'opal_console_read' was not declared. Should it be static?
Those symbols are forward declared in opal.h, but we can't include that
because the function signatures in opal.h are different. So instead, just
add an extra forward declaration to the OPAL_CALL macro to shut sparse up.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
George Spelvin [Thu, 21 Mar 2019 10:42:22 +0000 (10:42 +0000)]
powerpc/crypto: Use cheaper random numbers for crc-vpmsum self-test
This code was filling a 64K buffer from /dev/urandom in order to
compute a CRC over (on average half of) it by two different methods,
comparing the CRCs, and repeating.
This is not a remotely security-critical application, so use the far
faster and cheaper prandom_u32() generator.
And, while we're at it, only fill as much of the buffer as we plan to use.
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jagadeesh Pagadala [Sat, 23 Mar 2019 12:50:55 +0000 (18:20 +0530)]
powerpc: Remove duplicate headers
Remove duplicate headers inclusions.
Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Wen Yang [Tue, 26 Mar 2019 10:29:51 +0000 (18:29 +0800)]
powerpc/8xx: Fix possible device node reference leak
The call to of_find_compatible_node() returns a node pointer with
refcount incremented thus it must be explicitly decremented after the
last usage.
irq_domain_add_linear() also calls of_node_get() to increase refcount,
so irq_domain() will not be affected when it is released.
Detected by coccinelle.
Fixes: a8db8cf0d894 ("irq_domain: Replace irq_alloc_host() with revmap-specific initializers")
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Peng Hao <peng.hao2@zte.com.cn>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ganesh Goudar [Mon, 15 Apr 2019 10:05:44 +0000 (15:35 +0530)]
powerpc/pseries: hwpoison the pages upon hitting UE
Add support to hwpoison the pages upon hitting machine check
exception.
This patch queues the address where UE is hit to percpu array
and schedules work to plumb it into memory poison infrastructure.
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Combine #ifdefs, drop PPC_BIT8(), and empty inline stub]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Julia Lawall [Sat, 23 Feb 2019 13:20:34 +0000 (14:20 +0100)]
powerpc/83xx: Add missing of_node_put() after of_device_is_available()
Add an of_node_put() when a tested device node is not available.
Fixes: c026c98739c7e ("powerpc/83xx: Do not configure or probe disabled FSL DR USB controllers")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lukas Bulwahn [Thu, 11 Apr 2019 04:27:52 +0000 (06:27 +0200)]
MAINTAINERS: Update remaining @linux.vnet.ibm.com addresses
Paul McKenney attempted to update all email addresses @linux.vnet.ibm.com
to @linux.ibm.com in commit
1dfddcdb95c4
("MAINTAINERS: Update from @linux.vnet.ibm.com to @linux.ibm.com"), but
some still remained.
We update the remaining email addresses in MAINTAINERS, hopefully finally
catching all cases for good.
Fixes: 1dfddcdb95c4 ("MAINTAINERS: Update from @linux.vnet.ibm.com to @linux.ibm.com")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Sukadev Bhattiprolu [Thu, 11 Apr 2019 01:38:46 +0000 (18:38 -0700)]
MAINTAINERS: Remove non-existent VAS file
The file arch/powerpc/include/uapi/asm/vas.h was considered but
never merged and should be removed from the MAINTAINERS file.
While here, add missing email address.
Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Qian Cai [Sun, 7 Apr 2019 01:54:47 +0000 (21:54 -0400)]
powerpc/pseries/pmem: Fix a set but not used value
The commit
4c5d87db4978 ("powerpc/pseries: PAPR persistent memory
support") set a local variable "count" in dlpar_hp_pmem() but never
use it.
arch/powerpc/platforms/pseries/pmem.c: In function 'dlpar_hp_pmem':
arch/powerpc/platforms/pseries/pmem.c:109:6: warning: variable 'count' set but not used
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Qian Cai [Sun, 7 Apr 2019 02:48:08 +0000 (22:48 -0400)]
powerpc/pseries/iommu: Fix set but not used values
The commit
b7d6bf4fdd47 ("powerpc/pseries/pci: Remove obsolete SW
invalidate") left 2 variables unused.
arch/powerpc/platforms/pseries/iommu.c:108:17: warning: variable 'tces' set but not used
__be64 *tcep, *tces;
^~~~
arch/powerpc/platforms/pseries/iommu.c:132:17: warning: variable 'tces' set but not used
__be64 *tcep, *tces;
^~~~
Also, the commit
68c0449ea16d ("powerpc/pseries/iommu: Use memory@
nodes in max RAM address calculation") set "ranges" in
ddw_memory_hotplug_max() but never use it.
arch/powerpc/platforms/pseries/iommu.c: In function 'ddw_memory_hotplug_max':
arch/powerpc/platforms/pseries/iommu.c:948:7: warning: variable 'ranges' set but not used
int ranges, n_mem_addr_cells, n_mem_size_cells, len;
^~~~~~
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Qian Cai [Thu, 7 Mar 2019 14:40:31 +0000 (09:40 -0500)]
powerpc/mm: Silence unused-but-set-variable warnings
pte_unmap() compiles away on some powerpc platforms, so silence the
warnings below by making it a static inline function.
mm/memory.c: In function 'copy_pte_range':
mm/memory.c:820:24: warning: variable 'orig_dst_pte' set but not used
mm/memory.c:820:9: warning: variable 'orig_src_pte' set but not used
mm/madvise.c: In function 'madvise_free_pte_range':
mm/madvise.c:318:9: warning: variable 'orig_pte' set but not used
mm/swap_state.c: In function 'swap_ra_info':
mm/swap_state.c:634:15: warning: variable 'orig_pte' set but not used
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Laurent Vivier [Wed, 13 Mar 2019 10:25:28 +0000 (11:25 +0100)]
powerpc/mm: move warning from resize_hpt_for_hotplug()
resize_hpt_for_hotplug() reports a warning when it cannot
resize the hash page table ("Unable to resize hash page
table to target order") but in some cases it's not a problem
and can make user thinks something has not worked properly.
This patch moves the warning to arch_remove_memory() to
only report the problem when it is needed.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Aneesh Kumar K.V [Tue, 9 Apr 2019 04:03:28 +0000 (09:33 +0530)]
powerpc/mm/radix: Don't do SLB preload when using the radix MMU
Add radix_enabled() check to avoid SLB preload with radix translation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>