The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.
Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.
Without this patch we see:
Severe Machine check interrupt [Recovered]
NIP: [
000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address:
00007fffd2755940
Physical address:
000020181a080000
...
Severe Machine check interrupt [Recovered]
NIP: [
000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address:
00007fffd2755940
Physical address:
000020181a080000
Severe Machine check interrupt [Recovered]
NIP: [
000000001002588c] PID: 7109 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address:
00007fffd2755940
Physical address:
000020181a080000
Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
Memory failure: 0x20181a08: already hardware poisoned
...
Watchdog CPU:38 Hard LOCKUP
After this patch we see:
Severe Machine check interrupt [Not recovered]
NIP: [
00007fffaae585f4] PID: 7168 Comm: find
Initiator: CPU
Error type: UE [Load/Store]
Effective address:
00007fffaafe28ac
Physical address:
00002017c0bd0000
find[7168]: unhandled signal 7 at
00007fffaae585f4 nip
00007fffaae585f4 lr
00007fffaae585e0 code 4
Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
if (pfn != ULONG_MAX) {
*phys_addr =
(pfn << PAGE_SHIFT);
- handled = 1;
}
}
}
* kernel/exception-64s.h
*/
if (get_paca()->in_mce < MAX_MCE_DEPTH)
- if (!mce_find_instr_ea_and_pfn(regs, addr,
- phys_addr))
- handled = 1;
+ mce_find_instr_ea_and_pfn(regs, addr, phys_addr);
}
found = 1;
}
const struct mce_ierror_table itable[])
{
struct mce_error_info mce_err = { 0 };
- uint64_t addr, phys_addr;
+ uint64_t addr, phys_addr = ULONG_MAX;
uint64_t srr1 = regs->msr;
long handled;