mvebu: PCI: aardvark: Implement workaround for PCIe Completion Timeout
Turris MOX randomly crashes when a MediaTek MT7915 miniPCIe card is
connected, with the following output:
[ 71.457007] Internal error: synchronous external abort: 96000210 [#1] SMP
[ 71.464021] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat ath9k xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIREl
[ 71.464187] btintel br_netfilter bnep bluetooth ath9k_hw ath10k_pci ath10k_core ath sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_mg
[ 71.629589] CPU: 0 PID: 1298 Comm: kworker/u5:3 Not tainted 5.4.114 #0
[ 71.636319] Hardware name: CZ.NIC Turris Mox Board (DT)
[ 71.641725] Workqueue: napi_workq napi_workfn
[ 71.646221] pstate: 80400085 (Nzcv daIf +PAN -UAO)
[ 71.651169] pc : mt76_set_irq_mask+0x118/0x150 [mt76]
[ 71.656385] lr : mt7915_init_debugfs+0x358/0x368 [mt7915e]
[ 71.662038] sp : ffffffc010003cd0
[ 71.665451] x29: ffffffc010003cd0 x28: 0000000000000060
[ 71.670929] x27: ffffffc010a56f98 x26: ffffffc010c0fa9a
[ 71.676407] x25: ffffffc010ba8788 x24: ffffff803e01fe00
[ 71.681885] x23: 0000000000000030 x22: ffffffc010003dc4
[ 71.687361] x21: 0000000000000000 x20: ffffff803e01fea4
[ 71.692839] x19: ffffff803cb725c0 x18: 000000002d660780
[ 71.698317] x17: 0000000000000000 x16: 0000000000000001
[ 71.703795] x15: 0000000000005ee0 x14: ffffffc010d1d000
[ 71.709272] x13: 0000000000002f70 x12: 0000000000000000
[ 71.714749] x11: 0000000000000000 x10: 0000000000000040
[ 71.720226] x9 : ffffffc010bbe980 x8 : ffffffc010bbe978
[ 71.725704] x7 : ffffff803e4003f0 x6 : 0000000000000000
[ 71.731181] x5 : ffffffc02f240000 x4 : ffffffc010003e00
[ 71.736658] x3 : 0000000000000000 x2 : ffffffc008e3f230
[ 71.742135] x1 : 00000000000d7010 x0 : ffffffc0114d7010
[ 71.747613] Call trace:
[ 71.750137] mt76_set_irq_mask+0x118/0x150 [mt76]
[ 71.754990] mt7915_dual_hif_set_irq_mask+0x108/0xdc0 [mt7915e]
[ 71.761098] __handle_irq_event_percpu+0x6c/0x170
[ 71.765950] handle_irq_event_percpu+0x34/0x88
[ 71.770531] handle_irq_event+0x40/0xb0
[ 71.774486] handle_level_irq+0xe0/0x170
[ 71.778530] generic_handle_irq+0x24/0x38
[ 71.782667] advk_pcie_irq_handler+0x11c/0x238
[ 71.787249] __handle_irq_event_percpu+0x6c/0x170
[ 71.792099] handle_irq_event_percpu+0x34/0x88
[ 71.796680] handle_irq_event+0x40/0xb0
[ 71.800633] handle_fasteoi_irq+0xdc/0x190
[ 71.804855] generic_handle_irq+0x24/0x38
[ 71.808988] __handle_domain_irq+0x60/0xb8
[ 71.813213] gic_handle_irq+0x8c/0x198
[ 71.817077] el1_irq+0xf0/0x1c0
[ 71.820314] el1_da+0xc/0xc0
[ 71.823288] mt76_set_irq_mask+0x118/0x150 [mt76]
[ 71.828141] mt7915_mac_tx_free+0x4c4/0x828 [mt7915e]
[ 71.833352] mt7915_queue_rx_skb+0x5c/0xa8 [mt7915e]
[ 71.838473] mt76_dma_cleanup+0x89c/0x1248 [mt76]
[ 71.843329] __napi_poll+0x38/0xf8
[ 71.846835] napi_workfn+0x58/0xb0
[ 71.850342] process_one_work+0x1fc/0x390
[ 71.854475] worker_thread+0x48/0x4d0
[ 71.858252] kthread+0x120/0x128
[ 71.861581] ret_from_fork+0x10/0x1c
[ 71.865273] Code: 52800000 d65f03c0 f9562c00 8b214000 (b9400000)
[ 71.871560] ---[ end trace 1d4e29987011411b ]---
[ 71.876320] Kernel panic - not syncing: Fatal exception in interrupt
[ 71.882875] SMP: stopping secondary CPUs
[ 71.886923] Kernel Offset: disabled
[ 71.890519] CPU features: 0x0002,00002008
[ 71.894649] Memory Limit: none
[ 71.897799] Rebooting in 3 seconds..
Patch is awaiting upstream merge:
https://lore.kernel.org/linux-pci/20220802123816.21817-1-pali@kernel.org/T/#u
The issue was also discussed on the linux-pci mailing list, where a
Marvell employee responded regarding A3720 PCIe erratum 3.12. That
response seems to provide further details relevant to this issue:
https://lore.kernel.org/linux-pci/BN9PR18MB425154FE5019DCAF2028A1D5DB8D9@BN9PR18MB4251.namprd18.prod.outlook.com/t/#u
Reported-by: Ondřej Caletka <ondrej@caletka.cz> [Turris MOX]
Signed-off-by: Josef Schlehofer <pepe.schlehofer@gmail.com>
Reviewed-by: Robert Marko <robimarko@gmail.com>