A warning message during SRIOV multicast cleanup should have actually been
a debug level message. The condition generating the warning does no harm
and can fill the message log.
In some cases, during testing, some tests were so intense as to swamp the
message log with these warning messages, causing a stall in the console
message log output task. This stall caused an NMI to be sent to all CPUs
(so that they all dumped their stacks into the message log).
Aside from the message flood causing an NMI, the tests all passed.
Once the message flood which caused the NMI is removed (by reducing the
warning message to debug level), the NMI no longer occurs.
Sample message log (console log) output illustrating the flood and
resultant NMI (snippets with comments and modified with ... instead
of hex digits, to satisfy checkpatch.pl):
<mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
*** About 4000 almost identical lines in less than one second ***
<mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (...)
*** { 17} above indicates that CPU 17 was the one that stalled ***
sending NMI to all CPUs:
...
NMI backtrace for cpu 17
CPU: 17 PID: 45909 Comm: kworker/17:2
Hardware name: HP ProLiant DL360p Gen8, BIOS P71 09/08/2013
Workqueue: events fb_flashcursor
task:
ffff880478...... ti:
ffff88064e...... task.ti:
ffff88064e......
RIP: 0010:[
ffffffff81......] [
ffffffff81......] io_serial_in+0x15/0x20
RSP: 0018:
ffff88064e257cb0 EFLAGS:
00000002
RAX:
0000000000...... RBX:
ffffffff81...... RCX:
0000000000......
RDX:
0000000000...... RSI:
0000000000...... RDI:
ffffffff81......
RBP:
ffff88064e...... R08:
ffffffff81...... R09:
0000000000......
R10:
0000000000...... R11:
ffff88064e...... R12:
0000000000......
R13:
0000000000...... R14:
ffffffff81...... R15:
0000000000......
FS:
0000000000......(0000) GS:
ffff8804af......(0000) knlGS:
000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080......
CR2:
00007f2a2f...... CR3:
0000000001...... CR4:
0000000000......
DR0:
0000000000...... DR1:
0000000000...... DR2:
0000000000......
DR3:
0000000000...... DR6:
00000000ff...... DR7:
0000000000......
Stack:
ffff88064e......
ffffffff81......
ffffffff81......
0000000000......
ffffffff81......
ffff88064e......
ffffffff81......
ffffffff81......
ffffffff81......
ffff88064e......
ffffffff81......
0000000000......
Call Trace:
[<
ffffffff813d099b>] wait_for_xmitr+0x3b/0xa0
[<
ffffffff813d0b5c>] serial8250_console_putchar+0x1c/0x30
[<
ffffffff813d0b40>] ? serial8250_console_write+0x140/0x140
[<
ffffffff813cb5fa>] uart_console_write+0x3a/0x80
[<
ffffffff813d0aae>] serial8250_console_write+0xae/0x140
[<
ffffffff8107c4d1>] call_console_drivers.constprop.15+0x91/0xf0
[<
ffffffff8107d6cf>] console_unlock+0x3bf/0x400
[<
ffffffff813503cd>] fb_flashcursor+0x5d/0x140
[<
ffffffff81355c30>] ? bit_clear+0x120/0x120
[<
ffffffff8109d5fb>] process_one_work+0x17b/0x470
[<
ffffffff8109e3cb>] worker_thread+0x11b/0x400
[<
ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[<
ffffffff810a5aef>] kthread+0xcf/0xe0
[<
ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[<
ffffffff81645858>] ret_from_fork+0x58/0x90
[<
ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Code: 48 89 e5 d3 e6 48 63 f6 48 03 77 10 8b 06 5d c3 66 0f 1f 44 00 00 66 66 66 6
As indicated in the stack trace above, the console output task got swamped.
Fixes: b9c5d6a64358 ("IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV")
Cc: <stable@vger.kernel.org> # v3.6+
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>