Seiji Aguchi [Fri, 28 Jun 2013 18:02:11 +0000 (14:02 -0400)]
x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()
Reschedule vector tracepoints may be called in cpu idle state.
This causes lockdep check warning below.
The tracepoint requires rcu but for accuracy it also
requires irq_enter() (tracepoints record the irq context), thus,
the tracepoint interrupt handler should be calling irq_enter()
and not rcu_irq_enter() (irq_enter() calls rcu_irq_enter()).
So, add irq_enter/exit() to smp_trace_reschedule_interrupt()
with common pre/post processing functions, smp_entering_irq()
and exiting_irq() (exiting_irq() calls just irq_exit()
in arch/x86/include/asm/apic.h),
because these can be shared among reschedule, call_function,
and call_function_single vectors.
[ 50.720557] Testing event reschedule_exit:
[ 50.721349]
[ 50.721502] ===============================
[ 50.721835] [ INFO: suspicious RCU usage. ]
[ 50.722169]
3.10.0-rc6-00004-gcf910e8 #190 Not tainted
[ 50.722582] -------------------------------
[ 50.722915] /c/kernel-tests/src/linux/arch/x86/include/asm/trace/irq_vectors.h:50 suspicious rcu_dereference_check() usage!
[ 50.723770]
[ 50.723770] other info that might help us debug this:
[ 50.723770]
[ 50.724385]
[ 50.724385] RCU used illegally from idle CPU!
[ 50.724385] rcu_scheduler_active = 1, debug_locks = 0
[ 50.725232] RCU used illegally from extended quiescent state!
[ 50.725690] no locks held by swapper/0/0.
[ 50.726010]
[ 50.726010] stack backtrace:
[...]
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/51CDCFA3.9080101@hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
H. Peter Anvin [Mon, 24 Jun 2013 18:01:09 +0000 (11:01 -0700)]
Merge remote-tracking branch 'trace/tip/x86/trace' into x86/trace
Fix from Steven Rostedt.
Seiji Aguchi [Sat, 22 Jun 2013 11:33:30 +0000 (07:33 -0400)]
x86/tracing: Add config option checking to the definitions of mce handlers
In case CONFIG_X86_MCE_THRESHOLD and CONFIG_X86_THERMAL_VECTOR
are disabled, kernel build fails as follows.
arch/x86/built-in.o: In function `trace_threshold_interrupt':
(.entry.text+0x122b): undefined reference to `smp_trace_threshold_interrupt'
arch/x86/built-in.o: In function `trace_thermal_interrupt':
(.entry.text+0x132b): undefined reference to `smp_trace_thermal_interrupt'
In this case, trace_threshold_interrupt/trace_thermal_interrupt
are not needed to define.
So, add config option checking to their definitions in entry_64.S.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/51C58B8A.2080808@hds.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steven Rostedt (Red Hat) [Sat, 22 Jun 2013 17:16:19 +0000 (13:16 -0400)]
trace,x86: Do not call local_irq_save() in load_current_idt()
As load_current_idt() is now what is used to update the IDT for the
switches needed for NMI, lockdep debug, and for tracing, it must not
call local_irq_save(). This is because one of the users of this is
lockdep, which does tracing of local_irq_save() and when the debug
trap is hit, we need to update the IDT before tracing interrupts
being disabled. As load_current_idt() is used to do this, calling
local_irq_save() which lockdep traces, defeats the point of calling
load_current_idt().
As interrupts are already disabled when used by lockdep and NMI, the
only other user is tracing that can disable interrupts itself. Simply
have the tracing update disable interrupts before calling load_current_idt()
instead of breaking the other users.
Here's the dump that happened:
------------[ cut here ]------------
WARNING: at /work/autotest/nobackup/linux-test.git/kernel/fork.c:1196 copy_process+0x2c3/0x1398()
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled)
Modules linked in:
CPU: 1 PID: 4570 Comm: gdm-simple-gree Not tainted 3.10.0-rc3-test+ #5
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
ffffffff81d2a7a5 ffff88006ed13d50 ffffffff8192822b ffff88006ed13d90
ffffffff81035f25 ffff8800721c6000 ffff88006ed13da0 0000000001200011
0000000000000000 ffff88006ed5e000 ffff8800721c6000 ffff88006ed13df0
Call Trace:
[<
ffffffff8192822b>] dump_stack+0x19/0x1b
[<
ffffffff81035f25>] warn_slowpath_common+0x67/0x80
[<
ffffffff81035fe1>] warn_slowpath_fmt+0x46/0x48
[<
ffffffff812bfc5d>] ? __raw_spin_lock_init+0x31/0x52
[<
ffffffff810341f7>] copy_process+0x2c3/0x1398
[<
ffffffff8103539d>] do_fork+0xa8/0x260
[<
ffffffff810ca7b1>] ? trace_preempt_on+0x2a/0x2f
[<
ffffffff812afb3e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<
ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<
ffffffff81937fe7>] ? sysret_check+0x1b/0x56
[<
ffffffff810355cf>] SyS_clone+0x16/0x18
[<
ffffffff81938369>] stub_clone+0x69/0x90
[<
ffffffff81937fc2>] ? system_call_fastpath+0x16/0x1b
---[ end trace
8b157a9d20ca1aa2 ]---
in fork.c:
#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); <-- bug here
DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt (Red Hat) [Fri, 21 Jun 2013 14:29:05 +0000 (10:29 -0400)]
trace,x86: Move creation of irq tracepoints from apic.c to irq.c
Compiling without CONFIG_X86_LOCAL_APIC set, apic.c will not be
compiled, and the irq tracepoints will not be created via the
CREATE_TRACE_POINTS macro. When CONFIG_X86_LOCAL_APIC is not set,
we get the following build error:
LD init/built-in.o
arch/x86/built-in.o: In function `trace_x86_platform_ipi_entry':
linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:66: undefined reference to `__tracepoint_x86_platform_ipi_entry'
arch/x86/built-in.o: In function `trace_x86_platform_ipi_exit':
linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:66: undefined reference to `__tracepoint_x86_platform_ipi_exit'
arch/x86/built-in.o: In function `trace_irq_work_entry':
linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:72: undefined reference to `__tracepoint_irq_work_entry'
arch/x86/built-in.o: In function `trace_irq_work_exit':
linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:72: undefined reference to `__tracepoint_irq_work_exit'
arch/x86/built-in.o:(__jump_table+0x8): undefined reference to `__tracepoint_x86_platform_ipi_entry'
arch/x86/built-in.o:(__jump_table+0x14): undefined reference to `__tracepoint_x86_platform_ipi_exit'
arch/x86/built-in.o:(__jump_table+0x20): undefined reference to `__tracepoint_irq_work_entry'
arch/x86/built-in.o:(__jump_table+0x2c): undefined reference to `__tracepoint_irq_work_exit'
make[1]: *** [vmlinux] Error 1
make: *** [sub-make] Error 2
As irq.c is always compiled for x86, it is a more appropriate location
to create the irq tracepoints.
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Seiji Aguchi [Thu, 20 Jun 2013 15:46:53 +0000 (11:46 -0400)]
x86, trace: Add irq vector tracepoints
[Purpose of this patch]
As Vaibhav explained in the thread below, tracepoints for irq vectors
are useful.
http://www.spinics.net/lists/mm-commits/msg85707.html
<snip>
The current interrupt traces from irq_handler_entry and irq_handler_exit
provide when an interrupt is handled. They provide good data about when
the system has switched to kernel space and how it affects the currently
running processes.
There are some IRQ vectors which trigger the system into kernel space,
which are not handled in generic IRQ handlers. Tracing such events gives
us the information about IRQ interaction with other system events.
The trace also tells where the system is spending its time. We want to
know which cores are handling interrupts and how they are affecting other
processes in the system. Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
<snip>
On the other hand, my usecase is tracing just local timer event and
getting a value of instruction pointer.
I suggested to add an argument local timer event to get instruction pointer before.
But there is another way to get it with external module like systemtap.
So, I don't need to add any argument to irq vector tracepoints now.
[Patch Description]
Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events.
But there is an above use case to trace specific irq_vector rather than tracing all events.
In this case, we are concerned about overhead due to unwanted events.
So, add following tracepoints instead of introducing irq_vector_entry/exit.
so that we can enable them independently.
- local_timer_vector
- reschedule_vector
- call_function_vector
- call_function_single_vector
- irq_work_entry_vector
- error_apic_vector
- thermal_apic_vector
- threshold_apic_vector
- spurious_apic_vector
- x86_platform_ipi_vector
Also, introduce a logic switching IDT at enabling/disabling time so that a time penalty
makes a zero when tracepoints are disabled. Detailed explanations are as follows.
- Create trace irq handlers with entering_irq()/exiting_irq().
- Create a new IDT, trace_idt_table, at boot time by adding a logic to
_set_gate(). It is just a copy of original idt table.
- Register the new handlers for tracpoints to the new IDT by introducing
macros to alloc_intr_gate() called at registering time of irq_vector handlers.
- Add checking, whether irq vector tracing is on/off, into load_current_idt().
This has to be done below debug checking for these reasons.
- Switching to debug IDT may be kicked while tracing is enabled.
- On the other hands, switching to trace IDT is kicked only when debugging
is disabled.
In addition, the new IDT is created only when CONFIG_TRACING is enabled to avoid being
used for other purposes.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C323ED.5050708@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Seiji Aguchi [Thu, 20 Jun 2013 15:45:44 +0000 (11:45 -0400)]
x86: Rename variables for debugging
Rename variables for debugging to describe meaning of them precisely.
Also, introduce a generic way to switch IDT by checking a current state,
debug on/off.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C323A8.7050905@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Seiji Aguchi [Thu, 20 Jun 2013 15:45:17 +0000 (11:45 -0400)]
x86, trace: Introduce entering/exiting_irq()
When implementing tracepoints in interrupt handers, if the tracepoints are
simply added in the performance sensitive path of interrupt handers,
it may cause potential performance problem due to the time penalty.
To solve the problem, an idea is to prepare non-trace/trace irq handers and
switch their IDTs at the enabling/disabling time.
So, let's introduce entering_irq()/exiting_irq() for pre/post-
processing of each irq handler.
A way to use them is as follows.
Non-trace irq handler:
smp_irq_handler()
{
entering_irq(); /* pre-processing of this handler */
__smp_irq_handler(); /*
* common logic between non-trace and trace handlers
* in a vector.
*/
exiting_irq(); /* post-processing of this handler */
}
Trace irq_handler:
smp_trace_irq_handler()
{
entering_irq(); /* pre-processing of this handler */
trace_irq_entry(); /* tracepoint for irq entry */
__smp_irq_handler(); /*
* common logic between non-trace and trace handlers
* in a vector.
*/
trace_irq_exit(); /* tracepoint for irq exit */
exiting_irq(); /* post-processing of this handler */
}
If tracepoints can place outside entering_irq()/exiting_irq() as follows,
it looks cleaner.
smp_trace_irq_handler()
{
trace_irq_entry();
smp_irq_handler();
trace_irq_exit();
}
But it doesn't work.
The problem is with irq_enter/exit() being called. They must be called before
trace_irq_enter/exit(), because of the rcu_irq_enter() must be called before
any tracepoints are used, as tracepoints use rcu to synchronize.
As a possible alternative, we may be able to call irq_enter() first as follows
if irq_enter() can nest.
smp_trace_irq_hander()
{
irq_entry();
trace_irq_entry();
smp_irq_handler();
trace_irq_exit();
irq_exit();
}
But it doesn't work, either.
If irq_enter() is nested, it may have a time penalty because it has to check if it
was already called or not. The time penalty is not desired in performance sensitive
paths even if it is tiny.
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/51C3238D.9040706@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Thu, 20 Jun 2013 15:44:44 +0000 (11:44 -0400)]
tracing: Add DEFINE_EVENT_FN() macro
Each TRACE_EVENT() adds several helper functions. If two or more trace events
share the same structure and print format, they can also share most of these
helper functions and save a lot of space from duplicate code. This is why the
DECLARE_EVENT_CLASS() and DEFINE_EVENT() were created.
Some events require a trigger to be called at registering and unregistering of
the event and to do so they use TRACE_EVENT_FN().
If multiple events require a trigger, they currently have no choice but to use
TRACE_EVENT_FN() as there's no DEFINE_EVENT_FN() available. This unfortunately
causes a lot of wasted duplicate code created.
By adding a DEFINE_EVENT_FN(), these events can still use a
DECLARE_EVENT_CLASS() and then define their own triggers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/51C3236C.8030508@hds.com
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Linus Torvalds [Sat, 15 Jun 2013 21:51:07 +0000 (11:51 -1000)]
Linux 3.10-rc6
Linus Torvalds [Sat, 15 Jun 2013 21:49:48 +0000 (11:49 -1000)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"These are a little later than I planned on since I got caught up with
handling merges for 3.11 most of the week.
Another week, another batch of fixes for arm-soc platforms.
Again, nothing controversial. A few more than would be ideal, but all
are valid fixes. In particular the prima2 panic patch is critical
since it fixes a problem where multiplatform kernels panic on all but
prima2 hardware."
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: SAMSUNG: pm: Adjust for pinctrl- and DT-enabled platforms
ARM: prima2: fix incorrect panic usage
arm: mvebu: armada-xp-{gp,openblocks-ax3-4}: specify PCIe range
ARM: Kirkwood: handle mv88f6282 cpu in __kirkwood_variant().
ARM: omap3: clock: fix wrong container_of in clock36xx.c
ARM: dts: OMAP5: Fix missing PWM capability to timer nodes
ARM: dts: omap4-panda|sdp: Fix mux for twl6030 IRQ pin and msecure line
ARM: dts: AM33xx: Fix properties on gpmc node
arm: omap2: fix AM33xx hwmod infos for UART2
ARM: OMAP3: Fix iva2_pwrdm settings for 3703
Linus Torvalds [Sat, 15 Jun 2013 21:47:56 +0000 (11:47 -1000)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Fix RTNL locking in batman-adv, from Matthias Schiffer.
2) Don't allow non-passthrough macvlan devices to set NOPROMISC via
netlink, otherwise we can end up with corrupted promisc counter
values on the device. From Michael S Tsirkin.
3) Fix stmmac driver build with debugging defines enabled, from Dinh
Nguyen.
4) Make sure name string we give in socket address in AF_PACKET is NULL
terminated, from Daniel Borkmann.
5) Fix leaking of two uninitialized bytes of memory to userspace in
l2tp, from Guillaume Nault.
6) Clear IPCB(skb) before tunneling otherwise we touch dangling IP
options state and crash. From Saurabh Mohan.
7) Fix suspend/resume for davinci_mdio by using suspend_late and
resume_early. From Mugunthan V N.
8) Don't tag ip_tunnel_init_net and ip_tunnel_delete_net with
__net_{init,exit}, they can be called outside of those contexts.
From Eric Dumazet.
9) Fix RX length error in sh_eth driver, from Yoshihiro Shimoda.
10) Fix missing sctp_outq initialization in some code paths of SCTP
stack, from Neil Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
sctp: fully initialize sctp_outq in sctp_outq_init
netiucv: Hold rtnl between name allocation and device registration.
tulip: Properly check dma mapping result
net: sh_eth: fix incorrect RX length error if R8A7740
ip_tunnel: remove __net_init/exit from exported functions
drivers: net: davinci_mdio: restore mdio clk divider in mdio resume
drivers: net: davinci_mdio: moving mdio resume earlier than cpsw ethernet driver
net/ipv4: ip_vti clear skb cb before tunneling.
tg3: Wait for boot code to finish after power on
l2tp: Fix sendmsg() return value
l2tp: Fix PPP header erasure and memory leak
bonding: fix igmp_retrans type and two related races
bonding: reset master mac on first enslave failure
packet: packet_getname_spkt: make sure string is always 0-terminated
net: ethernet: stmicro: stmmac: Fix compile error when STMMAC_XMIT_DEBUG used
be2net: Fix 32-bit DMA Mask handling
xen-netback: don't de-reference vif pointer after having called xenvif_put()
macvlan: don't touch promisc without passthrough
batman-adv: Don't handle address updates when bla is disabled
batman-adv: forward late OGMs from best next hop
...
Linus Torvalds [Sat, 15 Jun 2013 05:25:04 +0000 (19:25 -1000)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
Pull powerpc fixes from Benjamin Herrenschmidt:
"So here are 3 fixes still for 3.10. Fixes are simple, bugs are nasty
(though not recent regressions, nasty enough) and all targeted at
stable"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Fix missing/delayed calls to irq_work
powerpc: Fix emulation of illegal instructions on PowerNV platform
powerpc: Fix stack overflow crash in resume_kernel when ftracing
David Daney [Fri, 14 Jun 2013 18:13:59 +0000 (11:13 -0700)]
smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu().
Thanks to commit
f91eb62f71b3 ("init: scream bloody murder if interrupts
are enabled too early"), "bloody murder" is now being screamed.
With a MIPS OCTEON config, we use on_each_cpu() in our
irq_chip.irq_bus_sync_unlock() function. This gets called in early as a
result of the time_init() call. Because the !SMP version of
on_each_cpu() unconditionally enables irqs, we get:
WARNING: at init/main.c:560 start_kernel+0x250/0x410()
Interrupts were enabled early
CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-rc5-Cavium-Octeon+ #801
Call Trace:
show_stack+0x68/0x80
warn_slowpath_common+0x78/0xb0
warn_slowpath_fmt+0x38/0x48
start_kernel+0x250/0x410
Suggested fix: Do what we already do in the SMP version of
on_each_cpu(), and use local_irq_save/local_irq_restore. Because we
need a flags variable, make it a static inline to avoid name space
issues.
[ Change from v1: Convert on_each_cpu to a static inline function, add
#include <linux/irqflags.h> to avoid build breakage on some files.
on_each_cpu_mask() and on_each_cpu_cond() suffer the same problem as
on_each_cpu(), but they are not causing !SMP bugs for me, so I will
defer changing them to a less urgent patch. ]
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 15 Jun 2013 05:18:56 +0000 (19:18 -1000)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"Several fixes + obvious cleanup (you've missed a couple of open-coded
can_lookup() back then)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
snd_pcm_link(): fix a leak...
use can_lookup() instead of direct checks of ->i_op->lookup
move exit_task_namespaces() outside of exit_notify()
fput: task_work_add() can fail if the caller has passed exit_task_work()
ncpfs: fix rmdir returns Device or resource busy
Linus Torvalds [Sat, 15 Jun 2013 05:16:31 +0000 (19:16 -1000)]
Merge tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs
Pull xfs fixes from Ben Myers:
- Remove noisy warnings about experimental support which spams the logs
- Add padding to align directory and attr structures correctly
- Set block number on child buffer on a root btree split
- Disable verifiers during log recovery for non-CRC filesystems
* tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs:
xfs: don't shutdown log recovery on validation errors
xfs: ensure btree root split sets blkno correctly
xfs: fix implicit padding in directory and attr CRC formats
xfs: don't emit v5 superblock warnings on write
Linus Torvalds [Sat, 15 Jun 2013 05:15:36 +0000 (19:15 -1000)]
Merge tag 'char-misc-3.10-rc5' of git://git./linux/kernel/git/gregkh/char-misc
Pull char / misc fixes from Greg Kroah-Hartman:
"Here are some small mei driver fixes for 3.10-rc6 that fix some
reported problems"
* tag 'char-misc-3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mei: me: clear interrupts on the resume path
mei: nfc: fix nfc device freeing
mei: init: Flush scheduled work before resetting the device
Linus Torvalds [Sat, 15 Jun 2013 05:14:39 +0000 (19:14 -1000)]
Merge tag 'usb-3.10-rc5' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg Kroah-Hartman:
"Here are some small USB driver fixes that resolve some reported
problems for 3.10-rc6
Nothing major, just 3 USB serial driver fixes, and two chipidea fixes"
* tag 'usb-3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: chipidea: fix id change handling
usb: chipidea: fix no transceiver case
USB: pl2303: fix device initialisation at open
USB: spcp8x5: fix device initialisation at open
USB: f81232: fix device initialisation at open
Benjamin Herrenschmidt [Sat, 15 Jun 2013 02:13:40 +0000 (12:13 +1000)]
powerpc: Fix missing/delayed calls to irq_work
When replaying interrupts (as a result of the interrupt occurring
while soft-disabled), in the case of the decrementer, we are exclusively
testing for a pending timer target. However we also use decrementer
interrupts to trigger the new "irq_work", which in this case would
be missed.
This change the logic to force a replay in both cases of a timer
boundary reached and a decrementer interrupt having actually occurred
while disabled. The former test is still useful to catch cases where
a CPU having been hard-disabled for a long time completely misses the
interrupt due to a decrementer rollover.
CC: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Paul Mackerras [Fri, 14 Jun 2013 10:07:41 +0000 (20:07 +1000)]
powerpc: Fix emulation of illegal instructions on PowerNV platform
Normally, the kernel emulates a few instructions that are unimplemented
on some processors (e.g. the old dcba instruction), or privileged (e.g.
mfpvr). The emulation of unimplemented instructions is currently not
working on the PowerNV platform. The reason is that on these machines,
unimplemented and illegal instructions cause a hypervisor emulation
assist interrupt, rather than a program interrupt as on older CPUs.
Our vector for the emulation assist interrupt just calls
program_check_exception() directly, without setting the bit in SRR1
that indicates an illegal instruction interrupt. This fixes it by
making the emulation assist interrupt set that bit before calling
program_check_interrupt(). With this, old programs that use no-longer
implemented instructions such as dcba now work again.
CC: <stable@vger.kernel.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Thu, 13 Jun 2013 11:04:56 +0000 (21:04 +1000)]
powerpc: Fix stack overflow crash in resume_kernel when ftracing
It's possible for us to crash when running with ftrace enabled, eg:
Bad kernel stack pointer
bffffd12 at
c00000000000a454
cpu 0x3: Vector: 300 (Data Access) at [
c00000000ffe3d40]
pc:
c00000000000a454: resume_kernel+0x34/0x60
lr:
c00000000000335c: performance_monitor_common+0x15c/0x180
sp:
bffffd12
msr:
8000000000001032
dar:
bffffd12
dsisr:
42000000
If we look at current's stack (paca->__current->stack) we see it is
equal to
c0000002ecab0000. Our stack is 16K, and comparing to
paca->kstack (
c0000002ecab3e30) we can see that we have overflowed our
kernel stack. This leads to us writing over our struct thread_info, and
in this case we have corrupted thread_info->flags and set
_TIF_EMULATE_STACK_STORE.
Dumping the stack we see:
3:mon> t
c0000002ecab0000
[
c0000002ecab0000]
c00000000002131c .performance_monitor_exception+0x5c/0x70
[
c0000002ecab0080]
c00000000000335c performance_monitor_common+0x15c/0x180
--- Exception: f01 (Performance Monitor) at
c0000000000fb2ec .trace_hardirqs_off+0x1c/0x30
[
c0000002ecab0370]
c00000000016fdb0 .trace_graph_entry+0xb0/0x280 (unreliable)
[
c0000002ecab0410]
c00000000003d038 .prepare_ftrace_return+0x98/0x130
[
c0000002ecab04b0]
c00000000000a920 .ftrace_graph_caller+0x14/0x28
[
c0000002ecab0520]
c0000000000d6b58 .idle_cpu+0x18/0x90
[
c0000002ecab05a0]
c00000000000a934 .return_to_handler+0x0/0x34
[
c0000002ecab0620]
c00000000001e660 .timer_interrupt+0x160/0x300
[
c0000002ecab06d0]
c0000000000025dc decrementer_common+0x15c/0x180
--- Exception: 901 (Decrementer) at
c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
[
c0000002ecab09c0]
c0000000000fe044 .trace_hardirqs_on+0x14/0x30 (unreliable)
[
c0000002ecab0fb0]
c00000000016fe3c .trace_graph_entry+0x13c/0x280
[
c0000002ecab1050]
c00000000003d038 .prepare_ftrace_return+0x98/0x130
[
c0000002ecab10f0]
c00000000000a920 .ftrace_graph_caller+0x14/0x28
[
c0000002ecab1160]
c0000000000161f0 .__ppc64_runlatch_on+0x10/0x40
[
c0000002ecab11d0]
c00000000000a934 .return_to_handler+0x0/0x34
--- Exception: 901 (Decrementer) at
c0000000000104d4 .arch_local_irq_restore+0x74/0xa0
... and so on
__ppc64_runlatch_on() is called from RUNLATCH_ON in the exception entry
path. At that point the irq state is not consistent, ie. interrupts are
hard disabled (by the exception entry), but the paca soft-enabled flag
may be out of sync.
This leads to the local_irq_restore() in trace_graph_entry() actually
enabling interrupts, which we do not want. Because we have not yet
reprogrammed the decrementer we immediately take another decrementer
exception, and recurse.
The fix is twofold. Firstly make sure we call DISABLE_INTS before
calling RUNLATCH_ON. The badly named DISABLE_INTS actually reconciles
the irq state in the paca with the hardware, making it safe again to
call local_irq_save/restore().
Although that should be sufficient to fix the bug, we also mark the
runlatch routines as notrace. They are called very early in the
exception entry and we are asking for trouble tracing them. They are
also fairly uninteresting and tracing them just adds unnecessary
overhead.
[ This regression was introduced by
fe1952fc0afb9a2e4c79f103c08aef5d13db1873
"powerpc: Rework runlatch code" by myself --BenH
]
CC: <stable@vger.kernel.org> [v3.4+]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Al Viro [Wed, 5 Jun 2013 18:07:08 +0000 (14:07 -0400)]
snd_pcm_link(): fix a leak...
in case when snd_pcm_stream_linked(substream) is true, we end up leaking
group.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 6 Jun 2013 23:33:47 +0000 (19:33 -0400)]
use can_lookup() instead of direct checks of ->i_op->lookup
a couple of places got missed back when Linus has introduced that one...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Oleg Nesterov [Fri, 14 Jun 2013 19:09:49 +0000 (21:09 +0200)]
move exit_task_namespaces() outside of exit_notify()
exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
However, after
32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().
Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().
This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add().
Note: this particular problem can be fixed if we change fput(), and
that change makes sense anyway. But there is another reason to move
the callsite. The original reason for exit_task_namespaces() from
the middle of exit_notify() was subtle and it has already gone away,
now this looks confusing. And this allows us do simplify exit_notify(),
we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
of PF_EXITING in forget_original_parent().
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Oleg Nesterov [Fri, 14 Jun 2013 19:09:47 +0000 (21:09 +0200)]
fput: task_work_add() can fail if the caller has passed exit_task_work()
fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.
Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from
if (PF_KTHREAD) {
schedule_work(...);
return;
}
task_work_add(...)
to
if (!PF_KTHREAD) {
if (!task_work_add(...))
return;
/* fallback */
}
schedule_work(...);
As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Wed, 12 Jun 2013 02:19:06 +0000 (12:19 +1000)]
xfs: don't shutdown log recovery on validation errors
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.
Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.
For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit
9222a9cf86c0d64ffbedf567412b55da18763aa3)
Dave Chinner [Wed, 12 Jun 2013 02:19:08 +0000 (12:19 +1000)]
xfs: ensure btree root split sets blkno correctly
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:
XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317
which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:
$ pminfo -f xfs.btree.block_map
xfs.btree.block_map.lookup
value 39135
xfs.btree.block_map.compare
value 268432
xfs.btree.block_map.insrec
value 15786
xfs.btree.block_map.delrec
value 13884
xfs.btree.block_map.newroot
value 2
xfs.btree.block_map.killroot
value 0
.....
Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.
Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.
Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit
ade1335afef556df6538eb02e8c0dc91fbd9cc37)
Dave Chinner [Wed, 12 Jun 2013 02:19:07 +0000 (12:19 +1000)]
xfs: fix implicit padding in directory and attr CRC formats
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:
.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.
It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:
[ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[ 172.316346] #9 calling xfs_dir2_data_log_unused()
[ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096
Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.
Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.
And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.
Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit
8a1fd2950e1fe267e11fc8c85dcaa6b023b51b60)
Dave Chinner [Mon, 27 May 2013 06:38:19 +0000 (16:38 +1000)]
xfs: don't emit v5 superblock warnings on write
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:
XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!
And spamming the logs.
We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit
34510185abeaa5be9b178a41c0a03d30aec3db7e)
Linus Torvalds [Fri, 14 Jun 2013 05:34:14 +0000 (22:34 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This is an assortment of crash fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: stop all workers before cleaning up roots
Btrfs: fix use-after-free bug during umount
Btrfs: init relocate extent_io_tree with a mapping
btrfs: Drop inode if inode root is NULL
Btrfs: don't delete fs_roots until after we cleanup the transaction
Tomas Winkler [Wed, 5 Jun 2013 07:51:13 +0000 (10:51 +0300)]
mei: me: clear interrupts on the resume path
We need to clear pending interrupts on the resume
path. This brings the device into defined state
before starting the reset flow
This should solve suspend/resume issues:
mei_me : wait hw ready failed. status = 0x0
mei_me : version message write failed
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tomas Winkler [Mon, 10 Jun 2013 07:10:26 +0000 (10:10 +0300)]
mei: nfc: fix nfc device freeing
The nfc_dev is a static variable and is not cleaned properly upon reset
mainly ndev->cl and ndev->cl_info are not set to NULL after freeing which
mei_stop:198: mei_me 0000:00:16.0: stopping the device.
[ 404.253427] general protection fault: 0000 [#2] SMP
[ 404.253437] Modules linked in: mei_me(-) binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave fuse loop dm_mod hid_generic usbhid hid coretemp acpi_cpufreq mperf kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul snd_hda_codec_hdmi glue_helper aes_x86_64 e1000e snd_hda_intel snd_hda_codec ehci_pci iTCO_wdt iTCO_vendor_support ehci_hcd snd_hwdep xhci_hcd snd_pcm usbcore ptp mei sg microcode snd_timer pps_core i2c_i801 snd pcspkr battery rtc_cmos lpc_ich mfd_core soundcore usb_common snd_page_alloc ac ext3 jbd mbcache drm_kms_helper drm intel_agp i2c_algo_bit intel_gtt i2c_core sd_mod crc_t10dif thermal fan video button processor thermal_sys hwmon ahci libahci libata scsi_mod [last unloaded: mei_me]
[ 404.253591] CPU: 0 PID: 5551 Comm: modprobe Tainted: G D W 3.10.0-rc3 #1
[ 404.253611] task:
ffff880143cd8300 ti:
ffff880144a2a000 task.ti:
ffff880144a2a000
[ 404.253619] RIP: 0010:[<
ffffffff81334e5d>] [<
ffffffff81334e5d>] device_del+0x1d/0x1d0
[ 404.253638] RSP: 0018:
ffff880144a2bcf8 EFLAGS:
00010206
[ 404.253645] RAX:
2020302e30202030 RBX:
ffff880144fdb000 RCX:
0000000000000086
[ 404.253652] RDX:
0000000000000001 RSI:
0000000000000086 RDI:
ffff880144fdb000
[ 404.253659] RBP:
ffff880144a2bd18 R08:
0000000000000651 R09:
0000000000000006
[ 404.253666] R10:
0000000000000651 R11:
0000000000000006 R12:
ffff880144fdb000
[ 404.253673] R13:
ffff880149371098 R14:
ffff880144482c00 R15:
ffffffffa04710e0
[ 404.253681] FS:
00007f251c59a700(0000) GS:
ffff88014e200000(0000) knlGS:
0000000000000000
[ 404.253689] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 404.253696] CR2:
ffffffffff600400 CR3:
0000000145319000 CR4:
00000000001407f0
[ 404.253703] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 404.253710] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 404.253716] Stack:
[ 404.253720]
ffff880144fdb000 ffff880143ffe000 ffff880149371098 ffffffffa0471000
[ 404.253732]
ffff880144a2bd38 ffffffff8133502d ffff88014e20cf48 ffff880143ffe1d8
[ 404.253744]
ffff880144a2bd48 ffffffffa02a4749 ffff880144a2bd58 ffffffffa02a4ba1
[ 404.253755] Call Trace:
[ 404.253766] [<
ffffffff8133502d>] device_unregister+0x1d/0x60
[ 404.253787] [<
ffffffffa02a4749>] mei_cl_remove_device+0x9/0x10 [mei]
[ 404.253804] [<
ffffffffa02a4ba1>] mei_nfc_host_exit+0x21/0x30 [mei]
[ 404.253819] [<
ffffffffa029c2dd>] mei_stop+0x3d/0x90 [mei]
[ 404.253830] [<
ffffffffa046e220>] mei_me_remove+0x60/0xe0 [mei_me]
[ 404.253843] [<
ffffffff81278f37>] pci_device_remove+0x37/0xb0
[ 404.253855] [<
ffffffff81337c68>] __device_release_driver+0x98/0x100
[ 404.253865] [<
ffffffff81337d80>] driver_detach+0xb0/0xc0
[ 404.253876] [<
ffffffff81336b4f>] bus_remove_driver+0x8f/0x120
[ 404.253891] [<
ffffffff81075990>] ? try_to_wake_up+0x2b0/0x2b0
[ 404.253903] [<
ffffffff81338a48>] driver_unregister+0x58/0x90
[ 404.253913] [<
ffffffff8127906b>] pci_unregister_driver+0x2b/0xb0
[ 404.253924] [<
ffffffffa046f244>] mei_me_driver_exit+0x10/0xdcc [mei_me]
[ 404.253936] [<
ffffffff810a50d8>] SyS_delete_module+0x198/0x2b0
[ 404.253949] [<
ffffffff814850d9>] ? do_page_fault+0x9/0x10
[ 404.253961] [<
ffffffff81489692>] system_call_fastpath+0x16/0x1b
[ 404.253967] Code: 41 5c 41 5d 41 5e 41 5f c9 c3 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc 53 48 8b 87 88 00 00 00 4c 8b 37 48 85 c0 74 18 <48> 8b 78 78 4c 89 e2 be 02 00 00 00 48 81 c7 f8 00 00 00 e8 3b
[ 404.254048] RIP [<
ffffffff81334e5d>] device_del+0x1d/0x1d0
Cc: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Samuel Ortiz [Mon, 10 Jun 2013 07:10:25 +0000 (10:10 +0300)]
mei: init: Flush scheduled work before resetting the device
Flushing pending work items before resetting the device makes more
sense than doing so afterwards. Some of them, like e.g. the NFC
initialization one, find themselves with client IDs changed after
the reset, eventually leading to trigger a client.c:mei_me_cl_by_id()
warning after a few modprobe/rmmod cycles.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Neil Horman [Wed, 12 Jun 2013 18:26:44 +0000 (14:26 -0400)]
sctp: fully initialize sctp_outq in sctp_outq_init
In commit
2f94aabd9f6c925d77aecb3ff020f1cc12ed8f86
(refactor sctp_outq_teardown to insure proper re-initalization)
we modified sctp_outq_teardown to use sctp_outq_init to fully re-initalize the
outq structure. Steve West recently asked me why I removed the q->error = 0
initalization from sctp_outq_teardown. I did so because I was operating under
the impression that sctp_outq_init would properly initalize that value for us,
but it doesn't. sctp_outq_init operates under the assumption that the outq
struct is all 0's (as it is when called from sctp_association_init), but using
it in __sctp_outq_teardown violates that assumption. We should do a memset in
sctp_outq_init to ensure that the entire structure is in a known state there
instead.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: "West, Steve (NSN - US/Fort Worth)" <steve.west@nsn.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: netdev@vger.kernel.org
CC: davem@davemloft.net
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Poirier [Thu, 13 Jun 2013 13:09:47 +0000 (09:09 -0400)]
netiucv: Hold rtnl between name allocation and device registration.
fixes a race condition between concurrent initializations of netiucv devices
that try to use the same name.
sysfs: cannot create duplicate filename '/devices/iucv/netiucv2'
[...]
Call Trace:
([<
00000000002edea4>] sysfs_add_one+0xb0/0xdc)
[<
00000000002eecd4>] create_dir+0x80/0xfc
[<
00000000002eee38>] sysfs_create_dir+0xe8/0x118
[<
00000000003835a8>] kobject_add_internal+0x120/0x2d0
[<
00000000003839d6>] kobject_add+0x62/0x9c
[<
00000000003d9564>] device_add+0xcc/0x510
[<
000003e00212c7b4>] netiucv_register_device+0xc0/0x1ec [netiucv]
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Tested-by: Ursula Braun <braunu@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Thu, 13 Jun 2013 19:31:28 +0000 (15:31 -0400)]
tulip: Properly check dma mapping result
Tulip throws an error when dma debugging is enabled, as it doesn't properly
check dma mapping results with dma_mapping_error() durring tx ring refills.
Easy fix, just add it in, and drop the frame if the mapping is bad
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Grant Grundler <grundler@parisc-linux.org>
CC: "David S. Miller" <davem@davemloft.net>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 13 Jun 2013 22:32:17 +0000 (15:32 -0700)]
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree bug fixes from Grant Likely:
"This branch contains the following bug fixes:
- Fix locking vs. interrupts. Bug caught by lockdep checks
- Fix parsing of cpp #line directive output by dtc
- Fix 'make clean' for dtc temporary files.
There is also a commit that regenerates the dtc lexer and parser files
with Bison 2.5. The only purpose of this commit is to separate the
functional change in the dtc bug fix from the code generation change
caused by a different Bison version"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
dtc: ensure #line directives don't consume data from the next line
dtc: Update generated files to output from Bison 2.5
of: Fix locking vs. interrupts
kbuild: make sure we clean up DTB temporary files
Grant Likely [Thu, 13 Jun 2013 11:57:44 +0000 (12:57 +0100)]
dtc: ensure #line directives don't consume data from the next line
Previously, the #line parsing regex ended with ({WS}+[0-9]+)?. The {WS}
could match line-break characters. If the #line directive did not contain
the optional flags field at the end, this could cause any integer data on
the next line to be consumed as part of the #line directive parsing. This
could cause syntax errors (i.e. #line parsing consuming the leading 0
from a hex literal 0x1234, leaving x1234 to be parsed as cell data,
which is a syntax error), or invalid compilation results (i.e. simply
consuming literal 1234 as part of the #line processing, thus removing it
from the cell data).
Fix this by replacing {WS} with [ \t] so that it can't match line-breaks.
Convert all instances of {WS}, even though the other instances should be
irrelevant for any well-formed #line directive. This is done for
consistency and ultimate safety.
[Cherry picked from DTC commit
a1ee6f068e1c8dbc62873645037a353d7852d5cc]
Reported-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Grant Likely [Thu, 13 Jun 2013 12:00:43 +0000 (13:00 +0100)]
dtc: Update generated files to output from Bison 2.5
This patch merely updates the generated dtc parser and lexer files to
the output generated by Bison 2.5. The previous versions were generated
from version 2.4.1. The only reason for this commit is to minimize the
diff on the next commit which fixes a bug in the DTC #line directive
parsing. Otherwise the Bison changes would be intermingled with the
functional changes.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Benjamin Herrenschmidt [Wed, 12 Jun 2013 05:39:04 +0000 (15:39 +1000)]
of: Fix locking vs. interrupts
The OF code uses irqsafe locks everywhere except in a handful of functions
for no obvious reasons. Since the conversion from the old rwlocks, this
now triggers lockdep warnings when used at interrupt time. At least one
driver (ibmvscsi) seems to be doing that from softirq context.
This converts the few non-irqsafe locks into irqsafe ones, making them
consistent with the rest of the code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Ian Campbell [Fri, 31 May 2013 10:14:20 +0000 (11:14 +0100)]
kbuild: make sure we clean up DTB temporary files
Various temporary files used when building DTB files were not suffixed with
.tmp and therefore were not cleaned up by "make clean".
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Stephen Warren <swarren@nvidia.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Grant Likely <grant.likely@linaro.org>
Linus Torvalds [Thu, 13 Jun 2013 20:09:50 +0000 (13:09 -0700)]
Merge tag 'acpi-3.10-rc6' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"This is an alternative fix for the regression introduced in 3.9 whose
previous fix had to be reverted right before 3.10-rc5, because it
broke one of the Tony's machines.
In this one the check is confined to the ACPI video driver (which is
the only one causing the problem to happen in the first place) and the
Tony's box shouldn't even notice it.
- ACPI fix for an issue causing ACPI video driver to attempt to bind
to devices it shouldn't touch from Rafael J Wysocki."
* tag 'acpi-3.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / video: Do not bind to device objects with a scan handler
Linus Torvalds [Thu, 13 Jun 2013 20:08:51 +0000 (13:08 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"Another set of fixes, the biggest bit of this is yet another tweak to
the UEFI anti-bricking code; apparently we finally got some feedback
from Samsung as to what makes at least their systems fail. This set
should actually fix the boot regressions that some other systems (e.g.
SGI) have exhibited.
Other than that, there is a patch to avoid a panic with particularly
unhappy memory layouts and two minor protocol fixes which may or may
not be manifest bugs"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix typo in kexec register clearing
x86, relocs: Move __vvar_page from S_ABS to S_REL
Modify UEFI anti-bricking code
x86: Fix adjust_range_size_mask calling position
Linus Torvalds [Thu, 13 Jun 2013 19:36:42 +0000 (12:36 -0700)]
Merge branch 'rcu/urgent' of git://git./linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
"I must confess that this past merge window was not RCU's best showing.
This series contains three more fixes for RCU regressions:
1. A fix to __DECLARE_TRACE_RCU() that causes it to act as an
interrupt from idle rather than as a task switch from idle.
This change is needed due to the recent use of _rcuidle()
tracepoints that can be invoked from interrupt handlers as well
as from idle. Without this fix, invoking _rcuidle() tracepoints
from interrupt handlers results in splats and (more seriously)
confusion on RCU's part as to whether a given CPU is idle or not.
This confusion can in turn result in too-short grace periods and
therefore random memory corruption.
2. A fix to a subtle deadlock that could result due to RCU doing
a wakeup while holding one of its rcu_node structure's locks.
Although the probability of occurrence is low, it really
does happen. The fix, courtesy of Steven Rostedt, uses
irq_work_queue() to avoid the deadlock.
3. A fix to a silent deadlock (invisible to lockdep) due to the
interaction of timeouts posted by RCU debug code enabled by
CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
hotplug operations. This will not occur in production kernels,
but really does occur in randconfig testing. Diagnosis courtesy
of Steven Rostedt"
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
rcu: Don't call wakeup() with rcu_node structure ->lock held
trace: Allow idle-safe tracepoints to be called from irq
Linus Torvalds [Thu, 13 Jun 2013 18:02:31 +0000 (11:02 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Three kvm related memory management fixes, a fix for show_trace, a fix
for early console output and a patch from Ben to help prevent compile
errors in regard to irq functions (or our lack thereof)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: Implement IRQ functions if !PCI
s390/sclp: fix new line detection
s390/pgtable: make pgste lock an explicit barrier
s390/pgtable: Save pgste during modify_prot_start/commit
s390/dumpstack: fix address ranges for asynchronous and panic stack
s390/pgtable: Fix guest overindication for change bit
Linus Torvalds [Thu, 13 Jun 2013 17:18:33 +0000 (10:18 -0700)]
Merge tag 'asoc-v3.10-rc5' of git://git./linux/kernel/git/broonie/sound
Pull ASoC sound updates from Mark Brown:
"Takashi is travelling at the minute and it'd be good to get the
MAINTAINERS update in here merged so sending directly.
As well as the usual driver specifics we've got a couple of core fixes
here, one fixing capabilities for unidirectional streams and the other
fixing suspend while audio streams are active.
The suspend fix is a little involved but mostly as a result of
removing some special casing that was doing the wrong thing."
* tag 'asoc-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound:
ASoC: tlv320aic3x: Remove deadlock from snd_soc_dapm_put_volsw_aic3x()
ASoC: dapm: Treat DAI widgets like AIF widgets for power
ASoC: arizona: Correct AEC loopback enable
ASoC: pcm: Require both CODEC and CPU support when declaring stream caps
MAINTAINERS: Remove myself from Wolfson maintainers
ASoC: wm8994: Ensure microphone detection state is reset on removal
ASoC: wm8994: Avoid leaking pm_runtime reference on removed jack race
ASoC: cs42l52: fix hp_gain_enum shift value.
ASoC: cs42l52: use correct PCM mixer TLV dB scale to match datasheet.
Linus Torvalds [Thu, 13 Jun 2013 17:13:29 +0000 (10:13 -0700)]
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
Josh Triplett [Thu, 13 Jun 2013 00:26:37 +0000 (17:26 -0700)]
turbostat: Increase output buffer size to accommodate C8-C10
On platforms with C8-C10 support, the additional C-states cause
turbostat to overrun its output buffer of 128 bytes per CPU. Increase
this to 256 bytes per CPU.
[ As a bugfix, this should go into 3.10; however, since the C8-C10
support didn't go in until after 3.9, this need not go into any stable
kernel. ]
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H. Peter Anvin [Thu, 13 Jun 2013 15:59:23 +0000 (08:59 -0700)]
Merge tag 'efi-urgent' into x86/urgent
* More tweaking to the EFI variable anti-bricking algorithm. Quite a
few users were reporting boot regressions in v3.9. This has now been
fixed with a more accurate "minimum storage requirement to avoid
bricking" value from Samsung (5K instead of 50%) and code to trigger
garbage collection when we near our limit - Matthew Garrett.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Yoshihiro Shimoda [Thu, 13 Jun 2013 01:15:45 +0000 (10:15 +0900)]
net: sh_eth: fix incorrect RX length error if R8A7740
This patch fixes an issue that the driver increments the "RX length error"
on every buffer in sh_eth_rx() if the R8A7740.
This patch also adds a description about the Receive Frame Status bits.
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 7 Jun 2013 20:26:05 +0000 (13:26 -0700)]
ip_tunnel: remove __net_init/exit from exported functions
If CONFIG_NET_NS is not set then __net_init is the same as __init and
__net_exit is the same as __exit. These functions will be removed from
memory after the module loads or is removed. Functions that are exported
for use by other functions should never be labeled for removal.
Bug introduced by commit
c54419321455631079c
("GRE: Refactor GRE tunneling code.")
Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Tue, 11 Jun 2013 10:02:05 +0000 (15:32 +0530)]
drivers: net: davinci_mdio: restore mdio clk divider in mdio resume
During suspend resume cycle all the register data is lost, so MDIO
clock divier value gets reset. This patch restores the clock divider
value.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mugunthan V N [Tue, 11 Jun 2013 10:02:04 +0000 (15:32 +0530)]
drivers: net: davinci_mdio: moving mdio resume earlier than cpsw ethernet driver
MDIO driver should resume before CPSW ethernet driver so that CPSW connect
to the phy and start tx/rx ethernet packets, changing the suspend/resume
apis with suspend_late/resume_early.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saurabh Mohan [Tue, 11 Jun 2013 00:45:10 +0000 (17:45 -0700)]
net/ipv4: ip_vti clear skb cb before tunneling.
If users apply shaper to vti tunnel then it will cause a kernel crash. The
problem seems to be due to the vti_tunnel_xmit function not clearing
skb->opt field before passing the packet to xfrm tunneling code.
Signed-off-by: Saurabh Mohan <saurabh@vyatta.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nithin Sujir [Wed, 12 Jun 2013 18:08:59 +0000 (11:08 -0700)]
tg3: Wait for boot code to finish after power on
Some systems that don't need wake-on-lan may choose to power down the
chip on system standby. Upon resume, the power on causes the boot code
to startup and initialize the hardware. On one new platform, this is
causing the device to go into a bad state due to a race between the
driver and boot code, once every several hundred resumes. The same race
exists on open since we come up from a power on.
This patch adds a wait for boot code signature at the beginning of
tg3_init_hw() which is common to both cases. If there has not been a
power-off or the boot code has already completed, the signature will be
present and poll_fw() returns immediately. Also return immediately if
the device does not have firmware.
Cc: stable@vger.kernel.org
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Wed, 12 Jun 2013 14:07:36 +0000 (16:07 +0200)]
l2tp: Fix sendmsg() return value
PPPoL2TP sockets should comply with the standard send*() return values
(i.e. return number of bytes sent instead of 0 upon success).
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Wed, 12 Jun 2013 14:07:23 +0000 (16:07 +0200)]
l2tp: Fix PPP header erasure and memory leak
Copy user data after PPP framing header. This prevents erasure of the
added PPP header and avoids leaking two bytes of uninitialised memory
at the end of skb's data buffer.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 11 Jun 2013 22:07:02 +0000 (00:07 +0200)]
bonding: fix igmp_retrans type and two related races
First the type of igmp_retrans (which is the actual counter of
igmp_resend parameter) is changed to u8 to be able to store values up
to 255 (as per documentation). There are two races that were hidden
there and which are easy to trigger after the previous fix, the first is
between bond_resend_igmp_join_requests and bond_change_active_slave
where igmp_retrans is set and can be altered by the periodic. The second
race condition is between multiple running instances of the periodic
(upon execution it can be scheduled again for immediate execution which
can cause the counter to go < 0 which in the unsigned case leads to
unnecessary igmp retransmissions).
Since in bond_change_active_slave bond->lock is held for reading and
curr_slave_lock for writing, we use curr_slave_lock for mutual
exclusion. We can't drop them as there're cases where RTNL is not held
when bond_change_active_slave is called. RCU is unlocked in
bond_resend_igmp_join_requests before getting curr_slave_lock since we
don't need it there and it's pointless to delay.
The decrement is moved inside the "if" block because if we decrement
unconditionally there's still a possibility for a race condition although
it is much more difficult to hit (many changes have to happen in
a very short period in order to trigger) which in the case of 3 parallel
running instances of this function and igmp_retrans == 1
(with check bond->igmp_retrans-- > 1) is:
f1 passes, doesn't re-schedule, but decrements - igmp_retrans = 0
f2 then passes, doesn't re-schedule, but decrements - igmp_retrans = 255
f3 does the unnecessary retransmissions.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Tue, 11 Jun 2013 22:07:01 +0000 (00:07 +0200)]
bonding: reset master mac on first enslave failure
If the bond device is supposed to get the first slave's MAC address and
the first enslavement fails then we need to reset the master's MAC
otherwise it will stay the same as the failed slave device. We do it
after err_undo_flags since that is the first place where the MAC can be
changed and we check if it should've been the first slave and if the
bond's MAC was set to it because that err place is used by multiple
locations prior to changing the master's MAC address.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 12 Jun 2013 14:02:27 +0000 (16:02 +0200)]
packet: packet_getname_spkt: make sure string is always 0-terminated
uaddr->sa_data is exactly of size 14, which is hard-coded here and
passed as a size argument to strncpy(). A device name can be of size
IFNAMSIZ (== 16), meaning we might leave the destination string
unterminated. Thus, use strlcpy() and also sizeof() while we're
at it. We need to memset the data area beforehand, since strlcpy
does not padd the remaining buffer with zeroes for user space, so
that we do not possibly leak anything.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dinh Nguyen [Wed, 12 Jun 2013 16:05:03 +0000 (11:05 -0500)]
net: ethernet: stmicro: stmmac: Fix compile error when STMMAC_XMIT_DEBUG used
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function:
stmmac_xmit drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:1902:74:
error: expected ) before __func__
Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Somnath Kotur [Tue, 11 Jun 2013 11:48:22 +0000 (17:18 +0530)]
be2net: Fix 32-bit DMA Mask handling
Fix to set the coherent DMA mask only if dma_set_mask() succeeded, and to
error out if either fails.
Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 13 Jun 2013 08:26:54 +0000 (01:26 -0700)]
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Included change:
- fix "rtnl locked" concurrent executions by using rtnl_lock instead of
rtnl_trylock. This fix enables batman-adv initialisation to do not fail just
because somewhere else in the system another code path is holding the rtnl
lock. It is easy to see the problem when batman-adv is trying to start
together with other networking components.
- fix the routing protocol forwarding policy by enhancing the duplicate control
packet detection. When the right circumstances trigger the issue, some nodes in
the network become totally unreachable, so breaking the mesh connectivity.
- fix the Bridge Loop Avoidance component by not running the originator address
change handling routine when the component is disabled. The routine was
generating useless packets that were sent over the network.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Beulich [Tue, 11 Jun 2013 10:00:34 +0000 (11:00 +0100)]
xen-netback: don't de-reference vif pointer after having called xenvif_put()
When putting vif-s on the rx notify list, calling xenvif_put() must be
deferred until after the removal from the list and the issuing of the
notification, as both operations dereference the pointer.
Changing this got me to notice that the "irq" variable was effectively
unused (and was of too narrow type anyway).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael S. Tsirkin [Thu, 13 Jun 2013 07:07:29 +0000 (10:07 +0300)]
macvlan: don't touch promisc without passthrough
commit
df8ef8f3aaa6692970a436204c4429210addb23a
"macvlan: add FDB bridge ops and macvlan flags"
added a way to control NOPROMISC macvlan flag through netlink.
However, with a non passthrough device we never set promisc on open,
even if NOPROMISC is off. As a result:
If userspace clears NOPROMISC on open, then does not clear it on a
netlink command, promisc counter is not decremented on stop and there
will be no way to clear it once macvlan is detached.
If userspace does not clear NOPROMISC on open, then sets NOPROMISC on a
netlink command, promisc counter will be decremented from 0 and overflow
to
fffffffff with no way to clear promisc.
To fix, simply ignore NOPROMISC flag in a netlink command for
non-passthrough devices, same as we do at open/close.
Since we touch this code anyway - check dev_set_promiscuity return code
and pass it to users (though an error here is unlikely).
Cc: "David S. Miller" <davem@davemloft.net>
Reviewed-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
H. Peter Anvin [Wed, 12 Jun 2013 14:37:43 +0000 (07:37 -0700)]
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Wed, 12 Jun 2013 01:01:22 +0000 (11:01 +1000)]
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.
Using raise_barrier will sometimes deadlock. Using freeze_array
should not.
As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.
The deadlock was made particularly noticeable by commits
050b66152f87c7 (raid10) and
6b740b8d79252f13 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.
This patch probably won't apply directly to some early kernels and
will need to be applied by hand.
Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Alex Lyakas [Tue, 4 Jun 2013 17:42:21 +0000 (20:42 +0300)]
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Wed, 8 May 2013 23:48:30 +0000 (09:48 +1000)]
md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Linus Torvalds [Thu, 13 Jun 2013 00:18:29 +0000 (17:18 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking update from David Miller:
1) Fix dump iterator in nfnl_acct_dump() and ctnl_timeout_dump() to
dump all objects properly, from Pablo Neira Ayuso.
2) xt_TCPMSS must use the default MSS of 536 when no MSS TCP option is
present. Fix from Phil Oester.
3) qdisc_get_rtab() looks for an existing matching rate table and uses
that instead of creating a new one. However, it's key matching is
incomplete, it fails to check to make sure the ->data[] array is
identical too. Fix from Eric Dumazet.
4) ip_vs_dest_entry isn't fully initialized before copying back to
userspace, fix from Dan Carpenter.
5) Fix ubuf reference counting regression in vhost_net, from Jason
Wang.
6) When sock_diag dumps a socket filter back to userspace, we have to
translate it out of the kernel's internal representation first.
From Nicolas Dichtel.
7) davinci_mdio holds a spinlock while calling pm_runtime, which
sleeps. Fix from Sebastian Siewior.
8) Timeout check in sh_eth_check_reset is off by one, from Sergei
Shtylyov.
9) If sctp socket init fails, we can NULL deref during cleanup. Fix
from Daniel Borkmann.
10) netlink_mmap() does not propagate errors properly, from Patrick
McHardy.
11) Disable powersave and use minstrel by default in ath9k. From Sujith
Manoharan.
12) Fix a regression in that SOCK_ZEROCOPY is not set on tuntap sockets
which prevents vhost from being able to use zerocopy. From Jason
Wang.
13) Fix race between port lookup and TX path in team driver, from Jiri
Pirko.
14) Missing length checks in bluetooth L2CAP packet parsing, from Johan
Hedberg.
15) rtlwifi fails to connect to networking using any encryption method
other than WPA2. Fix from Larry Finger.
16) Fix iwlegacy build due to incorrect CONFIG_* ifdeffing for power
management stuff. From Yijing Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
b43: stop format string leaking into error msgs
ath9k: Use minstrel rate control by default
Revert "ath9k_hw: Update rx gain initval to improve rx sensitivity"
ath9k: Disable PowerSave by default
net: wireless: iwlegacy: fix build error for il_pm_ops
rtlwifi: Fix a false leak indication for PCI devices
wl12xx/wl18xx: scan all 5ghz channels
wl12xx: increase minimum singlerole firmware version required
wl12xx: fix minimum required firmware version for wl127x multirole
rtlwifi: rtl8192cu: Fix problem in connecting to WEP or WPA(1) networks
mwifiex: debugfs: Fix out of bounds array access
Bluetooth: Fix mgmt handling of power on failures
Bluetooth: Fix missing length checks for L2CAP signalling PDUs
Bluetooth: btmrvl: support Marvell Bluetooth device SD8897
Bluetooth: Fix checks for LE support on LE-only controllers
team: fix checks in team_get_first_port_txable_rcu()
team: move add to port list before port enablement
team: check return value of team_get_port_by_index_rcu() for NULL
tuntap: set SOCK_ZEROCOPY flag during open
netlink: fix error propagation in netlink_mmap()
...
Linus Torvalds [Thu, 13 Jun 2013 00:08:49 +0000 (17:08 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull input layer bugfix from Jiri Kosina:
"Memory leak regression fix from Benjamin Tissoires"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: multitouch: prevent memleak with the allocated name
Linus Torvalds [Wed, 12 Jun 2013 23:42:39 +0000 (16:42 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"Outside of bcache (which really isn't super big), these are all
few-liners. There are a few important fixes in here:
- Fix blk pm sleeping when holding the queue lock
- A small collection of bcache fixes that have been done and tested
since bcache was included in this merge window.
- A fix for a raid5 regression introduced with the bio changes.
- Two important fixes for mtip32xx, fixing an oops and potential data
corruption (or hang) due to wrong bio iteration on stacked devices."
* 'for-linus' of git://git.kernel.dk/linux-block:
scatterlist: sg_set_buf() argument must be in linear mapping
raid5: Initialize bi_vcnt
pktcdvd: silence static checker warning
block: remove refs to XD disks from documentation
blkpm: avoid sleep when holding queue lock
mtip32xx: Correctly handle bio->bi_idx != 0 conditions
mtip32xx: Fix NULL pointer dereference during module unload
bcache: Fix error handling in init code
bcache: clarify free/available/unused space
bcache: drop "select CLOSURES"
bcache: Fix incompatible pointer type warning
Linus Torvalds [Wed, 12 Jun 2013 23:29:53 +0000 (16:29 -0700)]
Merge branch 'akpm' (updates from Andrew Morton)
Merge misc fixes from Andrew Morton:
"Bunch of fixes and one little addition to math64.h"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
include/linux/math64.h: add div64_ul()
mm: memcontrol: fix lockless reclaim hierarchy iterator
frontswap: fix incorrect zeroing and allocation size for frontswap_map
kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
mm: migration: add migrate_entry_wait_huge()
ocfs2: add missing lockres put in dlm_mig_lockres_handler
mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
aio: fix io_destroy() regression by using call_rcu()
rtc-at91rm9200: use shadow IMR on at91sam9x5
rtc-at91rm9200: add shadow interrupt mask
rtc-at91rm9200: refactor interrupt-register handling
rtc-at91rm9200: add configuration support
rtc-at91rm9200: add match-table compile guard
fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
cciss: fix broken mutex usage in ioctl
audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
...
Alex Shi [Wed, 12 Jun 2013 21:05:10 +0000 (14:05 -0700)]
include/linux/math64.h: add div64_ul()
There is div64_long() to handle the s64/long division, but no mocro do
u64/ul division. It is necessary in some scenarios, so add this
function.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 12 Jun 2013 21:05:09 +0000 (14:05 -0700)]
mm: memcontrol: fix lockless reclaim hierarchy iterator
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.
The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.
The write side sets the position pointer first, then updates the
sequence count to "publish" the new position. Likewise, the read side
must read the sequence count first, then the position. If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:
writer: reader:
iter->position = position if iter->sequence == expected:
smp_wmb() smp_rmb()
iter->sequence = sequence position = iter->position
However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory. Fix this.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 12 Jun 2013 21:05:08 +0000 (14:05 -0700)]
frontswap: fix incorrect zeroing and allocation size for frontswap_map
The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG. And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.
This fixes incorrect zeroing and allocation size for frontswap_map. The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing. But the wrongly calculated allocation size
may cause the problem.
For 32bit systems, the allocation size of frontswap_map is about twice
as large as required size. For 64bit systems, the allocation size is
smaller than requeired if the number of bits is not a multiple of
BITS_PER_LONG.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 12 Jun 2013 21:05:07 +0000 (14:05 -0700)]
kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
audit_add_tree_rule() must set 'rule->tree = NULL;' firstly, to protect
the rule itself freed in kill_rules().
The reason is when it is killed, the 'rule' itself may have already
released, we should not access it. one example: we add a rule to an
inode, just at the same time the other task is deleting this inode.
The work flow for adding a rule:
audit_receive() -> (need audit_cmd_mutex lock)
audit_receive_skb() ->
audit_receive_msg() ->
audit_receive_filter() ->
audit_add_rule() ->
audit_add_tree_rule() -> (need audit_filter_mutex lock)
...
unlock audit_filter_mutex
get_tree()
...
iterate_mounts() -> (iterate all related inodes)
tag_mount() ->
tag_trunk() ->
create_trunk() -> (assume it is 1st rule)
fsnotify_add_mark() ->
fsnotify_add_inode_mark() -> (add mark to inode->i_fsnotify_marks)
...
get_tree(); (each inode will get one)
...
lock audit_filter_mutex
The work flow for deleting an inode:
__destroy_inode() ->
fsnotify_inode_delete() ->
__fsnotify_inode_delete() ->
fsnotify_clear_marks_by_inode() -> (get mark from inode->i_fsnotify_marks)
fsnotify_destroy_mark() ->
fsnotify_destroy_mark_locked() ->
audit_tree_freeing_mark() ->
evict_chunk() ->
...
tree->goner = 1
...
kill_rules() -> (assume current->audit_context == NULL)
call_rcu() -> (rule->tree != NULL)
audit_free_rule_rcu() ->
audit_free_rule()
...
audit_schedule_prune() -> (assume current->audit_context == NULL)
kthread_run() -> (need audit_cmd_mutex and audit_filter_mutex lock)
prune_one() -> (delete it from prue_list)
put_tree(); (match the original get_tree above)
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Jun 2013 21:05:04 +0000 (14:05 -0700)]
mm: migration: add migrate_entry_wait_huge()
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes. As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.
This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage. This patch introduces
migration_entry_wait_huge() to solve this.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@vger.kernel.org> [2.6.35+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xue jiufei [Wed, 12 Jun 2013 21:05:03 +0000 (14:05 -0700)]
ocfs2: add missing lockres put in dlm_mig_lockres_handler
dlm_mig_lockres_handler() is missing a dlm_lockres_put() on an error path.
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tomasz Stanislawski [Wed, 12 Jun 2013 21:05:02 +0000 (14:05 -0700)]
mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
The watermark check consists of two sub-checks. The first one is:
if (free_pages <= min + lowmem_reserve)
return false;
The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
This prevents the zone from being drained from pages available for
non-movable allocations.
The second check prevents the zone from getting too fragmented.
for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}
The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.
This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.
Laura said:
We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.
For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Tested-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org> [3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Wed, 12 Jun 2013 21:05:00 +0000 (14:05 -0700)]
drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
The "info.fill" array isn't initialized so it can leak uninitialized stack
information to user space.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Robin Holt <holt@sgi.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Wed, 12 Jun 2013 21:04:59 +0000 (14:04 -0700)]
aio: fix io_destroy() regression by using call_rcu()
There was a regression introduced by
36f5588905c1 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.
This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).
Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).
So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:57 +0000 (14:04 -0700)]
rtc-at91rm9200: use shadow IMR on at91sam9x5
Add support for the at91sam9x5-family which must use the shadow
interrupt mask due to a hardware issue (causing RTC_IMR to always be
zero).
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: Robert Nelson <Robert.Nelson@digikey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:56 +0000 (14:04 -0700)]
rtc-at91rm9200: add shadow interrupt mask
Add shadow interrupt-mask register which can be used on SoCs where the
actual hardware register is broken.
Note that some care needs to be taken to make sure the shadow mask
corresponds to the actual hardware state. The added overhead is not an
issue for the non-broken SoCs due to the relatively infrequent
interrupt-mask updates. We do, however, only use the shadow mask value
as a fall-back when it actually needed as there is still a theoretical
possibility that the mask is incorrect (see the code for details).
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: Robert Nelson <Robert.Nelson@digikey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:55 +0000 (14:04 -0700)]
rtc-at91rm9200: refactor interrupt-register handling
Add accessors for the interrupt register.
This will allow us to easily add a shadow interrupt-mask register to use
on SoCs where the interrupt-mask register cannot be used.
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: Robert Nelson <Robert.Nelson@digikey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:53 +0000 (14:04 -0700)]
rtc-at91rm9200: add configuration support
Add configuration support which can be used to implement SoC-specific
workarounds for broken hardware.
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: Robert Nelson <Robert.Nelson@digikey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:52 +0000 (14:04 -0700)]
rtc-at91rm9200: add match-table compile guard
The members of Atmel's at91sam9x5 family (9x5) have a broken RTC
interrupt mask register (AT91_RTC_IMR). It does not reflect enabled
interrupts but instead always returns zero.
The kernel's rtc-at91rm9200 driver handles the RTC for the 9x5 family.
Currently when the date/time is set, an interrupt is generated and this
driver neglects to handle the interrupt. The kernel complains about the
un-handled interrupt and disables it henceforth. This not only breaks
the RTC function, but since that interrupt is shared (Atmel's SYS
interrupt) then other things break as well (e.g. the debug port no
longer accepts characters).
Tested on the at91sam9g25. Bug confirmed by Atmel.
This patch (of 5):
Add missing match-table compile guard.
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Cc: Robert Nelson <Robert.Nelson@digikey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Goldwyn Rodrigues [Wed, 12 Jun 2013 21:04:51 +0000 (14:04 -0700)]
fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
While removing a non-empty directory, the kernel dumps a message:
(rmdir,21743,1):ocfs2_unlink:953 ERROR: status = -39
Suppress the error message from being printed in the dmesg so users
don't panic.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael Aquini [Wed, 12 Jun 2013 21:04:49 +0000 (14:04 -0700)]
swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.
This transient swap_map state is expected to be transitory, but the
actual placement of discard at scan_swap_map() inserts a wait for I/O
completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page() is
scheduled away at scan_swap_map(). This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tony Lindgren [Wed, 12 Jun 2013 21:04:48 +0000 (14:04 -0700)]
drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
When booted in legacy mode device_init_wakeup() gets called by
drivers/mfd/twl-core.c when the children are initialized. However, when
booted using device tree, the children are created with
of_platform_populate() instead add_children().
This means that the RTC driver will not have device_init_wakeup() set,
and we need to call it from the driver probe like RTC drivers typically
do.
Without this we cannot test PM wake-up events on omaps for cases where
there may not be any physical wake-up event.
Signed-off-by: Tony Lindgren <tony@atomide.com>
Reported-by: Kevin Hilman <khilman@linaro.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen M. Cameron [Wed, 12 Jun 2013 21:04:47 +0000 (14:04 -0700)]
cciss: fix broken mutex usage in ioctl
If a new logical drive is added and the CCISS_REGNEWD ioctl is invoked
(as is normal with the Array Configuration Utility) the process will
hang as below. It attempts to acquire the same mutex twice, once in
do_ioctl() and once in cciss_unlocked_open(). The BKL was recursive,
the mutex isn't.
Linux version 3.10.0-rc2 (scameron@localhost.localdomain) (gcc version 4.4.7
20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri May 24 14:32:12 CDT 2013
[...]
acu D
0000000000000001 0 3246 3191 0x00000080
Call Trace:
schedule+0x29/0x70
schedule_preempt_disabled+0xe/0x10
__mutex_lock_slowpath+0x17b/0x220
mutex_lock+0x2b/0x50
cciss_unlocked_open+0x2f/0x110 [cciss]
__blkdev_get+0xd3/0x470
blkdev_get+0x5c/0x1e0
register_disk+0x182/0x1a0
add_disk+0x17c/0x310
cciss_add_disk+0x13a/0x170 [cciss]
cciss_update_drive_info+0x39b/0x480 [cciss]
rebuild_lun_table+0x258/0x370 [cciss]
cciss_ioctl+0x34f/0x470 [cciss]
do_ioctl+0x49/0x70 [cciss]
__blkdev_driver_ioctl+0x28/0x30
blkdev_ioctl+0x200/0x7b0
block_ioctl+0x3c/0x40
do_vfs_ioctl+0x89/0x350
SyS_ioctl+0xa1/0xb0
system_call_fastpath+0x16/0x1b
This mutex usage was added into the ioctl path when the big kernel lock
was removed. As it turns out, these paths are all thread safe anyway
(or can easily be made so) and we don't want ioctl() to be single
threaded in any case.
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Miller <mike.miller@hp.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 12 Jun 2013 21:04:46 +0000 (14:04 -0700)]
audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
audit_log_start() does wait_for_auditd() in a loop until
audit_backlog_wait_time passes or audit_skb_queue has a room.
If signal_pending() is true this becomes a busy-wait loop, schedule() in
TASK_INTERRUPTIBLE won't block.
Thanks to Guy for fully investigating and explaining the problem.
(akpm: that'll cause the system to lock up on a non-preemptible
uniprocessor kernel)
(Guy: "Our customer was in fact running a uniprocessor machine, and they
reported a system hang.")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Guy Streeter <streeter@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Derek Basehore [Wed, 12 Jun 2013 21:04:45 +0000 (14:04 -0700)]
drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
During resume, we call hpet_rtc_timer_init after masking an irq bit in
hpet. This will cause the call to hpet_disable_rtc_channel to be undone
if RTC_AIE is the only bit not masked.
Allowing the cmos interrupt handler to run before resuming caused some
issues where the timer for the alarm was not removed. This would cause
other, later timers to not be cleared, so utilities such as hwclock
would time out when waiting for the update interrupt.
[akpm@linux-foundation.org: coding-style tweak]
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Osipenko [Wed, 12 Jun 2013 21:04:44 +0000 (14:04 -0700)]
drivers/rtc/rtc-tps6586x.c: device wakeup flags correction
Use device_init_wakeup() instead of device_set_wakeup_capable() and move
it before rtc dev registering. This fixes alarmtimer not registered
when tps6586x rtc is the only wakeup compatible rtc in the system.
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Cc: Laxman Dewangan <ldewangan@nvidia.com>
Cc: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Vagin [Wed, 12 Jun 2013 21:04:42 +0000 (14:04 -0700)]
memcg: don't initialize kmem-cache destroying work for root caches
struct memcg_cache_params has a union. Different parts of this union
are used for root and non-root caches. A part with destroying work is
used only for non-root caches.
BUG: unable to handle kernel paging request at
0000000fffffffe0
IP: kmem_cache_alloc+0x41/0x1f0
Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G D 3.10.0-rc1+ #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: kmem_cache_alloc+0x41/0x1f0
Call Trace:
getname_flags.part.34+0x30/0x140
getname+0x38/0x60
do_sys_open+0xc5/0x1e0
SyS_open+0x22/0x30
system_call_fastpath+0x16/0x1b
Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
RIP [<
ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizefan@huawei.com>
Cc: <stable@vger.kernel.org> [3.9.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaowei.Hu [Wed, 12 Jun 2013 21:04:41 +0000 (14:04 -0700)]
ocfs2: ocfs2_prep_new_orphaned_file() should return ret
If an error occurs, for example an EIO in __ocfs2_prepare_orphan_dir,
ocfs2_prep_new_orphaned_file will release the inode_ac, then when the
caller of ocfs2_prep_new_orphaned_file gets a 0 return, it will refer to
a NULL ocfs2_alloc_context struct in the following functions. A kernel
panic happens.
Signed-off-by: "Xiaowei.Hu" <xiaowei.hu@oracle.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 12 Jun 2013 21:04:40 +0000 (14:04 -0700)]
lib/mpi/mpicoder.c: looping issue, need stop when equal to zero, found by 'EXTRA_FLAGS=-W'.
For 'while' looping, need stop when 'nbytes == 0', or will cause issue.
('nbytes' is size_t which is always bigger or equal than zero).
The related warning: (with EXTRA_CFLAGS=-W)
lib/mpi/mpicoder.c:40:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 12 Jun 2013 21:04:39 +0000 (14:04 -0700)]
kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections. Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions. With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.
To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:
- /proc/kmsg allows:
- open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
single-reader interface (SYSLOG_ACTION_READ).
- everything, after an open.
- syslog syscall allows:
- anything, if CAP_SYSLOG.
- SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
dmesg_restrict==0.
- nothing else (EPERM).
The use-cases were:
- dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
- sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
destructive SYSLOG_ACTION_READs.
AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.
Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.
To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.
- /dev/kmsg allows:
- open if CAP_SYSLOG or dmesg_restrict==0
- reading/polling, after open
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192
[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Robin Holt [Wed, 12 Jun 2013 21:04:37 +0000 (14:04 -0700)]
reboot: rigrate shutdown/reboot to boot cpu
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").
The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.
This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.
Signed-off-by: Robin Holt <holt@sgi.com>
Tested-by: Shawn Guo <shawn.guo@linaro.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Srivatsa S. Bhat [Wed, 12 Jun 2013 21:04:36 +0000 (14:04 -0700)]
CPU hotplug: provide a generic helper to disable/enable CPU hotplug
There are instances in the kernel where we would like to disable CPU
hotplug (from sysfs) during some important operation. Today the freezer
code depends on this and the code to do it was kinda tailor-made for
that.
Restructure the code and make it generic enough to be useful for other
usecases too.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>