Linus Torvalds [Wed, 25 Sep 2013 21:53:55 +0000 (14:53 -0700)]
Merge branch 'merge' of git://git./linux/kernel/git/benh/powerpc
Pull powerpc fixes from Ben Herrenschmidt:
"Here are a few things for -rc2, this time it's all written by me so it
can only be perfect .... right ? :)
So we have the fix to call irq_enter/exit on the irq stack we've been
discussing, plus a cleanup on top to remove an unused (and broken)
stack limit tracking feature (well, make it 32-bit only in fact where
it is used and works properly).
Then we have two things that I wrote over the last couple of days and
made the executive decision to include just because I can (and I'm
sure you won't object .... right ?).
They fix a couple of annoying and long standing "issues":
- We had separate zImages for when booting via Open Firmware vs.
booting via a flat device-tree, while it's trivial to make one that
deals with both
- We wasted a ton of cycles spinning secondary CPUs uselessly at boot
instead of starting them when needed on pseries, thus contributing
significantly to global warming"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/pseries: Do not start secondaries in Open Firmware
powerpc/zImage: make the "OF" wrapper support ePAPR boot
powerpc: Remove ksp_limit on ppc64
powerpc/irq: Run softirqs off the top of the irq stack
Linus Torvalds [Wed, 25 Sep 2013 20:29:18 +0000 (13:29 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"An EFI fix and two reboot-quirk fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Fix apparent cut-n-paste mistake in Dell reboot workaround
x86/reboot: Add quirk to make Dell C6100 use reboot=pci automatically
x86, efi: Don't map Boot Services on i386
Linus Torvalds [Wed, 25 Sep 2013 20:28:45 +0000 (13:28 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Three small fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/balancing: Fix cfs_rq->task_h_load calculation
sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
Linus Torvalds [Wed, 25 Sep 2013 20:28:08 +0000 (13:28 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Assorted standalone fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add model number for Avoton Silvermont
perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page'
perf/x86/intel/uncore: Don't use smp_processor_id() in validate_group()
perf: Update ABI comment
tools lib lk: Uninclude linux/magic.h in debugfs.c
perf tools: Fix old GCC build error in trace-event-parse.c:parse_proc_kallsyms()
perf probe: Fix finder to find lines of given function
perf session: Check for SIGINT in more loops
perf tools: Fix compile with libelf without get_phdrnum
perf tools: Fix buildid cache handling of kallsyms with kcore
perf annotate: Fix objdump line parsing offset validation
perf tools: Fill in new definitions for madvise()/mmap() flags
perf tools: Sharpen the libaudit dependencies test
Mikael Pettersson [Wed, 25 Sep 2013 20:21:55 +0000 (22:21 +0200)]
update contact information for Mikael Pettersson
My old @it.uu.se email address is going away, so update relevant
files to point to my @gmail.com address instead. In sata_promise.c
just delete the address, people can get it from MAINTAINERS.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Jones [Wed, 25 Sep 2013 00:13:44 +0000 (20:13 -0400)]
x86/reboot: Fix apparent cut-n-paste mistake in Dell reboot workaround
This seems to have been copied from the Optiplex 990 entry
above, but somoene forgot to change the ident text.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Link: http://lkml.kernel.org/r/20130925001344.GA13554@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Benjamin Herrenschmidt [Wed, 25 Sep 2013 04:02:50 +0000 (14:02 +1000)]
powerpc/pseries: Do not start secondaries in Open Firmware
Starting secondary CPUs early on from Open Firmware and placing them
in a holding spin loop slows down the boot process significantly under
some hypervisors such as KVM.
This is also unnecessary when RTAS supports querying the CPU state
So let's not do it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Tue, 24 Sep 2013 06:10:38 +0000 (16:10 +1000)]
powerpc/zImage: make the "OF" wrapper support ePAPR boot
This makes the "OF" zImage wrapper (zImage.pseries, zImage.pmac,
zImage.maple) work if booted via a flat device-tree (ePAPR boot
mode), and thus potentially usable with kexec.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Tue, 24 Sep 2013 05:17:21 +0000 (15:17 +1000)]
powerpc: Remove ksp_limit on ppc64
We've been keeping that field in thread_struct for a while, it contains
the "limit" of the current stack pointer and is meant to be used for
detecting stack overflows.
It has a few problems however:
- First, it was never actually *used* on 64-bit. Set and updated but
not actually exploited
- When switching stack to/from irq and softirq stacks, it's update
is racy unless we hard disable interrupts, which is costly. This
is fine on 32-bit as we don't soft-disable there but not on 64-bit.
Thus rather than fixing 2 in order to implement 1 in some hypothetical
future, let's remove the code completely from 64-bit. In order to avoid
a clutter of ifdef's, we remove the updates from C code completely
during interrupt stack switching, and instead maintain it from the
asm helper that is used to do the stack switching in the first place.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Benjamin Herrenschmidt [Mon, 23 Sep 2013 04:29:11 +0000 (14:29 +1000)]
powerpc/irq: Run softirqs off the top of the irq stack
Nowadays, irq_exit() calls __do_softirq() pretty much directly
instead of calling do_softirq() which switches to the decicated
softirq stack.
This has lead to observed stack overflows on powerpc since we call
irq_enter() and irq_exit() outside of the scope that switches to
the irq stack.
This fixes it by moving the stack switching up a level, making
irq_enter() and irq_exit() run off the irq stack.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Paul E. McKenney [Wed, 25 Sep 2013 01:29:11 +0000 (18:29 -0700)]
mm: Place preemption point in do_mlockall() loop
There is a loop in do_mlockall() that lacks a preemption point, which
means that the following can happen on non-preemptible builds of the
kernel. Dave Jones reports:
"My fuzz tester keeps hitting this. Every instance shows the non-irq
stack came in from mlockall. I'm only seeing this on one box, but
that has more ram (8gb) than my other machines, which might explain
it.
INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0)
sending NMI to all CPUs:
NMI backtrace for cpu 3
CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
Call Trace:
lru_add_drain_all+0x15/0x20
SyS_mlockall+0xa5/0x1a0
tracesys+0xdd/0xe2"
This commit addresses this problem by inserting the required preemption
point.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 25 Sep 2013 00:00:35 +0000 (17:00 -0700)]
Merge branch 'akpm' (patches from Andrew Morton)
Merge fixes from Andrew Morton:
"Bunch of fixes.
And a reversion of mhocko's "Soft limit rework" patch series. This is
actually your fault for opening the merge window when I was off racing ;)
I didn't read the email thread before sending everything off.
Johannes Weiner raised significant issues:
http://www.spinics.net/lists/cgroups/msg08813.html
and we agreed to back it all out"
I clearly need to be more aware of Andrew's racing schedule.
* akpm:
MAINTAINERS: update mach-bcm related email address
checkpatch: make extern in .h prototypes quieter
cciss: fix info leak in cciss_ioctl32_passthru()
cpqarray: fix info leak in ida_locked_ioctl()
kernel/reboot.c: re-enable the function of variable reboot_default
audit: fix endless wait in audit_log_start()
revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code"
revert "memcg: get rid of soft-limit tree infrastructure"
revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim"
revert "memcg: enhance memcg iterator to support predicates"
revert "memcg: track children in soft limit excess to improve soft limit"
revert "memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything"
revert "memcg: track all children over limit in the root"
revert "memcg, vmscan: do not fall into reclaim-all pass too quickly"
fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume
watchdog: update watchdog_thresh properly
watchdog: update watchdog attributes atomically
Christian Daudt [Tue, 24 Sep 2013 22:27:47 +0000 (15:27 -0700)]
MAINTAINERS: update mach-bcm related email address
Update email address on mach-bcm + drivers for Broadcom mobile SoCs.
Signed-off-by: Christian Daudt <csd@broadcom.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Warren <swarren@wwwdotorg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 24 Sep 2013 22:27:46 +0000 (15:27 -0700)]
checkpatch: make extern in .h prototypes quieter
The use of extern in .h files is a bit contentious.
Make the warning be emitted only when --strict is used on the command
line.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Tue, 24 Sep 2013 22:27:45 +0000 (15:27 -0700)]
cciss: fix info leak in cciss_ioctl32_passthru()
The arg64 struct has a hole after ->buf_size which isn't cleared. Or if
any of the calls to copy_from_user() fail then that would cause an
information leak as well.
This was assigned CVE-2013-2147.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Tue, 24 Sep 2013 22:27:44 +0000 (15:27 -0700)]
cpqarray: fix info leak in ida_locked_ioctl()
The pciinfo struct has a two byte hole after ->dev_fn so stack
information could be leaked to the user.
This was assigned CVE-2013-2147.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chuansheng Liu [Tue, 24 Sep 2013 22:27:43 +0000 (15:27 -0700)]
kernel/reboot.c: re-enable the function of variable reboot_default
Commit
1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic
kernel") did some cleanup for reboot= command line, but it made the
reboot_default inoperative.
The default value of variable reboot_default should be 1, and if command
line reboot= is not set, system will use the default reboot mode.
[akpm@linux-foundation.org: fix comment layout]
Signed-off-by: Li Fei <fei.li@intel.com>
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Robin Holt <robinmholt@linux.com>
Cc: <stable@vger.kernel.org> [3.11.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Konstantin Khlebnikov [Tue, 24 Sep 2013 22:27:42 +0000 (15:27 -0700)]
audit: fix endless wait in audit_log_start()
After commit
829199197a43 ("kernel/audit.c: avoid negative sleep
durations") audit emitters will block forever if userspace daemon cannot
handle backlog.
After the timeout the waiting loop turns into busy loop and runs until
daemon dies or returns back to work. This is a minimal patch for that
bug.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Chuck Anderson <chuck.anderson@oracle.com>
Cc: Dan Duval <dan.duval@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:41 +0000 (15:27 -0700)]
revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code"
Revert commit
3b38722efd9f ("memcg, vmscan: integrate soft reclaim
tighter with zone shrinking code")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:40 +0000 (15:27 -0700)]
revert "memcg: get rid of soft-limit tree infrastructure"
Revert commit
e883110aad71 ("memcg: get rid of soft-limit tree
infrastructure")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:38 +0000 (15:27 -0700)]
revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim"
Revert commit
a5b7c87f9207 ("vmscan, memcg: do softlimit reclaim also
for targeted reclaim")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:37 +0000 (15:27 -0700)]
revert "memcg: enhance memcg iterator to support predicates"
Revert commit
de57780dc659 ("memcg: enhance memcg iterator to support
predicates")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:36 +0000 (15:27 -0700)]
revert "memcg: track children in soft limit excess to improve soft limit"
Revert commit
7d910c054be4 ("memcg: track children in soft limit excess
to improve soft limit")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:35 +0000 (15:27 -0700)]
revert "memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything"
Revert commit
e839b6a1c8d0 ("memcg, vmscan: do not attempt soft limit
reclaim if it would not scan anything")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:34 +0000 (15:27 -0700)]
revert "memcg: track all children over limit in the root"
Revert commit
1be171d60bdd ("memcg: track all children over limit in the
root")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 24 Sep 2013 22:27:33 +0000 (15:27 -0700)]
revert "memcg, vmscan: do not fall into reclaim-all pass too quickly"
Revert commit
e975de998b96 ("memcg, vmscan: do not fall into reclaim-all
pass too quickly")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Goldwyn Rodrigues [Tue, 24 Sep 2013 22:27:32 +0000 (15:27 -0700)]
fs/ocfs2/super.c: use a bigger nodestr in ocfs2_dismount_volume
While printing 32-bit node numbers, an 8-byte string is not enough.
Increase the size of the string to 12 chars.
This got left out in commit
49fa8140e487 ("fs/ocfs2/super.c: Use bigger
nodestr to accomodate 32-bit node numbers").
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Tue, 24 Sep 2013 22:27:30 +0000 (15:27 -0700)]
watchdog: update watchdog_thresh properly
watchdog_tresh controls how often nmi perf event counter checks per-cpu
hrtimer_interrupts counter and blows up if the counter hasn't changed
since the last check. The counter is updated by per-cpu
watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh
period which guarantees that hrtimer is scheduled 2 times per the main
period. Both hrtimer and perf event are started together when the
watchdog is enabled.
So far so good. But...
But what happens when watchdog_thresh is updated from sysctl handler?
proc_dowatchdog will set a new sampling period and hrtimer callback
(watchdog_timer_fn) will use the new value in the next round. The
problem, however, is that nobody tells the perf event that the sampling
period has changed so it is ticking with the period configured when it
has been set up.
This might result in an ear ripping dissonance between perf and hrtimer
parts if the watchdog_thresh is increased. And even worse it might lead
to KABOOM if the watchdog is configured to panic on such a spurious
lockup.
This patch fixes the issue by updating both nmi perf even counter and
hrtimers if the threshold value has changed.
The nmi one is disabled and then reinitialized from scratch. This has
an unpleasant side effect that the allocation of the new event might
fail theoretically so the hard lockup detector would be disabled for
such cpus. On the other hand such a memory allocation failure is very
unlikely because the original event is deallocated right before.
It would be much nicer if we just changed perf event period but there
doesn't seem to be any API to do that right now. It is also unfortunate
that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
cannot use on_each_cpu() and do the same thing from the per-cpu context.
The update from the current CPU should be safe because
perf_event_disable removes the event atomically before it clears the
per-cpu watchdog_ev so it cannot change anything under running handler
feet.
The hrtimer is simply restarted (thanks to Don Zickus who has pointed
this out) if it is queued because we cannot rely it will fire&adopt to
the new sampling period before a new nmi event triggers (when the
treshold is decreased).
[akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Tue, 24 Sep 2013 22:27:29 +0000 (15:27 -0700)]
watchdog: update watchdog attributes atomically
proc_dowatchdog doesn't synchronize multiple callers which might lead to
confusion when two parallel callers might confuse watchdog_enable_all_cpus
resp watchdog_disable_all_cpus (eg watchdog gets enabled even if
watchdog_thresh was set to 0 already).
This patch adds a local mutex which synchronizes callers to the sysctl
handler.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 24 Sep 2013 21:42:03 +0000 (14:42 -0700)]
Merge branch 'bcache' (bcache fixes from Kent Overstreet)
Merge bcache fixes from Kent Overstreet:
"There's fixes for _three_ different data corruption bugs, all of which
were found by users hitting them in the wild.
The first one isn't bcache specific - in 3.11 bcache was switched to
the bio_copy_data in fs/bio.c, and that's when the bug in that code
was discovered, but it's also used by raid1 and pktcdvd. (That was my
code too, so the bug's doubly embarassing given that it was or
should've been just a cut and paste from bcache code. Dunno what
happened there).
Most of these (all the non data corruption bugs, actually) were ready
before the merge window and have been sitting in Jens' tree, but I
don't know what's been up with him lately..."
* emailed patches from Kent Overstreet <kmo@daterainc.com>:
bcache: Fix flushes in writeback mode
bcache: Fix for handling overlapping extents when reading in a btree node
bcache: Fix a shrinker deadlock
bcache: Fix a dumb CPU spinning bug in writeback
bcache: Fix a flush/fua performance bug
bcache: Fix a writeback performance regression
bcache: Correct printf()-style format length modifier
bcache: Fix for when no journal entries are found
bcache: Strip endline when writing the label through sysfs
bcache: Fix a dumb journal discard bug
block: Fix bio_copy_data()
Kent Overstreet [Tue, 24 Sep 2013 06:17:36 +0000 (23:17 -0700)]
bcache: Fix flushes in writeback mode
In writeback mode, when we get a cache flush we need to make sure we
issue a flush to the backing device.
The code for sending down an extra flush was wrong - by cloning the bio
we were probably getting flags that didn't make sense for a bare flush,
and also the old code was firing for FUA bios, for which we don't need
to send a flush to the backing device.
This was causing data corruption somehow - the mechanism was never
determined, but this patch fixes it for the users that were seeing it.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:35 +0000 (23:17 -0700)]
bcache: Fix for handling overlapping extents when reading in a btree node
btree_sort_fixup() was overly clever, because it was trying to avoid
pulling a key off the btree iterator in more than one place.
This led to a really obscure bug where we'd break early from the loop in
btree_sort_fixup() if the current key overlapped with keys in more than
one older set, and the next key it overlapped with was zero size.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:34 +0000 (23:17 -0700)]
bcache: Fix a shrinker deadlock
GFP_NOIO means we could be getting called recursively - mca_alloc() ->
mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then.
Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:33 +0000 (23:17 -0700)]
bcache: Fix a dumb CPU spinning bug in writeback
schedule_timeout() != schedule_timeout_uninterruptible()
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:32 +0000 (23:17 -0700)]
bcache: Fix a flush/fua performance bug
bch_journal_meta() was missing the flush to make the journal write
actually go down (instead of waiting up to journal_delay_ms)...
Whoops
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:31 +0000 (23:17 -0700)]
bcache: Fix a writeback performance regression
Background writeback works by scanning the btree for dirty data and
adding those keys into a fixed size buffer, then for each dirty key in
the keybuf writing it to the backing device.
When read_dirty() finishes and it's time to scan for more dirty data, we
need to wait for the outstanding writeback IO to finish - they still
take up slots in the keybuf (so that foreground writes can check for
them to avoid races) - without that wait, we'll continually rescan when
we'll be able to add at most a key or two to the keybuf, and that takes
locks that starves foreground IO. Doh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Geert Uytterhoeven [Tue, 24 Sep 2013 06:17:30 +0000 (23:17 -0700)]
bcache: Correct printf()-style format length modifier
Fix
drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:29 +0000 (23:17 -0700)]
bcache: Fix for when no journal entries are found
The journal replay code didn't handle this case, causing it to go into
an infinite loop...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gabriel de Perthuis [Tue, 24 Sep 2013 06:17:28 +0000 (23:17 -0700)]
bcache: Strip endline when writing the label through sysfs
sysfs attributes with unusual characters have crappy failure modes
in Squeeze (udev 164); later versions of udev are unaffected.
This should make these characters more unusual.
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:27 +0000 (23:17 -0700)]
bcache: Fix a dumb journal discard bug
That switch statement was obviously wrong, leading to some sort of weird
spinning on rare occasion with discards enabled...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Tue, 24 Sep 2013 06:17:26 +0000 (23:17 -0700)]
block: Fix bio_copy_data()
The memcpy() in bio_copy_data() was using the wrong offset vars, leading
to data corruption in weird unusual setups.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Russell King [Tue, 24 Sep 2013 09:37:13 +0000 (10:37 +0100)]
drm/i2c: tda998x: fix audio muting
Fix a bug that was introduced in commit
c4c11dd160a8 ("drm/i2c: tda998x:
add video and audio input configuration") when Sebastian cleaned up my
original patch. Without this being fixed, audio is muted when the
display is turned off, never to be re-enabled.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Darren Etheridge <detheridge@ti.com>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Tue, 24 Sep 2013 00:04:45 +0000 (17:04 -0700)]
ipc: fix race with LSMs
Currently, IPC mechanisms do security and auditing related checks under
RCU. However, since security modules can free the security structure,
for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
race if the structure is freed before other tasks are done with it,
creating a use-after-free condition. Manfred illustrates this nicely,
for instance with shared mem and selinux:
-> do_shmat calls rcu_read_lock()
-> do_shmat calls shm_object_check().
Checks that the object is still valid - but doesn't acquire any locks.
Then it returns.
-> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
-> selinux_shm_shmat calls ipc_has_perm()
-> ipc_has_perm accesses ipc_perms->security
shm_close()
-> shm_close acquires rw_mutex & shm_lock
-> shm_close calls shm_destroy
-> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
-> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
-> ipc_free_security calls kfree(ipc_perms->security)
This patch delays the freeing of the security structures after all RCU
readers are done. Furthermore it aligns the security life cycle with
that of the rest of IPC - freeing them based on the reference counter.
For situations where we need not free security, the current behavior is
kept. Linus states:
"... the old behavior was suspect for another reason too: having the
security blob go away from under a user sounds like it could cause
various other problems anyway, so I think the old code was at least
_prone_ to bugs even if it didn't have catastrophic behavior."
I have tested this patch with IPC testcases from LTP on both my
quad-core laptop and on a 64 core NUMA server. In both cases selinux is
enabled, and tests pass for both voluntary and forced preemption models.
While the mentioned races are theoretical (at least no one as reported
them), I wanted to make sure that this new logic doesn't break anything
we weren't aware of.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 23 Sep 2013 22:41:09 +0000 (15:41 -0700)]
Linux 3.12-rc2
Linus Torvalds [Mon, 23 Sep 2013 19:53:07 +0000 (12:53 -0700)]
Merge tag 'staging-3.12-rc2' of git://git./linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are a number of small staging tree and iio driver fixes. Nothing
major, just lots of little things"
* tag 'staging-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (34 commits)
iio:buffer_cb: Add missing iio_buffer_init()
iio: Prevent race between IIO chardev opening and IIO device free
iio: fix: Keep a reference to the IIO device for open file descriptors
iio: Stop sampling when the device is removed
iio: Fix crash when scan_bytes is computed with active_scan_mask == NULL
iio: Fix mcp4725 dev-to-indio_dev conversion in suspend/resume
iio: Fix bma180 dev-to-indio_dev conversion in suspend/resume
iio: Fix tmp006 dev-to-indio_dev conversion in suspend/resume
iio: iio_device_add_event_sysfs() bugfix
staging: iio:
ade7854-spi: Fix return value
staging:iio:hmc5843: Fix measurement conversion
iio: isl29018: Fix uninitialized value
staging:iio:dummy fix kfifo_buf kconfig dependency issue if kfifo modular and buffer enabled for built in dummy driver.
iio: at91: fix adc_clk overflow
staging: line6: add bounds check in snd_toneport_source_put()
Staging: comedi: Fix dependencies for drivers misclassified as PCI
staging: r8188eu: Adjust RX gain
staging: r8188eu: Fix smatch warning in core/rtw_ieee80211.
staging: r8188eu: Fix smatch error in core/rtw_mlme_ext.c
staging: r8188eu: Fix Smatch off-by-one warning in hal/rtl8188e_hal_init.c
...
Linus Torvalds [Mon, 23 Sep 2013 19:52:35 +0000 (12:52 -0700)]
Merge tag 'usb-3.12-rc2' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of small USB fixes for 3.12-rc2.
One is a revert of a EHCI change that isn't quite ready for 3.12.
Others are minor things, gadget fixes, Kconfig fixes, and some quirks
and documentation updates.
All have been in linux-next for a bit"
* tag 'usb-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: pl2303: distinguish between original and cloned HX chips
USB: Faraday fotg210: fix email addresses
USB: fix typo in usb serial simple driver Kconfig
Revert "USB: EHCI: support running URB giveback in tasklet context"
usb: s3c-hsotg: do not disconnect gadget when receiving ErlySusp intr
usb: s3c-hsotg: fix unregistration function
usb: gadget: f_mass_storage: reset endpoint driver data when disabled
usb: host: fsl-mph-dr-of: Staticize local symbols
usb: gadget: f_eem: Staticize eem_alloc
usb: gadget: f_ecm: Staticize ecm_alloc
usb: phy: omap-usb3: Fix return value
usb: dwc3: gadget: avoid memory leak when failing to allocate all eps
usb: dwc3: remove extcon dependency
usb: gadget: add '__ref' for rndis_config_register() and cdc_config_register()
usb: dwc3: pci: add support for BayTrail
usb: gadget: cdc2: fix conversion to new interface of f_ecm
usb: gadget: fix a bug and a WARN_ON in dummy-hcd
usb: gadget: mv_u3d_core: fix violation of locking discipline in mv_u3d_ep_disable()
Masoud Sharbiani [Fri, 20 Sep 2013 22:59:07 +0000 (15:59 -0700)]
x86/reboot: Add quirk to make Dell C6100 use reboot=pci automatically
Dell PowerEdge C6100 machines fail to completely reboot about 20% of the time.
Signed-off-by: Masoud Sharbiani <msharbiani@twitter.com>
Signed-off-by: Vinson Lee <vlee@twitter.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1379717947-18042-1-git-send-email-vlee@freedesktop.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yan, Zheng [Sun, 22 Sep 2013 08:19:13 +0000 (16:19 +0800)]
perf/x86/intel: Add model number for Avoton Silvermont
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Cc: a.p.zijlstra@chello.nl
Cc: eranian@google.com
Cc: ak@linux.intel.com
Link: http://lkml.kernel.org/r/1379837953-17755-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Mon, 23 Sep 2013 02:51:49 +0000 (19:51 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
- some small fixes for msm and exynos
- a regression revert affecting nouveau users with old userspace
- intel pageflip deadlock and gpu hang fixes, hsw modesetting hangs
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (22 commits)
Revert "drm: mark context support as a legacy subsystem"
drm/i915: Don't enable the cursor on a disable pipe
drm/i915: do not update cursor in crtc mode set
drm/exynos: fix return value check in lowlevel_buffer_allocate()
drm/exynos: Fix address space warnings in exynos_drm_fbdev.c
drm/exynos: Fix address space warning in exynos_drm_buf.c
drm/exynos: Remove redundant OF dependency
drm/msm: drop unnecessary set_need_resched()
drm/i915: kill set_need_resched
drm/msm: fix potential NULL pointer dereference
drm/i915/dvo: set crtc timings again for panel fixed modes
drm/i915/sdvo: Robustify the dtd<->drm_mode conversions
drm/msm: workaround for missing irq
drm/msm: return -EBUSY if bo still active
drm/msm: fix return value check in ERR_PTR()
drm/msm: fix cmdstream size check
drm/msm: hangcheck harder
drm/msm: handle read vs write fences
drm/i915/sdvo: Fully translate sync flags in the dtd->mode conversion
drm/i915: Use proper print format for debug prints
...
Linus Torvalds [Sun, 22 Sep 2013 22:00:11 +0000 (15:00 -0700)]
Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
"After merge window, no new stuff this time only a collection of neatly
confined and simple fixes"
* 'for-3.12/core' of git://git.kernel.dk/linux-block:
cfq: explicitly use 64bit divide operation for 64bit arguments
block: Add nr_bios to block_rq_remap tracepoint
If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
bio-integrity: Fix use of bs->bio_integrity_pool after free
blkcg: relocate root_blkg setting and clearing
block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
block: trace all devices plug operation
Linus Torvalds [Sun, 22 Sep 2013 21:58:49 +0000 (14:58 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are mostly bug fixes and a two small performance fixes. The
most important of the bunch are Josef's fix for a snapshotting
regression and Mark's update to fix compile problems on arm"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: create the uuid tree on remount rw
btrfs: change extent-same to copy entire argument struct
Btrfs: dir_inode_operations should use btrfs_update_time also
btrfs: Add btrfs: prefix to kernel log output
btrfs: refuse to remount read-write after abort
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
Btrfs: don't leak transaction in btrfs_sync_file()
Btrfs: add the missing mutex unlock in write_all_supers()
Btrfs: iput inode on allocation failure
Btrfs: remove space_info->reservation_progress
Btrfs: kill delay_iput arg to the wait_ordered functions
Btrfs: fix worst case calculator for space usage
Revert "Btrfs: rework the overcommit logic to be based on the total size"
Btrfs: improve replacing nocow extents
Btrfs: drop dir i_size when adding new names on replay
Btrfs: replay dir_index items before other items
Btrfs: check roots last log commit when checking if an inode has been logged
Btrfs: actually log directory we are fsync()'ing
Btrfs: actually limit the size of delalloc range
Btrfs: allocate the free space by the existed max extent size when ENOSPC
...
Anatol Pomozov [Sun, 22 Sep 2013 18:43:47 +0000 (12:43 -0600)]
cfq: explicitly use 64bit divide operation for 64bit arguments
'samples' is 64bit operant, but do_div() second parameter is 32.
do_div silently truncates high 32 bits and calculated result
is invalid.
In case if low 32bit of 'samples' are zeros then do_div() produces
kernel crash.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Greg Kroah-Hartman [Sat, 21 Sep 2013 23:45:36 +0000 (16:45 -0700)]
Merge tag 'iio-fixes-for-3.12a' of git://git./linux/kernel/git/jic23/iio into staging-linus
Jonathan writes:
First round of IIO fixes for 3.12
A series of wrong 'struct dev' assumptions in suspend/resume callbacks
following on from this issue being identified in a new driver review.
One to watch out for in future.
A number of driver specific fixes
1) at91 - fix a overflow in clock rate computation
2) dummy - Kconfig dependency issue
3) isl29018 - uninitialized value
4) hmc5843 - measurement conversion bug introduced by recent cleanup.
5)
ade7854-spi - wrong return value.
Some IIO core fixes
1) Wrong value picked up for event code creation for a modified channel
2) A null dereference on failure to initialize a buffer after no buffer has
been in use, when using the available_scan_masks approach.
3) Sampling not stopped when a device is removed. Effects forced removal
such as hot unplugging.
4) Prevent device going away if a chrdev is still open in userspace.
5) Prevent race on chardev opening and device being freed.
6) Add a missing iio_buffer_init in the call back buffer.
These last few are the first part of a set from Lars-Peter Clausen who
has been taking a closer look at our removal paths and buffer handling
than anyone has for quite some time.
Linus Torvalds [Sat, 21 Sep 2013 22:59:41 +0000 (15:59 -0700)]
Merge tag 'nfs-for-3.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfix from Trond Myklebust:
"Fix a regression due to incorrect sharing of gss auth caches"
* tag 'nfs-for-3.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
RPCSEC_GSS: fix crash on destroying gss auth
Jun'ichi Nomura [Sat, 21 Sep 2013 19:57:47 +0000 (13:57 -0600)]
block: Add nr_bios to block_rq_remap tracepoint
Adding the number of bios in a remapped request to 'block_rq_remap'
tracepoint.
Request remapper clones bios in a request to track the completion
status of each bio. So the number of bios can be useful information
for investigation.
Related discussions:
http://www.redhat.com/archives/dm-devel/2013-August/msg00084.html
http://www.redhat.com/archives/dm-devel/2013-September/msg00024.html
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Sat, 21 Sep 2013 02:33:20 +0000 (22:33 -0400)]
Btrfs: create the uuid tree on remount rw
Users have been complaining of the uuid tree stuff warning that there is no uuid
root when trying to do snapshot operations. This is because if you mount -o ro
we will not create the uuid tree. But then if you mount -o rw,remount we will
still not create it and then any subsequent snapshot/subvol operations you try
to do will fail gloriously. Fix this by creating the uuid_root on remount rw if
it was not already there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mark Fasheh [Tue, 17 Sep 2013 22:43:54 +0000 (15:43 -0700)]
btrfs: change extent-same to copy entire argument struct
btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
back to it's argument struct. Unfortunately, not all architectures provide
__put_user_unaligned(), so compiles break on them if btrfs is selected.
Instead, just copy the whole struct in / out at the start and end of
operations, respectively.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Guangyu Sun [Mon, 16 Sep 2013 17:42:03 +0000 (10:42 -0700)]
Btrfs: dir_inode_operations should use btrfs_update_time also
Commit
2bc5565286121d2a77ccd728eb3484dff2035b58 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.
To reproduce:
# mkfs.btrfs -f /dev/dm-3
(...)
# mount /dev/dm-3 /mnt
# btrfs subvol create /mnt/sub
Create subvolume '/mnt/sub'
# mkdir /mnt/sub/dir
# echo "abc" > /mnt/sub/dir/file
# btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:21:49.
389157126 -0400
Modify: 2013-09-11 07:22:02.
330156079 -0400
Change: 2013-09-11 07:22:02.
330156079 -0400
# ls /mnt/rosnap/dir
file
# stat /mnt/rosnap/dir
File: `/mnt/rosnap/dir'
Size: 8 Blocks: 0 IO Block: 4096 directory
Device: 16h/22d Inode: 257 Links: 1
Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-09-11 07:22:56.
797151670 -0400
Modify: 2013-09-11 07:22:02.
330156079 -0400
Change: 2013-09-11 07:22:02.
330156079 -0400
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Frank Holton [Fri, 13 Sep 2013 15:46:50 +0000 (11:46 -0400)]
btrfs: Add btrfs: prefix to kernel log output
The kernel log entries for device label %s and device fsid %pU
are missing the btrfs: prefix. Add those here.
Signed-off-by: Frank Holton <fholton@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
David Sterba [Fri, 13 Sep 2013 15:41:20 +0000 (17:41 +0200)]
btrfs: refuse to remount read-write after abort
It's still possible to flip the filesystem into RW mode after it's
remounted RO due to an abort. There are lots of places that check for
the superblock error bit and will not write data, but we should not let
the filesystem appear read-write.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
chandan [Fri, 13 Sep 2013 14:04:10 +0000 (19:34 +0530)]
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
subvolume by passing a subvolume id of 0.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Filipe David Borba Manana [Wed, 11 Sep 2013 19:36:44 +0000 (20:36 +0100)]
Btrfs: don't leak transaction in btrfs_sync_file()
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Stefan Behrens [Wed, 11 Sep 2013 07:59:22 +0000 (09:59 +0200)]
Btrfs: add the missing mutex unlock in write_all_supers()
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.
The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:
fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Tue, 17 Sep 2013 15:25:44 +0000 (11:25 -0400)]
Btrfs: iput inode on allocation failure
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Tue, 17 Sep 2013 15:02:25 +0000 (11:02 -0400)]
Btrfs: remove space_info->reservation_progress
This isn't used for anything anymore, just remove it.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Tue, 17 Sep 2013 14:55:51 +0000 (10:55 -0400)]
Btrfs: kill delay_iput arg to the wait_ordered functions
This is a left over of how we used to wait for ordered extents, which was to
grab the inode and then run filemap flush on it. However if we have an ordered
extent then we already are holding a ref on the inode, and we just use
btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
the inode to start work on the ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Tue, 17 Sep 2013 14:50:06 +0000 (10:50 -0400)]
Btrfs: fix worst case calculator for space usage
Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right. If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Tue, 17 Sep 2013 14:48:00 +0000 (10:48 -0400)]
Revert "Btrfs: rework the overcommit logic to be based on the total size"
This reverts commit
70afa3998c9baed4186df38988246de1abdab56d. It is causing
performance issues and wasn't actually correct. There were problems with the
way we flushed delalloc and that was the real cause of the early enospc.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Thu, 12 Sep 2013 20:58:28 +0000 (16:58 -0400)]
Btrfs: improve replacing nocow extents
Various people have hit a deadlock when running btrfs/011. This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent. The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.
Further we can't just lock the inode and assume we are good to go. We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want. If it
doesn't we don't have to copy it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Wed, 11 Sep 2013 18:17:00 +0000 (14:17 -0400)]
Btrfs: drop dir i_size when adding new names on replay
So if we have dir_index items in the log that means we also have the inode item
as well, which means that the inode's i_size is correct. However when we
process dir_index'es we call btrfs_add_link() which will increase the
directory's i_size for the new entry. To fix this we need to just set the dir
items i_size to 0, and then as we find dir_index items we adjust the i_size.
btrfs_add_link() will do it for new entries, and if the entry already exists we
can just add the name_len to the i_size ourselves. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Wed, 11 Sep 2013 15:57:23 +0000 (11:57 -0400)]
Btrfs: replay dir_index items before other items
A user reported a bug where his log would not replay because he was getting
-EEXIST back. This was because he had a file moved into a directory that was
logged. What happens is the file had a lower inode number, and so it is
processed first when replaying the log, and so we add the inode ref in for the
directory it was moved to. But then we process the directories DIR_INDEX item
and try to add the inode ref for that inode and it fails because we already
added it when we replayed the inode. To solve this problem we need to just
process any DIR_INDEX items we have in the log first so this all is taken care
of, and then we can replay the rest of the items. With this patch my reproducer
can remount the file system properly instead of erroring out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Wed, 11 Sep 2013 13:55:42 +0000 (09:55 -0400)]
Btrfs: check roots last log commit when checking if an inode has been logged
Liu introduced a local copy of the last log commit for an inode to make sure we
actually log an inode even if a log commit has already taken place. In order to
make sure we didn't relog the same inode multiple times he set this local copy
to the current trans when we log the inode, because usually we log the inode and
then sync the log. The exception to this is during rename, we will relog an
inode if the name changed and it is already in the log. The problem with this
is then we go to sync the inode, and our check to see if the inode has already
been logged is tripped and we don't sync the log. To fix this we need to _also_
check against the roots last log commit, because it could be less than what is
in our local copy of the log commit. This fixes a bug where we rename a file
into a directory and then fsync the directory and then on remount the directory
is no longer there. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Wed, 11 Sep 2013 13:36:30 +0000 (09:36 -0400)]
Btrfs: actually log directory we are fsync()'ing
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there. That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged. Fix the logic to only do this to parent
directories. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Fri, 30 Aug 2013 18:38:49 +0000 (14:38 -0400)]
Btrfs: actually limit the size of delalloc range
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Miao Xie [Mon, 9 Sep 2013 05:19:42 +0000 (13:19 +0800)]
Btrfs: allocate the free space by the existed max extent size when ENOSPC
By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.
According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
less than we need.
Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
the free space in a bitmap.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
David Sterba [Tue, 3 Sep 2013 16:28:57 +0000 (18:28 +0200)]
btrfs: add lockdep and tracing annotations for uuid tree
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Stefan Behrens [Tue, 3 Sep 2013 13:25:27 +0000 (15:25 +0200)]
btrfs: show compiled-in config features at module load time
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
(This commit message is a copy of David Sterba's commit message when
he introduced btrfs_print_info()).
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Filipe David Borba Manana [Mon, 2 Sep 2013 11:19:13 +0000 (12:19 +0100)]
Btrfs: more efficient inode tree replace operation
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Ilya Dryomov [Sun, 1 Sep 2013 15:56:44 +0000 (18:56 +0300)]
Btrfs: do not add replace target to the alloc_list
If replace was suspended by the umount, replace target device is added
to the fs_devices->alloc_list during a later mount. This is obviously
wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that,
but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
*after* everything is opened and fs_devices lists are populated. Fix
this by checking the devid instead: for replace targets it's always
equal to BTRFS_DEV_REPLACE_DEVID.
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Fri, 30 Aug 2013 19:09:51 +0000 (15:09 -0400)]
Btrfs: fixup error handling in btrfs_reloc_cow
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent. So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Chris Mason [Sat, 21 Sep 2013 14:44:55 +0000 (10:44 -0400)]
Merge tag 'v3.11' into for-linus
Linux 3.11
Lars-Peter Clausen [Wed, 18 Sep 2013 20:02:00 +0000 (21:02 +0100)]
iio:buffer_cb: Add missing iio_buffer_init()
Make sure to properly initialize the IIO buffer data structure.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Lars-Peter Clausen [Wed, 18 Sep 2013 20:02:00 +0000 (21:02 +0100)]
iio: Prevent race between IIO chardev opening and IIO device free
Set the IIO device as the parent for the character device
We need to make sure that the IIO device is not freed while the character device
exists, otherwise the freeing of the IIO device might race against the file open
callback. Do this by setting the character device's parent to the IIO device,
this will cause the character device to grab a reference to the IIO device and
only release it once the character device itself has been removed.
Also move the registration of the character device before the registration of
the IIO device to avoid the (rather theoretical case) that the IIO device is
already freed again before we can add the character device and grab a reference
to the IIO device.
We also need to move the call to cdev_del() from iio_dev_release() to
iio_device_unregister() (where it should have been in the first place anyway) to
avoid a reference cycle. As iio_dev_release() is only called once all reference
are dropped, but the character device holds a reference to the IIO device.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Lars-Peter Clausen [Wed, 18 Sep 2013 20:02:00 +0000 (21:02 +0100)]
iio: fix: Keep a reference to the IIO device for open file descriptors
Make sure that the IIO device is not freed while we still have file descriptors
for it.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Lars-Peter Clausen [Wed, 18 Sep 2013 20:02:00 +0000 (21:02 +0100)]
iio: Stop sampling when the device is removed
Make sure to stop sampling when the device is removed, otherwise it will
continue to sample forever.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Peter Meerwald [Wed, 18 Sep 2013 21:10:00 +0000 (22:10 +0100)]
iio: Fix crash when scan_bytes is computed with active_scan_mask == NULL
if device has available_scan_masks set and the buffer is enabled without
any scan_elements enabled, in a NULL pointer is dereferenced in iio_compute_scan_bytes()
[ 18.993713] Unable to handle kernel NULL pointer dereference at virtual address
00000000
[ 19.002593] pgd =
debd4000
[ 19.005432] [
00000000] *pgd=
9ebc0831, *pte=
00000000, *ppte=
00000000
[ 19.012329] Internal error: Oops: 17 [#1] PREEMPT ARM
[ 19.017639] Modules linked in:
[ 19.020843] CPU: 0 Not tainted (
3.9.11-00036-g75c888a-dirty #207)
[ 19.027587] PC is at _find_first_bit_le+0xc/0x2c
[ 19.032440] LR is at iio_compute_scan_bytes+0x2c/0xf4
[ 19.037719] pc : [<
c021dc60>] lr : [<
c03198d0>] psr:
200d0013
[ 19.037719] sp :
debd9ed0 ip :
00000000 fp :
000802bc
[ 19.049713] r10:
00000000 r9 :
00000000 r8 :
deb67250
[ 19.055206] r7 :
00000000 r6 :
00000000 r5 :
00000000 r4 :
deb67000
[ 19.062011] r3 :
de96ec00 r2 :
00000000 r1 :
00000004 r0 :
00000000
[ 19.068847] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 19.076324] Control:
10c5387d Table:
9ebd4019 DAC:
00000015
problem is the rollback code in iio_update_buffers(), old_mask may be NULL (e.g. on first
call)
I'm not too confident about the fix; works for me...
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Reviewed-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Peter Meerwald [Wed, 18 Sep 2013 21:47:00 +0000 (22:47 +0100)]
iio: Fix mcp4725 dev-to-indio_dev conversion in suspend/resume
dev_to_iio_dev() is a false friend
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Peter Meerwald [Wed, 18 Sep 2013 21:47:00 +0000 (22:47 +0100)]
iio: Fix bma180 dev-to-indio_dev conversion in suspend/resume
dev_to_iio_dev() is a false friend
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Cc: Oleksandr Kravchenko <o.v.kravchenko@globallogic.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Peter Meerwald [Wed, 18 Sep 2013 21:47:00 +0000 (22:47 +0100)]
iio: Fix tmp006 dev-to-indio_dev conversion in suspend/resume
dev_to_iio_dev() is a false friend
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Linus Torvalds [Fri, 20 Sep 2013 22:17:14 +0000 (15:17 -0700)]
Merge tag 'pm+acpi-3.12-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
1) Four fixes for cpufreq regressions introduced by the changes that
removed Device Tree parsing for CPU device nodes from cpufreq
drivers from Sudeep KarkadaNagesha.
2) Two fixes for recent cpufreq regressions introduced by changes
related to the preservation of sysfs attributes over system
suspend/resume cycles from Viresh Kumar.
3) Fix for ACPI-based wakeup signaling in the PCI subsystem that
fails to stop PME polling for devices put into the D3cold power
state from Rafael J Wysocki.
4) Fix for bad interactions between cpufreq and udev on systems
supporting intel_pstate where acpi-cpufreq is available as well
from Yinghai Lu.
* tag 'pm+acpi-3.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: return EEXIST instead of EBUSY for second registering
PCI / ACPI / PM: Clear pme_poll for devices in D3cold on wakeup
ARM: shmobile: change dev_id to cpu0 while registering cpu clock
ARM: i.MX: change dev_id to cpu0 while registering cpu clock
cpufreq: imx6q-cpufreq: assign cpu_dev correctly to cpu0 device
cpufreq: cpufreq-cpu0: assign cpu_dev correctly to cpu0 device
cpufreq: unlock correct rwsem while updating policy->cpu
cpufreq: Clear policy->cpus bits in __cpufreq_remove_dev_finish()
Linus Torvalds [Fri, 20 Sep 2013 22:16:15 +0000 (15:16 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost
Pull vhost updates from Michael Tsirkin:
"vhost: minor changes on top of 3.12-rc1
This fixes module loading for vhost-scsi, and tweaks locking in vhost
core a bit. Both of these are not exactly release blockers but it's
early in the cycle so I think it's a good idea to apply them now"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-scsi: whitespace tweak
vhost/scsi: use vmalloc for order-10 allocation
vhost: wake up worker outside spin_lock
David Howells [Fri, 20 Sep 2013 13:18:00 +0000 (14:18 +0100)]
CacheFiles: Don't try to dump the index key if the cookie has been cleared
Don't try to dump the index key that distinguishes an object if netfs
data in the cookie the object refers to has been cleared (ie. the
cookie has passed most of the way through
__fscache_relinquish_cookie()).
Since the netfs holds the index key, we can't get at it once the ->def
and ->netfs_data pointers have been cleared - and a NULL pointer
exception will ensue, usually just after a:
CacheFiles: Error: Unexpected object collision
error is reported.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Boyer [Fri, 20 Sep 2013 13:17:52 +0000 (14:17 +0100)]
CacheFiles: Fix memory leak in cachefiles_check_auxdata error paths
In cachefiles_check_auxdata(), we allocate auxbuf but fail to free it if
we determine there's an error or that the data is stale.
Further, assigning the output of vfs_getxattr() to auxbuf->len gives
problems with checking for errors as auxbuf->len is a u16. We don't
actually need to set auxbuf->len, so keep the length in a variable for
now. We shouldn't need to check the upper limit of the buffer as an
overflow there should be indicated by -ERANGE.
While we're at it, fscache_check_aux() returns an enum value, not an
int, so assign it to an appropriately typed variable rather than to ret.
Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Thu, 19 Sep 2013 18:06:46 +0000 (19:06 +0100)]
lockref: use cmpxchg64 explicitly for lockless updates
The cmpxchg() function tends not to support 64-bit arguments on 32-bit
architectures. This could be either due to use of unsigned long
arguments (like on ARM) or lack of instruction support (cmpxchgq on
x86). However, these architectures may implement a specific cmpxchg64()
function to provide 64-bit cmpxchg support instead.
Since the lockref code requires a 64-bit cmpxchg and relies on the
architecture selecting ARCH_USE_CMPXCHG_LOCKREF, move to using cmpxchg64
instead of cmpxchg and allow 32-bit architectures to make use of the
lockless lockref implementation.
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael J. Wysocki [Fri, 20 Sep 2013 13:40:41 +0000 (15:40 +0200)]
Merge branch 'pm-cpufreq'
* pm-cpufreq:
cpufreq: return EEXIST instead of EBUSY for second registering
ARM: shmobile: change dev_id to cpu0 while registering cpu clock
ARM: i.MX: change dev_id to cpu0 while registering cpu clock
cpufreq: imx6q-cpufreq: assign cpu_dev correctly to cpu0 device
cpufreq: cpufreq-cpu0: assign cpu_dev correctly to cpu0 device
cpufreq: unlock correct rwsem while updating policy->cpu
cpufreq: Clear policy->cpus bits in __cpufreq_remove_dev_finish()
Rafael J. Wysocki [Fri, 20 Sep 2013 13:40:30 +0000 (15:40 +0200)]
Merge branch 'acpi-pci'
* acpi-pci:
PCI / ACPI / PM: Clear pme_poll for devices in D3cold on wakeup
Linus Torvalds [Fri, 20 Sep 2013 13:18:51 +0000 (08:18 -0500)]
Merge tag 'arm64-stable' of git://git./linux/kernel/git/cmarinas/linux-aarch64
Pull ARM64 fixes from Catalin Marinas:
- Compat register fault reporting fix
- Documentation clarification on tagged pointers
- hwcap widened to 64-bit (user space already reading it as 64-bit)
* tag 'arm64-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64:
arm64: Widen hwcap to be 64 bit
arm64: Correctly report LR and SP for compat tasks
arm64: documentation: tighten up tagged pointer documentation
arm64: Make do_bad_area() function static
Vladimir Davydov [Sat, 14 Sep 2013 15:39:46 +0000 (19:39 +0400)]
sched/balancing: Fix cfs_rq->task_h_load calculation
Patch a003a2 (sched: Consider runnable load average in move_tasks())
sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is
always 0. This mistype leads to all tasks having weight 0 when load
balancing in a cpu-cgroup enabled setup. There obviously should be sum
of weights of all runnable tasks there instead. Fix it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vladimir Davydov [Sun, 15 Sep 2013 13:49:14 +0000 (17:49 +0400)]
sched/balancing: Fix 'local->avg_load > busiest->avg_load' case in fix_small_imbalance()
In busiest->group_imb case we can come to fix_small_imbalance() with
local->avg_load > busiest->avg_load. This can result in wrong imbalance
fix-up, because there is the following check there where all the
members are unsigned:
if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >=
(scaled_busy_load_per_task * imbn)) {
env->imbalance = busiest->load_per_task;
return;
}
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix it by substituting the subtraction with an equivalent addition in
the check.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vladimir Davydov [Sun, 15 Sep 2013 13:49:13 +0000 (17:49 +0400)]
sched/balancing: Fix 'local->avg_load > sds->avg_load' case in calculate_imbalance()
In busiest->group_imb case we can come to calculate_imbalance() with
local->avg_load >= busiest->avg_load >= sds->avg_load. This can result
in imbalance overflow, because it is calculated as follows
env->imbalance = min(
max_pull * busiest->group_power,
(sds->avg_load - local->avg_load) * local->group_power) / SCHED_POWER_SCALE;
As a result we can end up constantly bouncing tasks from one cpu to
another if there are pinned tasks.
Fix this by skipping the assignment and assuming imbalance=0 in case
local->avg_load > sds->avg_load.
[ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus
belonging to different cores on an HT-enabled machine with N logical
cpus: just look at se.nr_migrations growth. ]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>