Juri Lelli [Tue, 14 Jan 2014 11:03:51 +0000 (12:03 +0100)]
sched/deadline: No need to check p if dl_se is valid
Dan Carpenter reported new 'Smatch' warnings:
> tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
> head:   130816ce4d5f69167324f7272e70aa3d641677c6
> commit: 1baca4ce16b8cc7d4f50be1f7914799af30a2861 [17/50] sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
>
> kernel/sched/deadline.c:937 pick_next_task_dl() warn: variable dereferenced before check 'p' (see line 934)
BUG_ON() already fires if pick_next_dl_entity() doesn't return a valid
dl_se. No need to check if p is valid afterward.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 1baca4ce16b8 ("sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic")
Link: http://lkml.kernel.org/r/52D54E25.6060100@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
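A minimal user-space sketch of the pattern behind the commit above ("No need to check p if dl_se is valid"). It is not the kernel code: struct task, struct dl_entity, task_of(), pick_next() and the on_cpu field are hypothetical stand-ins, and assert() plays the role of BUG_ON(). The point it illustrates is that once the entity pointer has been asserted valid, the task derived from it cannot be NULL, so a later NULL test on the task is redundant and only triggers the "dereferenced before check" warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct dl_entity { int deadline; };
struct task { struct dl_entity dl; int on_cpu; };

/* Recover the enclosing task from its embedded entity (container_of idiom). */
static struct task *task_of(struct dl_entity *dl_se)
{
	return (struct task *)((char *)dl_se - offsetof(struct task, dl));
}

/*
 * Model of the fixed picker: dl_se is asserted valid once, so the task
 * derived from it needs no separate NULL test afterwards.
 */
static struct task *pick_next(struct dl_entity *dl_se)
{
	struct task *p;

	assert(dl_se != NULL);	/* plays the role of BUG_ON(!dl_se) */
	p = task_of(dl_se);
	p->on_cpu = 1;		/* safe: p was derived from a valid dl_se */
	return p;
}
```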
Peter Zijlstra [Tue, 14 Jan 2014 03:23:34 +0000 (11:23 +0800)]
sched/deadline: Remove unused variables
Fix these new sparse warnings:
>> kernel/sched/core.c:305:14: sparse: symbol 'sysctl_sched_dl_period' was not declared. Should it be static?
>> kernel/sched/core.c:306:5: sparse: symbol 'sysctl_sched_dl_runtime' was not declared. Should it be static?
Better still, they're completely unused so remove them.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/n/tip-ke0shkG7vMnzmcdqhhiymyem@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fengguang Wu [Tue, 14 Jan 2014 00:06:36 +0000 (08:06 +0800)]
sched/deadline: Fix sparse static warnings
new sparse warnings:
>> kernel/sched/cpudeadline.c:38:6: sparse: symbol 'cpudl_exchange' was not declared. Should it be static?
>> kernel/sched/cpudeadline.c:46:6: sparse: symbol 'cpudl_heapify' was not declared. Should it be static?
>> kernel/sched/cpudeadline.c:71:6: sparse: symbol 'cpudl_change_key' was not declared. Should it be static?
>> kernel/sched/cpudeadline.c:195:15: sparse: memset with byte count of 163928
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Fixes: 6bfd6d72f51c ("sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap")
Link: http://lkml.kernel.org/r/52d47f8c.EYJsA5+mELPBk4t6%fengguang.wu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Sun, 12 Jan 2014 21:24:56 +0000 (22:24 +0100)]
m68k: Fix build warning in mac_via.h
Fengguang Wu's kbuild test robot reported the following new m68k warnings:
In file included from drivers/nubus/nubus.c:22:0:
>> arch/m68k/include/asm/mac_via.h:262:47: warning: 'struct irq_desc' declared inside parameter list [enabled by default]
>> arch/m68k/include/asm/mac_via.h:262:47: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
Caused by the reworking of the generic local_bh_{dis,en}able() code.
To fix it, forward declare 'struct irq_desc'.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: c795eb55e740 ("sched/preempt, locking: Rework local_bh_{dis,en}able()")
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: geert@linux-m68k.org
Link: http://lkml.kernel.org/r/20140112212456.GQ7572@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
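A small sketch of the fix described in the m68k commit above, assuming nothing about the real mac_via.h beyond what the warning says. The function name handler_irq_number() and the struct layout are hypothetical. Without the forward declaration, a 'struct irq_desc' first mentioned inside a prototype's parameter list is scoped to that prototype alone, which is what the compiler warned about.

```c
#include <assert.h>

/*
 * Forward declaration: gives 'struct irq_desc' file scope before any
 * prototype mentions it, so the prototype and the later definition
 * refer to the same type.
 */
struct irq_desc;

/* Hypothetical prototype in the style of a header file. */
int handler_irq_number(unsigned int irq, struct irq_desc *desc);

/* The full definition can come later, or live in another header. */
struct irq_desc { unsigned int irq; };

int handler_irq_number(unsigned int irq, struct irq_desc *desc)
{
	assert(desc != 0);
	desc->irq = irq;
	return (int)desc->irq;
}
```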
Peter Zijlstra [Wed, 11 Dec 2013 11:21:17 +0000 (12:21 +0100)]
sched, thermal: Clean up preempt_enable_no_resched() abuse
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement because other preempt_count 'refs' are known to be held.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-zcfvacdlvlr63qmnn5i58vuj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 19 Nov 2013 15:13:38 +0000 (16:13 +0100)]
sched, net: Fixup busy_loop_us_clock()
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement because other preempt_count 'refs' are known to be held.
This busy_poll stuff looks to be completely and utterly broken,
sched_clock() can return utter garbage with interrupts enabled (rare
but still) and it can drift unbounded between CPUs.
This means that if you get preempted/migrated and your new CPU is
years behind on the previous CPU we get to busy spin for a _very_ long
time.
There is a _REASON_ sched_clock() warns about preemptability -
papering over it with a preempt_disable()/preempt_enable_no_resched()
is just terminal brain damage on so many levels.
Replace sched_clock() usage with local_clock() which has a bounded
drift between CPUs (<2 jiffies).
There is a further problem with the entire busy wait poll thing in
that the spin time is additive to the syscall timeout, not inclusive.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
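A rough model of why bounded clock drift matters for the busy-poll loop discussed in the commit above. It is only arithmetic, not the networking code: remaining_spin_us() and remaining_after_migration_us() are hypothetical helpers, and the loop is assumed to exit once the clock reading reaches a precomputed end time. If the task migrates to a CPU whose clock lags by some drift, the remaining spin grows by that drift, so an unbounded drift means an unbounded spin, while a bounded drift bounds the extra spin.

```c
#include <assert.h>
#include <stdint.h>

/* Remaining spin for a loop that exits once now >= end (microseconds). */
static uint64_t remaining_spin_us(uint64_t now_us, uint64_t end_us)
{
	return now_us < end_us ? end_us - now_us : 0;
}

/*
 * Remaining spin after migrating to a CPU whose clock lags the old one
 * by drift_us: the new CPU reports an earlier "now", so the wait grows
 * by exactly the drift.
 */
static uint64_t remaining_after_migration_us(uint64_t now_us, uint64_t end_us,
					      uint64_t drift_us)
{
	assert(drift_us <= now_us);	/* keep the model free of wraparound */
	return remaining_spin_us(now_us - drift_us, end_us);
}
```

With a 50 us budget, a lag of 20 us stretches the wait to 70 us, while a lag of 900000 us stretches it to 900050 us.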
Peter Zijlstra [Tue, 19 Nov 2013 15:13:38 +0000 (16:13 +0100)]
sched, net: Clean up preempt_enable_no_resched() abuse
The only valid use of preempt_enable_no_resched() is if the very next
line is schedule() or if we know preemption cannot actually be enabled
by that statement because other preempt_count 'refs' are known to be held.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 20 Nov 2013 11:22:37 +0000 (12:22 +0100)]
sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
With various drivers wanting to inject idle time, we get people
calling idle routines outside of the idle loop proper.
Therefore we need to be extra careful about not missing
TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.
While looking at this, I also realized there's a small window in the
existing idle loop where we can miss TIF_NEED_RESCHED; when it hits
right after the tif_need_resched() test at the end of the loop but
right before the need_resched() test at the start of the loop.
So move preempt_fold_need_resched() out of the loop where we're
guaranteed to have TIF_NEED_RESCHED set.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Mon, 13 Jan 2014 16:37:05 +0000 (17:37 +0100)]
Merge branch 'x86/idle' into sched/core
Merge these x86 specific bits - we are going to add generic bits as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 19 Nov 2013 15:13:38 +0000 (16:13 +0100)]
sched/preempt, locking: Rework local_bh_{dis,en}able()
Currently local_bh_disable() is out-of-line for no apparent reason.
So inline it to save a few cycles on call/return nonsense; the
function body is a single add on x86 (plus a few extra loads and
stores on load/store architectures).
Also expose two new local_bh functions:
__local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);
Which implement the actual local_bh_{dis,en}able() behaviour.
The next patch uses the exposed @cnt argument to optimize bh lock
functions.
With build fixes from Jacob Pan.
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 28 Nov 2013 18:01:40 +0000 (19:01 +0100)]
sched/clock, x86: Avoid a runtime condition in native_sched_clock()
Use a static_key to avoid touching tsc_disabled and a runtime
condition in native_sched_clock() -- less cachelines touched is always
better.
                      MAINLINE  PRE       POST
sched_clock_stable:   1         1         1
(cold) sched_clock:   329841    215295    213039
(cold) local_clock:   301773    220773    216084
(warm) sched_clock:   38375     25659     25231
(warm) local_clock:   100371    27242     27601
(warm) rdtsc:         27340     24208     24203
sched_clock_stable:   0         0         0
(cold) sched_clock:   382634    237019    240055
(cold) local_clock:   396890    294819    299942
(warm) sched_clock:   38194     25609     25276
(warm) local_clock:   143452    71232     73232
(warm) rdtsc:         27345     24243     24244
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-hrz87bo37qke25bty6pnfy4b@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 11 Dec 2013 17:55:53 +0000 (18:55 +0100)]
sched/clock: Fix up clear_sched_clock_stable()
The below tells us the static_key conversion has a problem; since the
exact point of clearing that flag isn't too important, delay the flip
and use a workqueue to process it.
[ ] TSC synchronization [CPU#0 -> CPU#22]:
[ ] Measured 8 cycles TSC warp between CPUs, turning off TSC clock.
[ ]
[ ] ======================================================
[ ] [ INFO: possible circular locking dependency detected ]
[ ] 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 Not tainted
[ ] -------------------------------------------------------
[ ] swapper/0/1 is trying to acquire lock:
[ ] (jump_label_mutex){+.+...}, at: [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ]
[ ] but task is already holding lock:
[ ] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
[ ]
[ ] which lock already depends on the new lock.
[ ]
[ ]
[ ] the existing dependency chain (in reverse order) is:
[ ]
[ ] -> #1 (cpu_hotplug.lock){+.+.+.}:
[ ] [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ] [<ffffffff81093fdc>] get_online_cpus+0x3c/0x60
[ ] [<ffffffff8104cc67>] arch_jump_label_transform+0x37/0x130
[ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
[ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
[ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
[ ] [<ffffffff810c0f65>] sched_feat_set+0xf5/0x100
[ ] [<ffffffff810c5bdc>] set_numabalancing_state+0x2c/0x30
[ ] [<ffffffff81d12f3d>] numa_policy_init+0x1af/0x1b7
[ ] [<ffffffff81cebdf4>] start_kernel+0x35d/0x41f
[ ] [<ffffffff81ceb5a5>] x86_64_start_reservations+0x2a/0x2c
[ ] [<ffffffff81ceb6a2>] x86_64_start_kernel+0xfb/0xfe
[ ]
[ ] -> #0 (jump_label_mutex){+.+...}:
[ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
[ ] [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
[ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ] [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ]
[ ] other info that might help us debug this:
[ ]
[ ] Possible unsafe locking scenario:
[ ]
[ ]        CPU0                    CPU1
[ ]        ----                    ----
[ ]   lock(cpu_hotplug.lock);
[ ]                                lock(jump_label_mutex);
[ ]                                lock(cpu_hotplug.lock);
[ ]   lock(jump_label_mutex);
[ ]
[ ] *** DEADLOCK ***
[ ]
[ ] 2 locks held by swapper/0/1:
[ ] #0: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81094037>] cpu_maps_update_begin+0x17/0x20
[ ] #1: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
[ ]
[ ] stack backtrace:
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
[ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ ] ffffffff82c9c270 ffff880236843bb8 ffffffff8165c5f5 ffffffff82c9c270
[ ] ffff880236843bf8 ffffffff81658c02 ffff880236843c80 ffff8802368586a0
[ ] ffff880236858678 0000000000000001 0000000000000002 ffff880236858000
[ ] Call Trace:
[ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
[ ] [<ffffffff81658c02>] print_circular_bug+0x1f9/0x207
[ ] [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
[ ] [<ffffffff816680ff>] ? __atomic_notifier_call_chain+0x8f/0xb0
[ ] [<ffffffff810def00>] lock_acquire+0x90/0x130
[ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ] [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
[ ] [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
[ ] [<ffffffff8115a637>] jump_label_lock+0x17/0x20
[ ] [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
[ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ] [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] ------------[ cut here ]------------
[ ] WARNING: CPU: 0 PID: 1 at /usr/src/linux-2.6/kernel/smp.c:374 smp_call_function_many+0xad/0x300()
[ ] Modules linked in:
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
[ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ ] 0000000000000009 ffff880236843be0 ffffffff8165c5f5 0000000000000000
[ ] ffff880236843c18 ffffffff81093d8c 0000000000000000 0000000000000000
[ ] ffffffff81ccd1a0 ffffffff810ca951 0000000000000000 ffff880236843c28
[ ] Call Trace:
[ ] [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
[ ] [<ffffffff81093d8c>] warn_slowpath_common+0x8c/0xc0
[ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ] [<ffffffff81093dda>] warn_slowpath_null+0x1a/0x20
[ ] [<ffffffff8110b72d>] smp_call_function_many+0xad/0x300
[ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ] [<ffffffff8110ba96>] smp_call_function+0x46/0x80
[ ] [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
[ ] [<ffffffff8110bb3c>] on_each_cpu+0x3c/0xa0
[ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
[ ] [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
[ ] [<ffffffff8104f964>] text_poke_bp+0x64/0xd0
[ ] [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
[ ] [<ffffffff8104ccde>] arch_jump_label_transform+0xae/0x130
[ ] [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
[ ] [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
[ ] [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
[ ] [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
[ ] [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
[ ] [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
[ ] [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
[ ] [<ffffffff810941cb>] _cpu_up+0xdb/0x160
[ ] [<ffffffff810942c9>] cpu_up+0x79/0x90
[ ] [<ffffffff81d0af6b>] smp_init+0x60/0x8c
[ ] [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
[ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] [<ffffffff8164e32e>] kernel_init+0xe/0x130
[ ] [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
[ ] [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
[ ] ---[ end trace 6ff1df5620c49d26 ]---
[ ] tsc: Marking TSC unstable due to check_tsc_sync_source failed
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-v55fgqj3nnyqnngmvuu8ep6h@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 28 Nov 2013 18:38:42 +0000 (19:38 +0100)]
sched/clock, x86: Use a static_key for sched_clock_stable
In order to avoid the runtime condition and variable load, turn
sched_clock_stable into a static_key.
Also provide a shorter implementation of local_clock() and
cpu_clock(int) when sched_clock_stable==1.
                      MAINLINE  PRE       POST
sched_clock_stable:   1         1         1
(cold) sched_clock:   329841    221876    215295
(cold) local_clock:   301773    234692    220773
(warm) sched_clock:   38375     25602     25659
(warm) local_clock:   100371    33265     27242
(warm) rdtsc:         27340     24214     24208
sched_clock_stable:   0         0         0
(cold) sched_clock:   382634    235941    237019
(cold) local_clock:   396890    297017    294819
(warm) sched_clock:   38194     25233     25609
(warm) local_clock:   143452    71234     71232
(warm) rdtsc:         27345     24245     24243
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 28 Nov 2013 18:31:23 +0000 (19:31 +0100)]
sched/clock: Remove local_irq_disable() from the clocks
Now that x86 no longer requires IRQs disabled for sched_clock() and
ia64 never had this requirement (it doesn't seem to do cpufreq at
all), we can remove the requirement of disabling IRQs.
                      MAINLINE  PRE       POST
sched_clock_stable:   1         1         1
(cold) sched_clock:   329841    257223    221876
(cold) local_clock:   301773    309889    234692
(warm) sched_clock:   38375     25280     25602
(warm) local_clock:   100371    85268     33265
(warm) rdtsc:         27340     24247     24214
sched_clock_stable:   0         0         0
(cold) sched_clock:   382634    301224    235941
(cold) local_clock:   396890    399870    297017
(warm) sched_clock:   38194     25630     25233
(warm) local_clock:   143452    129629    71234
(warm) rdtsc:         27345     24307     24245
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-36e5kohiasnr106d077mgubp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Fri, 29 Nov 2013 14:40:29 +0000 (15:40 +0100)]
sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
Use a ring-buffer-like multi-version object structure which allows
always having a coherent object; we use this to avoid having to
disable IRQs while reading sched_clock() and to avoid a problem when
getting an NMI while changing the cyc2ns data.
                      MAINLINE  PRE       POST
sched_clock_stable:   1         1         1
(cold) sched_clock:   329841    331312    257223
(cold) local_clock:   301773    310296    309889
(warm) sched_clock:   38375     38247     25280
(warm) local_clock:   100371    102713    85268
(warm) rdtsc:         27340     27289     24247
sched_clock_stable:   0         0         0
(cold) sched_clock:   382634    372706    301224
(cold) local_clock:   396890    399275    399870
(warm) sched_clock:   38194     38124     25630
(warm) local_clock:   143452    148698    129629
(warm) rdtsc:         27345     27365     24307
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
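A deliberately simplified sketch of the multi-version idea from the cyc2ns() commit above, using C11 atomics in user space. It is not the kernel implementation: the struct names, field names, two-slot layout and function names are all assumptions made for illustration. The writer never modifies the slot that is currently published; it fills the other slot and then switches the published index, so a reader (even one running in an interrupt that preempts the writer) always finds a fully written object without disabling IRQs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical conversion parameters (cycles -> nanoseconds). */
struct cyc2ns_data_model {
	uint32_t mul;
	uint32_t shift;
	uint64_t offset;
};

struct cyc2ns_model {
	struct cyc2ns_data_model slot[2];
	atomic_uint head;	/* index of the currently published slot */
};

/* Writer: fill the slot readers are not using, then publish it. */
static void cyc2ns_publish(struct cyc2ns_model *c, struct cyc2ns_data_model next)
{
	unsigned int cur = atomic_load_explicit(&c->head, memory_order_relaxed);
	unsigned int idx = cur ^ 1u;

	c->slot[idx] = next;
	/* Release: the slot contents become visible before the new index. */
	atomic_store_explicit(&c->head, idx, memory_order_release);
}

/* Reader: snapshot whichever slot is published right now. */
static struct cyc2ns_data_model cyc2ns_read(struct cyc2ns_model *c)
{
	unsigned int idx = atomic_load_explicit(&c->head, memory_order_acquire);

	assert(idx < 2);
	return c->slot[idx];
}
```

One limitation of this sketch: with only two slots and no tracking of readers in flight, a reader that stalls across two consecutive updates could copy a slot while it is being rewritten. A complete scheme has to account for in-flight readers before reusing a slot; that part is left out here.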
Peter Zijlstra [Fri, 29 Nov 2013 14:39:25 +0000 (15:39 +0100)]
sched/clock, x86: Move some cyc2ns() code around
There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
so move it there.
There are no cycles_2_ns() users.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-01lslnavfgo3kmbo4532zlcj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Fri, 29 Nov 2013 17:04:39 +0000 (18:04 +0100)]
sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock()
Use mul_u64_u32_shr() so that x86_64 can use a single 64x64->128 mul.
Before:
0000000000000560 <native_sched_clock>:
560: 44 8b 1d 00 00 00 00 mov 0x0(%rip),%r11d # 567 <native_sched_clock+0x7>
567: 55 push %rbp
568: 48 89 e5 mov %rsp,%rbp
56b: 45 85 db test %r11d,%r11d
56e: 75 4f jne 5bf <native_sched_clock+0x5f>
570: 0f 31 rdtsc
572: 89 c0 mov %eax,%eax
574: 48 c1 e2 20 shl $0x20,%rdx
578: 48 c7 c1 00 00 00 00 mov $0x0,%rcx
57f: 48 09 c2 or %rax,%rdx
582: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
589: 65 8b 04 25 00 00 00 mov %gs:0x0,%eax
590: 00
591: 48 98 cltq
593: 48 8b 34 c5 00 00 00 mov 0x0(,%rax,8),%rsi
59a: 00
59b: 48 89 d0 mov %rdx,%rax
59e: 81 e2 ff 03 00 00 and $0x3ff,%edx
5a4: 48 c1 e8 0a shr $0xa,%rax
5a8: 48 0f af 14 0e imul (%rsi,%rcx,1),%rdx
5ad: 48 0f af 04 0e imul (%rsi,%rcx,1),%rax
5b2: 5d pop %rbp
5b3: 48 03 04 3e add (%rsi,%rdi,1),%rax
5b7: 48 c1 ea 0a shr $0xa,%rdx
5bb: 48 01 d0 add %rdx,%rax
5be: c3 retq
After:
0000000000000550 <native_sched_clock>:
550: 8b 3d 00 00 00 00 mov 0x0(%rip),%edi # 556 <native_sched_clock+0x6>
556: 55 push %rbp
557: 48 89 e5 mov %rsp,%rbp
55a: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
55e: 85 ff test %edi,%edi
560: 75 2c jne 58e <native_sched_clock+0x3e>
562: 0f 31 rdtsc
564: 89 c0 mov %eax,%eax
566: 48 c1 e2 20 shl $0x20,%rdx
56a: 48 09 c2 or %rax,%rdx
56d: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
574: 00 00
576: 89 c0 mov %eax,%eax
578: 48 f7 e2 mul %rdx
57b: 65 48 8b 0c 25 00 00 mov %gs:0x0,%rcx
582: 00 00
584: c9 leaveq
585: 48 0f ac d0 0a shrd $0xa,%rdx,%rax
58a: 48 01 c8 add %rcx,%rax
58d: c3 retq
                      MAINLINE  POST
sched_clock_stable:   1         1
(cold) sched_clock:   329841    331312
(cold) local_clock:   301773    310296
(warm) sched_clock:   38375     38247
(warm) local_clock:   100371    102713
(warm) rdtsc:         27340     27289
sched_clock_stable:   0         0
(cold) sched_clock:   382634    372706
(cold) local_clock:   396890    399275
(warm) sched_clock:   38194     38124
(warm) local_clock:   143452    148698
(warm) rdtsc:         27345     27365
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-piu203ses5y1g36bnyw2n16x@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
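A user-space sketch of the arithmetic in the mul_u64_u32_shr() commit above. The first function mirrors the single wide multiply followed by a shift; it relies on the unsigned __int128 extension of GCC and Clang on 64-bit targets, which is an assumption about the build environment. The second function is a reading of the "Before" disassembly (shift by 10, mask 0x3ff, two multiplies) and is hypothetical in name and form. For inputs where the split form does not overflow 64 bits, both give the same result, because cyc = q * 1024 + r implies (cyc * mul) >> 10 = q * mul + ((r * mul) >> 10).

```c
#include <assert.h>
#include <stdint.h>

/* One 64x64->128 multiply, then a shift (uses a compiler extension). */
static uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul,
				       unsigned int shift)
{
	assert(shift < 128);	/* shifting a 128-bit value by >= 128 is undefined */
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}

/* Split form read off the "Before" listing: high and low parts separately. */
static uint64_t split_mul_shr10(uint64_t cyc, uint32_t mul)
{
	uint64_t quot = cyc >> 10;	/* cyc / 1024 */
	uint64_t rem = cyc & 0x3ff;	/* cyc % 1024 */

	return quot * mul + ((rem * mul) >> 10);
}
```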
Peter Zijlstra [Wed, 20 Nov 2013 15:52:19 +0000 (16:52 +0100)]
sched/preempt: Take away preempt_enable_no_resched() from modules
Discourage drivers/modules from being creative with preemption.
Sadly all is implemented in macros and inline so if they want to do
evil they still can, but at least try and discourage some.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: rui.zhang@intel.com
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-fn7h6vu8wtgxk0ih402qcijx@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 19 Nov 2013 15:13:38 +0000 (16:13 +0100)]
locking: Optimize lock_bh functions
Currently all _bh_ lock functions do two preempt_count operations:
local_bh_disable();
preempt_disable();
and for the unlock:
preempt_enable_no_resched();
local_bh_enable();
Since it's a waste of perfectly good cycles to modify the same variable
twice when you can do it in one go; use the new
__local_bh_{dis,en}able_ip() functions that allow us to provide a
preempt_count value to add/sub.
So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
add/sub to be done in one go.
As a bonus it gets rid of the preempt_enable_no_resched() usage.
This reduces a 1000 loops of:
spin_lock_bh(&bh_lock);
spin_unlock_bh(&bh_lock);
from 53596 cycles to 51995 cycles. I didn't do enough measurements to
say for absolute sure that the result is significant but the few
runs I did for each suggest it is so.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: jacob.jun.pan@linux.intel.com
Cc: Mike Galbraith <bitbucket@online.de>
Cc: hpa@zytor.com
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: lenb@kernel.org
Cc: rjw@rjwysocki.net
Cc: rui.zhang@intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
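A toy model of the single-update idea from the lock_bh commit above. The counter, the helper names and the numeric offsets are assumptions chosen for illustration (preemption counted in the low bits, softirq-disable in a higher field); only the name SOFTIRQ_LOCK_OFFSET comes from the commit message. The model counts how many times the shared counter is written, showing that one combined add leaves the counter in the same state as the two separate adds.

```c
#include <assert.h>

/* Assumed layout for this model only. */
#define PREEMPT_OFFSET		1u
#define SOFTIRQ_OFFSET		(1u << 8)
#define SOFTIRQ_DISABLE_OFFSET	(2u * SOFTIRQ_OFFSET)
/* What a _bh_ lock adds or subtracts in one go. */
#define SOFTIRQ_LOCK_OFFSET	(SOFTIRQ_DISABLE_OFFSET + PREEMPT_OFFSET)

static unsigned int preempt_count_model;	/* stand-in for preempt_count */
static unsigned int count_updates;		/* number of writes to it */

static void count_add(unsigned int v)
{
	preempt_count_model += v;
	count_updates++;
}

static void count_sub(unsigned int v)
{
	assert(preempt_count_model >= v);
	preempt_count_model -= v;
	count_updates++;
}

/* Old scheme: disable softirqs, then disable preemption (two writes). */
static void lock_bh_two_steps(void)
{
	count_add(SOFTIRQ_DISABLE_OFFSET);
	count_add(PREEMPT_OFFSET);
}

/* New scheme: one combined write on lock and one on unlock. */
static void lock_bh_one_step(void)
{
	count_add(SOFTIRQ_LOCK_OFFSET);
}

static void unlock_bh_one_step(void)
{
	count_sub(SOFTIRQ_LOCK_OFFSET);
}
```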
Daniel Lezcano [Mon, 6 Jan 2014 11:34:45 +0000 (12:34 +0100)]
sched: Factor out the on_null_domain() checks in trigger_load_balance()
The on_null_domain() test is done twice in the trigger_load_balance()
function. Move the test to the beginning of the function, so there is
only one check.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-9-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:44 +0000 (12:34 +0100)]
sched: Pass 'struct rq' to nohz_idle_balance()
The cpu information is stored in the struct rq. Pass the struct rq to
nohz_idle_balance, so all the functions called in run_rebalance_domains have
the same parameters and the 'this_cpu' variable becomes pointless.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ Added !SMP build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-8-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:43 +0000 (12:34 +0100)]
sched: Pass 'struct rq' to rebalance_domains()
The cpu information is stored in the struct rq and the caller of the
rebalance_domains function passes the cpu to retrieve the struct rq but
it already has the struct rq info. Replace the cpu parameter with the
struct rq.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-7-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:42 +0000 (12:34 +0100)]
sched: Remove unused parameter from nohz_balancer_kick()
The cpu parameter is no longer needed in nohz_balancer_kick, let's remove
the parameter.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-6-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:41 +0000 (12:34 +0100)]
sched: Remove unused parameter from find_new_ilb()
The 'call_cpu' is never used in the function. Remove it.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-5-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:40 +0000 (12:34 +0100)]
sched: Pass 'struct rq' to on_null_domain()
The on_null_domain() function is getting the cpu to retrieve the struct rq
associated with it.
Pass 'struct rq' directly to the function as the caller already has the info.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-4-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:39 +0000 (12:34 +0100)]
sched: Reduce nohz_kick_needed() parameters
The cpu information is already stored in the struct rq, so no need to pass it
as parameter to the nohz_kick_needed function.
The caller of this function has just called idle_cpu() to fill the
rq->idle_balance field.
Use rq->cpu and rq->idle_balance.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-3-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano [Mon, 6 Jan 2014 11:34:38 +0000 (12:34 +0100)]
sched: Reduce trigger_load_balance() parameters
The cpu information is already stored in the struct rq, so no need to pass it
as parameter to the trigger_load_balance function.
Cc: linaro-kernel@lists.linaro.org
Cc: preeti.lkml@gmail.com
Cc: mingo@redhat.com
Cc: peterz@infradead.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 19 Dec 2013 10:54:45 +0000 (11:54 +0100)]
sched/deadline: Fix hotplug admission control
The current hotplug admission control is broken because:
CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()
cannot fail and hard assumes it _will_ move all tasks off of the dying
cpu, failing this will break hotplug.
The much simpler solution is a DOWN_PREPARE handler that fails when
removing one CPU gets us below the total allocated bandwidth.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
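A sketch of the admission test described in the hotplug commit above: refuse to take a CPU down if the remaining CPUs could no longer cover the bandwidth already allocated. The function name, the fixed-point unit and the parameter names are assumptions for illustration; the real handler works on the scheduler's own bandwidth structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bandwidth as a fixed-point fraction of one CPU (assumed unit). */
#define BW_UNIT (1u << 20)

/*
 * Returns true if one CPU may be removed: the bandwidth already handed
 * out (total_bw) must still fit into the capacity of the CPUs that
 * remain, each offering bw_per_cpu.
 */
static bool can_remove_cpu(uint64_t total_bw, uint64_t bw_per_cpu,
			   unsigned int online_cpus)
{
	assert(online_cpus > 0);
	return total_bw <= bw_per_cpu * (online_cpus - 1);
}
```

For example, with 4 CPUs at BW_UNIT each, an allocation of exactly 3 * BW_UNIT still allows one CPU to go away, while anything above that makes the prepare step fail.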
Peter Zijlstra [Tue, 17 Dec 2013 11:44:49 +0000 (12:44 +0100)]
sched/deadline: Remove the sysctl_sched_dl knobs
Remove the deadline specific sysctls for now. The problem with them is
that the interaction with the existing rt knobs is nearly impossible
to get right.
The current (as per before this patch) situation is that the rt and dl
bandwidth is completely separate and we enforce rt+dl < 100%. This is
undesirable because this means that the rt default of 95% leaves us
hardly any room, even though dl tasks are safer than rt tasks.
Another proposed solution (in a discarded patch) was to have the dl
bandwidth be a fraction of the rt bandwidth. This is highly
confusing imo.
Furthermore neither proposal is consistent with the situation we
actually want; which is rt tasks run from a dl server. In which case
the rt bandwidth is a direct subset of dl.
So whichever way we go, the introduction of dl controls at this point
is painful. Therefore remove them and instead share the rt budget.
This means that for now the rt knobs are used for dl admission control
and the dl runtime is accounted against the rt runtime. I realise that
this isn't entirely desirable either; but whatever we do we appear to
need to change the interface later, so better have a small interface
for now.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 17 Dec 2013 09:03:34 +0000 (10:03 +0100)]
sched/deadline: Fix up the smp-affinity mask tests
For now deadline tasks are not allowed to set smp affinity; however
the current tests are wrong, cure this.
The test in __sched_setscheduler() also uses an on-stack cpumask_t
which is a no-no.
Change both tests to use cpumask_subset() such that we test the root
domain span to be a subset of the cpus_allowed mask. This way we're
sure the tasks can always run on all CPUs they can be balanced over,
and have no effective affinity constraints.
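A minimal sketch of the subset test described above, with CPU masks modelled as Python integers used as bitmasks. The helper mirrors the semantics of the kernel's cpumask_subset() (every bit of the first mask is also set in the second); the function dl_affinity_ok() and its parameter names are hypothetical.

```python
def cpumask_subset(src1, src2):
    """True if every CPU set in src1 is also set in src2."""
    return (src1 & ~src2) == 0

def dl_affinity_ok(root_domain_span, cpus_allowed):
    """A -deadline task is acceptable only if it may run on every CPU
    of its root domain, i.e. it has no effective affinity constraint."""
    return cpumask_subset(root_domain_span, cpus_allowed)
```

Note the direction of the test: the root domain span must be contained in the allowed mask, not the other way round.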
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-fyqtb1lapxca3lhsxv9cumdc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Thu, 7 Nov 2013 13:43:47 +0000 (14:43 +0100)]
sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap
Data from tests confirmed that the original active load balancing
logic scaled neither in the number of CPUs nor in the number of
tasks (as sched_rt does).
Here we provide a global data structure to keep track of deadlines
of the running tasks in the system. The structure is composed of
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.
The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPU machines [1].
Only the push path is addressed; the extension to use this structure
also for pull decisions is straightforward. However, we are currently
evaluating different (in order to decrease/avoid contention) data
structures to solve possibly both problems. We are also going to re-run
tests considering recent changes inside cpupri [2].
[1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[2] http://www.spinics.net/lists/linux-rt-users/msg06778.html
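The structure described above (a set of free CPUs plus a max-heap of per-CPU deadlines) might be sketched like this. It is a simplified, single-threaded illustration: the class name, the lazy removal of stale heap entries and the return convention of -1 are assumptions, not the kernel implementation.

```python
import heapq

class CpuDeadlineHeap:
    """Tracks, per CPU, the deadline of the task currently running there."""

    def __init__(self, nr_cpus):
        self.free = set(range(nr_cpus))  # CPUs running no -deadline task
        self.deadline = {}               # cpu -> current deadline
        self._heap = []                  # (-deadline, cpu); may hold stale entries

    def set(self, cpu, deadline):
        """Record a new deadline for cpu; deadline=None marks it free."""
        if deadline is None:
            self.deadline.pop(cpu, None)
            self.free.add(cpu)
            return
        self.free.discard(cpu)
        self.deadline[cpu] = deadline
        heapq.heappush(self._heap, (-deadline, cpu))

    def find(self, task_deadline, allowed):
        """Pick a CPU to push a task to, or return -1 if none is suitable."""
        candidates = self.free & allowed
        if candidates:
            return min(candidates)
        while self._heap:
            neg_dl, cpu = self._heap[0]
            if self.deadline.get(cpu) != -neg_dl:
                heapq.heappop(self._heap)  # drop stale entry
                continue
            # Heap top is the CPU running the latest deadline.
            if cpu in allowed and -neg_dl > task_deadline:
                return cpu
            return -1
        return -1
```

Free CPUs are preferred; the heap is only consulted when the system is fully loaded, which matches the "needed when the system is heavily loaded" remark in the changelog.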
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 7 Nov 2013 13:43:45 +0000 (14:43 +0100)]
sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
For deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.
Since RT-throttling was introduced, each task group has a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distribution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.
However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).
Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones don't!), and thus we don't need a higher level
throttling mechanism to enforce the desired bandwidth.
This patch, therefore:
- adds system wide deadline bandwidth management by means of:
* /proc/sys/kernel/sched_dl_runtime_us,
* /proc/sys/kernel/sched_dl_period_us,
that determine (i.e., runtime / period) the total bandwidth
available on each CPU of each root_domain for -deadline tasks;
- couples the RT and deadline bandwidth management, i.e., enforces
that the sum of the bandwidth devoted to -rt and
-deadline tasks stays below 100%.
This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created as long as the sum of their bandwidths stays below:
M * (sched_dl_runtime_us / sched_dl_period_us)
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
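As a rough sketch of the admission test implied by the formula above: the function name, the tuple representation of tasks and the use of None to model "management disabled" are assumptions made for illustration, and treating a total exactly equal to the cap as admissible is also an assumption.

```python
from fractions import Fraction

def dl_admit(new_task, admitted, nr_cpus, dl_runtime_us, dl_period_us):
    """Decide whether a new -deadline task fits in a root domain.

    new_task -- (runtime, period) of the task asking for admission
    admitted -- list of (runtime, period) for tasks already admitted
    nr_cpus  -- M, the number of CPUs in the root domain
    dl_runtime_us=None models bandwidth management being disabled
    (hypothetical convention for this sketch).
    """
    if dl_runtime_us is None:
        return True
    cap = nr_cpus * Fraction(dl_runtime_us, dl_period_us)
    total = sum(Fraction(r, p) for r, p in admitted) + Fraction(*new_task)
    return total <= cap
```

For example, with M = 2 and a per-CPU cap of 95%, three tasks of bandwidth 0.5 each fit (1.5 <= 1.9), while a fourth would raise the total to 2.0 and be rejected.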
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 7 Nov 2013 13:43:44 +0000 (14:43 +0100)]
sched/deadline: Add SCHED_DEADLINE inheritance logic
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI code is needed. This raises far from trivial issues that
need (in our view) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development; in the meantime, as a temporary solution,
what this commit does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and its
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for those replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
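The temporary boosting rule can be illustrated roughly as below. The dataclass and the function name are hypothetical; the sketch only shows which parameters are chosen for a replenishment, not how the kernel performs it.

```python
from dataclasses import dataclass

@dataclass
class DlParams:
    runtime: int       # default runtime of the task
    rel_deadline: int  # relative deadline of the task
    abs_deadline: int  # current absolute (scheduling) deadline

def replenish_params(owner, waiters):
    """Return (runtime, relative deadline) to use when the lock owner
    runs out of runtime while holding a pi-lock."""
    if not waiters:
        # No waiters: the owner is replenished with its own parameters.
        return owner.runtime, owner.rel_deadline
    # With waiters: borrow the parameters of the earliest-deadline waiter.
    top = min(waiters, key=lambda w: w.abs_deadline)
    return top.runtime, top.rel_deadline
```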
We would stress the fact that this is only a clearly needed, but far from
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Thu, 7 Nov 2013 13:43:43 +0000 (14:43 +0100)]
rtmutex: Turn the plist into an rb-tree
Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.
This is done mainly because:
- classical prio field of the plist is just an int, which might
not be enough for representing a deadline;
- manipulating such a list would become O(nr_deadline_tasks),
which might be too much, as the number of -deadline tasks increases.
Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
- among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
one with the higher (lower, actually!) prio wins;
- among a -priority and a -deadline task, the latter always wins;
- among two -deadline tasks, the one with the earliest deadline
wins.
Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.
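The ordering rules listed above can be expressed as a sort key. This is an illustrative sketch only: tasks are modelled as plain dictionaries, and a "deadline" of None is a hypothetical way to mark a -priority task.

```python
def waiter_key(task):
    """Sort key implementing the queueing logic: smaller key wins.

    -deadline tasks always beat -priority tasks; among -deadline tasks
    the earliest deadline wins; among -priority tasks the numerically
    lower prio value wins.
    """
    if task["deadline"] is not None:
        return (0, task["deadline"])
    return (1, task["prio"])

def order_waiters(tasks):
    """Return the tasks in the order the rb-tree would hold them."""
    return sorted(tasks, key=waiter_key)
```

A single comparison function covering both kinds of task is what lets one tree hold -priority and -deadline waiters together.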
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 7 Nov 2013 13:43:42 +0000 (14:43 +0100)]
sched/deadline: Add latency tracing for SCHED_DEADLINE tasks
It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.
For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.
As a consequence of applying this patch there will be three wakeup
latency tracers:
* "wakeup", that deals with all tasks in the system;
* "wakeup_rt", that deals with -rt and -deadline tasks only;
* "wakeup_dl", that deals with -deadline tasks only.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Harald Gustafsson [Thu, 7 Nov 2013 13:43:40 +0000 (14:43 +0100)]
sched/deadline: Add period support for SCHED_DEADLINE tasks
Make it possible to specify a period (different from or equal to the
deadline) for -deadline tasks. Relative deadlines (D_i) are used on
task arrivals to generate new scheduling (absolute) deadlines as "d =
t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
= d + P_i" when the budget is zero.
This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once
activated (short deadlines).
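The two update rules quoted above ("d = t + D_i" on arrival, "d = d + P_i" when the budget is exhausted) amount to the following. The function names are hypothetical and time units are arbitrary.

```python
def deadline_on_arrival(now, rel_deadline):
    """New scheduling deadline when an instance arrives: d = t + D_i."""
    return now + rel_deadline

def deadline_on_budget_exhausted(abs_deadline, period):
    """Postponed scheduling deadline when the budget hits zero: d = d + P_i."""
    return abs_deadline + period
```

With a short relative deadline and a long period, a task is scheduled urgently after each activation, but a postponement pushes its deadline out by a whole period.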
Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 7 Nov 2013 13:43:39 +0000 (14:43 +0100)]
sched/deadline: Add SCHED_DEADLINE avg_update accounting
Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Juri Lelli [Thu, 7 Nov 2013 13:43:38 +0000 (14:43 +0100)]
sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.
Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.
The very same approach used in sched_rt is utilised:
- -deadline tasks are kept in CPU-specific runqueues,
- -deadline tasks are migrated among runqueues to achieve the
following:
* on an M-CPU system the M earliest deadline ready tasks
are always running;
* affinity/cpusets settings of all the -deadline tasks are
always respected.
Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.
To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.
In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 28 Nov 2013 10:14:43 +0000 (11:14 +0100)]
sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.
Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.
The typical -deadline task will be made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance needs to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.
The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that each
task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.
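A toy model of the EDF selection and CBS budget enforcement described above. It is a heavily simplified sketch: the class and function names are hypothetical, and replenishment, timers and SMP are left out entirely.

```python
from dataclasses import dataclass

@dataclass
class DlTask:
    name: str
    runtime: int          # budget available to each instance
    rel_deadline: int     # relative deadline
    remaining: int = 0    # budget left for the current instance
    abs_deadline: int = 0
    throttled: bool = False

def activate(task, now):
    """A new instance arrives: refill the budget, set absolute deadline."""
    task.remaining = task.runtime
    task.abs_deadline = now + task.rel_deadline
    task.throttled = False

def edf_pick(tasks):
    """EDF: run the non-throttled task with the smallest absolute deadline."""
    ready = [t for t in tasks if not t.throttled]
    return min(ready, key=lambda t: t.abs_deadline, default=None)

def account(task, ran_for):
    """CBS: charge execution time; a task that exhausts its budget is
    throttled so it cannot interfere with the other tasks."""
    task.remaining -= ran_for
    if task.remaining <= 0:
        task.throttled = True
```

The throttling step is what provides the bandwidth isolation: a task that overruns its declared runtime only delays itself.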
To summarize, this patch:
- introduces the data structures, constants and symbols needed;
- implements the core logic of the scheduling algorithm in the new
scheduling class file;
- provides all the glue code between the new scheduling class and
the core scheduler and refines the interactions between sched/dl
and the other existing scheduling classes.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dario Faggioli [Thu, 7 Nov 2013 13:43:36 +0000 (14:43 +0100)]
sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).
In general, this makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of its own timing constraints,
i.e.:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.
For these reasons, this patch:
- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;
- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().
Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.
Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
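For orientation, a userspace view of struct sched_attr can be sketched with ctypes. The field layout below follows the one later documented in the sched_setattr(2) man page and is an assumption here: the structure as introduced by this particular patch may differ. The value 6 for SCHED_DEADLINE and the helper name are likewise assumptions, and no system call is made.

```python
import ctypes

SCHED_DEADLINE = 6  # assumed policy number, per later mainline headers

class SchedAttr(ctypes.Structure):
    # Assumed layout, per the later sched_setattr(2) documentation.
    _fields_ = [
        ("size", ctypes.c_uint32),            # size of this structure
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),       # SCHED_NORMAL, SCHED_BATCH
        ("sched_priority", ctypes.c_uint32),  # SCHED_FIFO, SCHED_RR
        ("sched_runtime", ctypes.c_uint64),   # instance execution time
        ("sched_deadline", ctypes.c_uint64),  # relative deadline
        ("sched_period", ctypes.c_uint64),    # minimum inter-arrival time
    ]

def make_deadline_attr(runtime_ns, deadline_ns, period_ns):
    """Fill in a SchedAttr for a periodic/sporadic task (hypothetical helper)."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_DEADLINE
    attr.sched_runtime = runtime_ns
    attr.sched_deadline = deadline_ns
    attr.sched_period = period_ns
    return attr
```

Carrying an explicit size field is what allows the structure to grow later without breaking the ABI again, which is the problem the changelog raises for struct sched_param.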
Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Mon, 13 Jan 2014 12:35:28 +0000 (13:35 +0100)]
Merge branch 'sched/urgent' into sched/core
Pick up the latest fixes before applying new changes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Mon, 6 Jan 2014 11:39:12 +0000 (11:39 +0000)]
sched: Calculate effective load even if local weight is 0
Thomas Hellstrom bisected a regression where erratic 3D performance is
experienced on virtual machines as measured by glxgears. It identified
commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA
node") as the problem which had modified the behaviour of effective_load.
Effective load calculates the difference to the system-wide load if a
scheduling entity was moved to another CPU. The task group is not heavier
as a result of the move but overall system load can increase/decrease as a
result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs
on a preferred NUMA node") changed effective_load to make it suitable for
calculating if a particular NUMA node was compute overloaded. To reduce
the cost of the function, it assumed that a current sched entity weight
of 0 was uninteresting but that is not the case.
wake_affine() uses a weight of 0 for sync wakeups on the grounds that it
is assuming the waking task will sleep and not contribute to load in the
near future. In this case, we still want to calculate the effective load
of the sched entity hierarchy. As effective_load is no longer used by
task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide
search to find swap/migration candidates), this patch simply restores the
historical behaviour.
Reported-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
[ Wrote changelog]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Fri, 10 Jan 2014 23:37:11 +0000 (06:37 +0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Famous last words: "final pull request" :-)
I'm sending this because Jason Wang's fixes are pretty important
1) Add missing per-cpu stats initialization to ip6_vti. Otherwise
lockdep spits out a call trace. From Li RongQing.
2) Fix NULL oops in wireless hwsim, from Javier Lopez
3) TIPC deferred packet queue unlink must NULL out skb->next to avoid
crashes. From Erik Hugne
4) Fix access to uninitialized buffer in nf_nat netfilter code, from
Daniel Borkmann
5) Fix lifetime of ipv6 loopback and SIT tunnel addresses, otherwise
they basically timeout immediately. From Hannes Frederic Sowa
6) Fix DMA unmapping of TSO packets in bnx2x driver, from Michal
Schmidt
7) Do not allow L2 forwarding offload via macvtap device, the way
things are now it will not end up being forwarded at all. From
Jason Wang
8) Fix transmit queue selection via ndo_dfwd_start_xmit(), fixing
things like applying NETIF_F_LLTX to the wrong device (!!) and
eliding the proper transmit watchdog handling
9) qlcnic driver was not updating tx statistics at all, from Manish
Chopra"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
qlcnic: Fix ethtool statistics length calculation
qlcnic: Fix bug in TX statistics
net: core: explicitly select a txq before doing l2 forwarding
macvlan: forbid L2 fowarding offload for macvtap
bnx2x: fix DMA unmapping of TSO split BDs
ipv6: add link-local, sit and loopback address with INFINITY_LIFE_TIME
bnx2x: prevent WARN during driver unload
tipc: correctly unlink packets from deferred packet queue
ipv6: pcpu_tstats.syncp should be initialised in ip6_vti.c
netfilter: only warn once on wrong seqadj usage
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
NFC: Fix target mode p2p link establishment
iwlwifi: add new devices for 7265 series
mac80211: move "bufferable MMPDU" check to fix AP mode scan
mac80211_hwsim: Fix NULL pointer dereference
Linus Torvalds [Fri, 10 Jan 2014 23:33:03 +0000 (06:33 +0700)]
Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
"Here we have a bugfix for an off-by-one in the remote attribute
verifier that results in a forced shutdown which you can hit with v5
superblock by creating a 64k xattr, and a fix for a missing
destroy_work_on_stack() in the allocation worker.
It's a bit late, but they are both fairly straightforward"
* tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs:
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
Linus Torvalds [Fri, 10 Jan 2014 23:26:27 +0000 (06:26 +0700)]
Merge branch 'leds-fixes-for-3.13' of git://git./linux/kernel/git/cooloney/linux-leds
Pull LED fix from Bryan Wu:
"Pali Rohár and Pavel Machek reported the LED of Nokia N900 doesn't
work with our latest 3.13-rc6 kernel. Milo fixed the regression here"
* 'leds-fixes-for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds:
leds: lp5521/5523: Remove duplicate mutex
Linus Torvalds [Fri, 10 Jan 2014 23:25:02 +0000 (06:25 +0700)]
Merge tag 'pm+acpi-3.13-rc8' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
- Recent commits modifying the lists of C-states in the intel_idle
driver introduced bugs leading to crashes on some systems. Two fixes
from Jiang Liu.
- The ACPI AC driver should receive all types of notifications, but
recent change made it ignore some of them. Fix from Alexander Mezin.
- intel_pstate's validity checks for MSRs it depends on are not
sufficient to catch the lack of support in nested KVM setups, so they
are extended to cover that case. From Dirk Brandewie.
- NEC LZ750/LS has a botched up _BIX method in its ACPI tables, so our
ACPI battery driver needs a quirk for it. From Lan Tianyu.
- The tpm_ppi driver sometimes leaks memory allocated by
acpi_get_name(). Fix from Jiang Liu.
* tag 'pm+acpi-3.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_idle: close avn_cstates array with correct marker
Revert "intel_idle: mark states tables with __initdata tag"
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
ACPI / TPM: fix memory leak when walking ACPI namespace
ACPI / AC: change notification handler type to ACPI_ALL_NOTIFY
Linus Torvalds [Fri, 10 Jan 2014 23:23:57 +0000 (06:23 +0700)]
Merge tag 'mfd-fixes-3.13-2' of git://git./linux/kernel/git/sameo/mfd-fixes
Pull MFD fix from Samuel Ortiz:
"This is the 2nd MFD pull request for 3.13
It only contains one fix for the rtsx_pcr driver. Without it we see a
kernel panic on some machines, when resuming from suspend to RAM"
* tag 'mfd-fixes-3.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-fixes:
mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
Milo Kim [Tue, 3 Dec 2013 01:21:44 +0000 (17:21 -0800)]
leds: lp5521/5523: Remove duplicate mutex
It can be a problem when a pattern is loaded via the firmware interface.
LP55xx common driver has already locked the mutex in 'lp55xx_firmware_loaded()'.
So it should be deleted.
On the other hand, locks are required in store_engine_load()
on updating program memory.
Reported-by: Pali Rohár <pali.rohar@gmail.com>
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Milo Kim <milo.kim@ti.com>
Signed-off-by: Bryan Wu <cooloney@gmail.com>
Cc: <stable@vger.kernel.org>
Chuansheng Liu [Tue, 7 Jan 2014 08:53:34 +0000 (16:53 +0800)]
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
In case CONFIG_DEBUG_OBJECTS_WORK is defined, destroy_work_on_stack(),
which frees the debug object, must be called to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 6f96b3063cdd473c68664a190524ed966ac0cd92)
Jie Liu [Wed, 1 Jan 2014 11:28:03 +0000 (19:28 +0800)]
xfs: fix off-by-one error in xfs_attr3_rmt_verify
With CRC checking enabled, trying to set an attribute value whose size is
exactly XATTR_SIZE_MAX causes the v3 remote
attr write verification procedure to fail, which yields a back
trace like the one below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fixes it by checking whether the remote EA size is greater than
XATTR_SIZE_MAX rather than greater than or equal to it, because a
specified EA value size equal to the limit is valid as
per the VFS setxattr interface.
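The off-by-one can be illustrated with a simplified model of the size check. The two function names are hypothetical and the real verifier tests more than this single condition; XATTR_SIZE_MAX is taken as 65536, matching the 64k value used in the test command above.

```python
XATTR_SIZE_MAX = 65536  # 64k, as exercised by the setfattr test above

def rmt_size_ok_before_fix(value_size):
    """Old behaviour: a value of exactly XATTR_SIZE_MAX is rejected."""
    return not (value_size >= XATTR_SIZE_MAX)

def rmt_size_ok_after_fix(value_size):
    """Fixed behaviour: only values larger than XATTR_SIZE_MAX are rejected."""
    return not (value_size > XATTR_SIZE_MAX)
```

The two versions differ only at the boundary value, which is exactly the case the VFS allows and the old verifier treated as corruption.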
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 85dd0707f0cad26d60f2dc574d17a5ab948d10f7)
Shahed Shaikh [Thu, 9 Jan 2014 17:41:05 +0000 (12:41 -0500)]
qlcnic: Fix ethtool statistics length calculation
o Consider number of Tx queues while calculating the length of
Tx statistics as part of ethtool stats.
o Calculate statistics length properly for 82xx and 83xx adapter
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Thu, 9 Jan 2014 17:41:04 +0000 (12:41 -0500)]
qlcnic: Fix bug in TX statistics
o Driver was not updating TX stats so it was not populating
statistics in `ifconfig` command output.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 10 Jan 2014 08:18:26 +0000 (16:18 +0800)]
net: core: explicitly select a txq before doing l2 forwarding
Currently, the tx queue is selected implicitly in ndo_dfwd_start_xmit(). This
will cause several issues:
- NETIF_F_LLTX was removed for macvlan, so the txq lock is taken for macvlan
instead of the lower device, which misses the necessary txq synchronization for
the lower device such as txq stopping or freezing required by the dev watchdog or
control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
watchdog.
- dev_hard_start_xmit() does not check txq everywhere, which will lead to a crash
when tso is disabled for the lower device.
Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.
With this fix, NETIF_F_LLTX can be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.
In the future, this will also be required for macvtap l2 forwarding support since it
provides a necessary synchronization method.
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Fri, 10 Jan 2014 08:18:25 +0000 (16:18 +0800)]
macvlan: forbid L2 fowarding offload for macvtap
L2 forwarding offload will bypass the rx handler of the real device. As a result,
packets cannot be forwarded to the macvtap device. Another problem is that the
dev_hard_start_xmit() called for macvtap does not have any synchronization.
Fix this by forbidding L2 forwarding for macvtap.
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Jan 2014 18:21:22 +0000 (13:21 -0500)]
Merge branch 'for-davem' of git://git./linux/kernel/git/linville/wireless
John W. Linville says:
====================
For the mac80211 bits, Johannes says:
"I have a fix from Javier for mac80211_hwsim when used with wmediumd
userspace, and a fix from Felix for buffering in AP mode."
For the NFC bits, Samuel says:
"This pull request only contains one fix for a regression introduced with
commit e29a9e2ae165620d. Without this fix, we can not establish a p2p link
in target mode. Only initiator mode works."
For the iwlwifi bits, Emmanuel says:
"It only includes new device IDs so it's not vital. If you have a pull
request to net.git anyway, I'd be happy to have this in."
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Schmidt [Thu, 9 Jan 2014 13:36:27 +0000 (14:36 +0100)]
bnx2x: fix DMA unmapping of TSO split BDs
bnx2x triggers warnings with CONFIG_DMA_API_DEBUG=y:
WARNING: CPU: 0 PID: 2253 at lib/dma-debug.c:887 check_unmap+0xf8/0x920()
bnx2x 0000:28:00.0: DMA-API: device driver frees DMA memory with
different size [device address=0x00000000da2b389e] [map size=1490 bytes]
[unmap size=66 bytes]
The reason is that bnx2x splits a TSO BD into two BDs (headers + data)
using one DMA mapping for both, but it uses only the length of the first
BD when unmapping.
This patch fixes the bug by unmapping the whole length of the two BDs.
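The size mismatch in the warning can be illustrated with a small userspace sketch. It assumes a hypothetical, reduced `struct bd` and a hypothetical helper `split_unmap_len()`; neither is taken from the bnx2x driver. The 66-byte header length and the 1490-byte mapping come from the warning above, and the 1424-byte data length is simply their difference.

```c
#include <stddef.h>

/* Hypothetical, simplified buffer descriptor; not the bnx2x layout. */
struct bd {
	size_t nbytes;
};

/*
 * Length to pass when unmapping a TSO buffer that was mapped once
 * and then split into a header BD and a data BD. Using only
 * hdr->nbytes here would reproduce the "different size" warning.
 */
static size_t split_unmap_len(const struct bd *hdr, const struct bd *data)
{
	return hdr->nbytes + data->nbytes;
}
```

With a 66-byte header BD and a 1424-byte data BD, the helper yields the full 1490 bytes that were originally mapped.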
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 10 Jan 2014 08:57:23 +0000 (15:57 +0700)]
Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux
Pull clock fixes from Mike Turquette:
"Late fixes for clock drivers. All of these fixes are for user-visible
regressions, typically boot failures or other unsafe system
configuration that causes badness"
* tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux:
clk: clk-divider: fix divisor > 255 bug
clk: exynos: File scope reg_save array should depend on PM_SLEEP
clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
ARM: dts: exynos5250: Fix MDMA0 clock number
clk: samsung: exynos5250: Add MDMA0 clocks
clk: samsung: exynos5250: Fix ACP gate register offset
clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
clk: samsung: exynos4: Correct SRC_MFC register
Linus Torvalds [Fri, 10 Jan 2014 08:54:49 +0000 (15:54 +0700)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A few fixes for Renesas platforms to fixup DMA masks (this started
causing errors once the DMA API added checks for valid masks in 3.13)"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: shmobile: mackerel: Fix coherent DMA mask
ARM: shmobile: kzm9g: Fix coherent DMA mask
ARM: shmobile: armadillo: Fix coherent DMA mask
Hannes Frederic Sowa [Wed, 8 Jan 2014 14:43:22 +0000 (15:43 +0100)]
ipv6: add link-local, sit and loopback address with INFINITY_LIFE_TIME
In the past the IFA_PERMANENT flag indicated that the valid and preferred
lifetimes were ignored. Since change fad8da3e085ddf ("ipv6 addrconf: fix
preferred lifetime state-changing behavior while valid_lft is infinity")
we honour at least the preferred lifetime on those addresses. As such
the valid lifetime gets recalculated and updated to 0.
If the loopback address is added manually, this problem does not occur.
Also, if NetworkManager manages IPv6, those addresses will get added via
inet6_rtm_newaddr and thus will have a correct lifetime, too.
Reported-by: François-Xavier Le Bail <fx.lebail@yahoo.com>
Reported-by: Damien Wyart <damien.wyart@gmail.com>
Fixes: fad8da3e085ddf ("ipv6 addrconf: fix preferred lifetime state-changing behavior while valid_lft is infinity")
Cc: Yasushi Asano <yasushi.asano@jp.fujitsu.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Tue, 7 Jan 2014 10:07:41 +0000 (12:07 +0200)]
bnx2x: prevent WARN during driver unload
Starting with commit 80c33dd "net: add might_sleep() call to napi_disable"
bnx2x fails the might_sleep tests, causing a stack trace to appear whenever
the driver is unloaded, as local_bh_disable() is being called before
napi_disable().
This changes the locking semantics related to CONFIG_NET_RX_BUSY_POLL,
removing the need to call local_bh_disable() and thus eliminating
the issue.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafael J. Wysocki [Fri, 10 Jan 2014 02:08:58 +0000 (03:08 +0100)]
Merge branch 'pm-cpuidle'
* pm-cpuidle:
intel_idle: close avn_cstates array with correct marker
Revert "intel_idle: mark states tables with __initdata tag"
Jiang Liu [Thu, 9 Jan 2014 07:30:27 +0000 (15:30 +0800)]
intel_idle: close avn_cstates array with correct marker
Close avn_cstates array with correct marker to avoid overflow
in function intel_idle_cpu_init().
[rjw: The problem was introduced when commit 22e580d07f65 was merged
on top of eba682a5aeb6 (intel_idle: shrink states tables).]
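Why a missing end marker leads to an overflow can be sketched in userspace C. The `struct state`, the table contents, and `count_states()` below are hypothetical simplifications and not the intel_idle code; the sketch assumes that a NULL `enter` callback is what terminates the walk over an unsized table.

```c
#include <stddef.h>

/* Hypothetical, simplified state descriptor; not struct cpuidle_state. */
struct state {
	const char *name;
	int (*enter)(void);	/* NULL marks the end of the table */
};

static int enter_stub(void)
{
	return 0;
}

/* An unsized table must be closed by an explicit end marker. */
static const struct state demo_states[] = {
	{ .name = "C1", .enter = enter_stub },
	{ .name = "C6", .enter = enter_stub },
	{ .enter = NULL },	/* end marker */
};

/* Count usable entries by walking up to the end marker. */
static size_t count_states(const struct state *table)
{
	size_t n = 0;

	while (table[n].enter != NULL)
		n++;
	return n;
}
```

Without the final `{ .enter = NULL }` entry, the loop in `count_states()` would keep reading past the end of the array.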
Fixes: 22e580d07f65 (intel_idle: Fixed C6 state on Avoton/Rangeley processors)
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
John W. Linville [Thu, 9 Jan 2014 15:19:01 +0000 (10:19 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/linville/wireless into for-davem
Jiang Liu [Thu, 9 Jan 2014 07:30:26 +0000 (15:30 +0800)]
Revert "intel_idle: mark states tables with __initdata tag"
This reverts commit 9d046ccb98085f1d437585f84748c783a04ba240.
Commit 9d046ccb98085 marks all state tables with __initdata, but
the state table may be accessed when bringing a CPU online, which then
causes a system crash as shown below:
[ 204.188841] BUG: unable to handle kernel paging request at ffffffff8227cce8
[ 204.196844] IP: [<ffffffff814aa1c0>] intel_idle_cpu_init+0x40/0x130
[ 204.203996] PGD 1e11067 PUD 1e12063 PMD 455859063 PTE 800000000227c062
[ 204.211638] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 204.216975] Modules linked in: x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd gpio_ich microcode joydev sb_edac edac_core ipmi_si lpc_ich ipmi_msghandler lp tpm_tis parport wmi mac_hid acpi_pad hid_generic ixgbe isci usbhid dca hid libsas ptp ahci libahci scsi_transport_sas megaraid_sas pps_core mdio
[ 204.262815] CPU: 11 PID: 1489 Comm: bash Not tainted 3.13.0-rc7+ #48
[ 204.269993] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.L09.1312061514 12/06/2013
[ 204.281646] task: ffff8804303a24a0 ti: ffff880440fac000 task.ti: ffff880440fac000
[ 204.290311] RIP: 0010:[<ffffffff814aa1c0>] [<ffffffff814aa1c0>] intel_idle_cpu_init+0x40/0x130
[ 204.300184] RSP: 0018:ffff880440fadd28 EFLAGS: 00010286
[ 204.306192] RAX: ffffffff8227cca0 RBX: ffffe8fff1a03400 RCX: 0000000000000007
[ 204.314244] RDX: ffff88045f400000 RSI: 0000000000000009 RDI: 0000000000001120
[ 204.322296] RBP: ffff880440fadd38 R08: 0000000000000000 R09: 0000000000000001
[ 204.330411] R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000001e
[ 204.338482] R13: 00000000ffffffdb R14: 0000000000000001 R15: 0000000000000000
[ 204.346743] FS: 00007f64f7b0c740(0000) GS:ffff88045ce00000(0000) knlGS:0000000000000000
[ 204.355919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 204.362449] CR2: ffffffff8227cce8 CR3: 0000000444ab0000 CR4: 00000000001407e0
[ 204.370520] Stack:
[ 204.372853] 000000000000001e ffffffff81f10240 ffff880440fadd50 ffffffff814aa307
[ 204.381519] ffffffff81ea80e0 ffff880440fadda0 ffffffff8185a230 0000000000000000
[ 204.390196] 000000000000001e 0000000000000002 0000000000000002 0000000000000000
[ 204.398856] Call Trace:
[ 204.401683] [<ffffffff814aa307>] cpu_hotplug_notify+0x57/0x70
[ 204.408638] [<ffffffff8185a230>] notifier_call_chain+0x100/0x150
[ 204.415553] [<ffffffff810a7dae>] __raw_notifier_call_chain+0xe/0x10
[ 204.422772] [<ffffffff81072163>] cpu_notify+0x23/0x50
[ 204.428616] [<ffffffff810723b2>] _cpu_up+0x132/0x1a0
[ 204.434361] [<ffffffff8107249d>] cpu_up+0x7d/0xa0
[ 204.439819] [<ffffffff81836c9c>] cpu_subsys_online+0x3c/0x90
[ 204.446345] [<ffffffff81554625>] device_online+0x45/0xa0
[ 204.452471] [<ffffffff815546ce>] online_store+0x4e/0x80
[ 204.458511] [<ffffffff815519a8>] dev_attr_store+0x18/0x30
[ 204.464744] [<ffffffff812a68f1>] sysfs_write_file+0x151/0x1c0
[ 204.471681] [<ffffffff81217ef1>] vfs_write+0xe1/0x160
[ 204.477524] [<ffffffff8121889c>] SyS_write+0x4c/0x90
[ 204.483270] [<ffffffff8185f2ed>] system_call_fastpath+0x1a/0x1f
[ 204.490081] Code: 41 54 41 89 fc 8b 3d 48 25 85 01 53 48 8b 1d 30 25 85 01 48 03 1c c5 40 90 fb 81 48 8b 05 19 25 85 01 c7 43 0c 01 00 00 00 66 90 <48> 83 78 48 00 74 4f 41 83 c0 01 41 39 f0 7e 10 48 c7 c7 38 79
[ 204.515723] RIP [<ffffffff814aa1c0>] intel_idle_cpu_init+0x40/0x130
[ 204.522996] RSP <ffff880440fadd28>
[ 204.526976] CR2: ffffffff8227cce8
[ 204.530766] ---[ end trace 336f56cc3d1cfc8c ]---
Fixes: 9d046ccb98085 (intel_idle: mark states tables with __initdata tag)
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Linus Torvalds [Thu, 9 Jan 2014 01:09:26 +0000 (09:09 +0800)]
Merge branch 'parisc-3.13' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
"This patch fixes the kmap/kunmap implementation on parisc and finally
makes AIO work on parisc"
* 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Ensure full cache coherency for kmap/kunmap
Linus Torvalds [Thu, 9 Jan 2014 01:08:23 +0000 (09:08 +0800)]
Merge branch 'for-3.13-fixes' of git://git./linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"Late fixes for libata. Nothing too interesting. Adding missing PM
callbacks to sata_sis and an additional PCI ID for ahci"
* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
sata_sis: missing PM support
ahci: add PCI ID for Marvell 88SE9170 SATA controller
John David Anglin [Mon, 6 Jan 2014 02:25:00 +0000 (21:25 -0500)]
parisc: Ensure full cache coherency for kmap/kunmap
Helge Deller noted a few weeks ago problems with the AIO support on
parisc. This change is the result of numerous iterations on how best to
deal with this problem.
The solution adopted here is to provide full cache coherency in a
uniform manner on all parisc systems. This involves calling
flush_dcache_page() on kmap operations and flush_kernel_dcache_page() on
kunmap operations. As a result, the copy_user_page() and
clear_user_page() functions can be removed and the overall code is
simpler.
The change ensures that both userspace and kernel aliases to a mapped
page are invalidated and flushed. This is necessary for the correct
operation of PA8800 and PA8900 based systems which do not support
inequivalent aliases.
With this change, I have observed no cache related issues on c8000 and
rp3440. It is now possible for example to do kernel builds with "-j64"
on four way systems.
On systems using XFS file systems, the patch recently posted by Mikulas
Patocka to "fix crash using XFS on loopback" is needed to avoid a hang
caused by an uninitialized lock passed to flush_dcache_page() in the
page struct.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Helge Deller <deller@gmx.de>
John W. Linville [Wed, 8 Jan 2014 18:36:17 +0000 (13:36 -0500)]
Merge tag 'nfc-fixes-3.13-1' of git://git./linux/kernel/git/sameo/nfc-fixes
Samuel Ortiz <sameo@linux.intel.com> says:
"This is the first NFC fixes pull request for 3.13.
It only contains one fix for a regression introduced with commit
e29a9e2ae165620d. Without this fix, we can not establish a p2p link in
target mode. Only initiator mode works."
Signed-off-by: John W. Linville <linville@tuxdriver.com>
James Hogan [Mon, 16 Dec 2013 10:41:38 +0000 (10:41 +0000)]
clk: clk-divider: fix divisor > 255 bug
Commit 6d9252bd9a4bb (clk: Add support for power of two type dividers)
merged in v3.6 added the _get_val function to convert a divisor value to
a register field value depending on the flags. However it used the type
u8 for the div field, causing divisors larger than 255 to be masked
and the resultant clock rate to be too high.
E.g. in my case an 11-bit divider was supposed to divide 24.576 MHz down
to 32.768 kHz. The divisor was correctly calculated as 750 (0x2ee). This
was masked to 238 (0xee), resulting in a frequency of 103.26 kHz.
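The arithmetic above can be reproduced in plain userspace C. This is a minimal sketch: `div_to_field_u8()`, `div_to_field_uint()` and `rate_hz()` are hypothetical stand-ins for illustration and are not the kernel's _get_val implementation.

```c
#include <stdint.h>

/* Hypothetical stand-in for the buggy path: the divisor is narrowed to u8. */
static unsigned int div_to_field_u8(unsigned int div)
{
	uint8_t narrowed = (uint8_t)div;	/* keeps only the low 8 bits */

	return narrowed;
}

/* Hypothetical stand-in for the fixed path: the full width is preserved. */
static unsigned int div_to_field_uint(unsigned int div)
{
	return div;
}

/* Output rate in Hz for a parent rate and a divisor (integer division). */
static unsigned long rate_hz(unsigned long parent_hz, unsigned int div)
{
	return parent_hz / div;
}
```

Feeding 750 through the narrowed path gives 238, and dividing 24576000 Hz by 238 gives roughly 103.26 kHz instead of the intended 32768 Hz.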
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Rajendra Nayak <rnayak@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Signed-off-by: Mike Turquette <mturquette@linaro.org>
Dave Airlie [Wed, 8 Jan 2014 07:57:45 +0000 (17:57 +1000)]
Merge branch 'drm-nouveau-next' of git://anongit.freedesktop.org/nouveau/linux-2.6 into drm-fixes
misc fixes for nouveau, one more msi rearm, regression fix for old bioses
crash and leak fixes.
* 'drm-nouveau-next' of git://anongit.freedesktop.org/nouveau/linux-2.6:
drm/nouveau/nouveau: fix memory leak in nouveau_crtc_page_flip()
drm/nouveau/bios: fix offset calculation for BMPv1 bioses
drm/nouveau: return offset of allocated notifier
drm/nouveau/bios: make jump conditional
drm/nvce/mc: fix msi rearm on GF114
drm/nvc0/gr: fix mthd data submission
drm/nouveau: populate master subdev pointer only when fully constructed
Dave Airlie [Wed, 8 Jan 2014 07:55:44 +0000 (17:55 +1000)]
Merge tag 'drm-intel-fixes-2014-01-08' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes
Just a revert (gen4 backlight seems a lost cause) and a tlb coherency fix
for bdw, plus the patch to sign up Jani for co-maintainer. Thanks to Ben
for taking care of -fixes while I've enjoyed a bit of vacation.
* tag 'drm-intel-fixes-2014-01-08' of git://people.freedesktop.org/~danvet/drm-intel:
MAINTAINERS: Updates for drm/i915
Revert "drm/i915: assume all GM45 Acer laptops use inverted backlight PWM"
drm/i915/bdw: Flush system agent on gen8 also
Christian Engelmayer [Sun, 29 Dec 2013 22:08:54 +0000 (23:08 +0100)]
drm/nouveau/nouveau: fix memory leak in nouveau_crtc_page_flip()
Fix a memory leak in the nouveau_crtc_page_flip() error handling path.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ilia Mirkin [Tue, 7 Jan 2014 17:33:59 +0000 (12:33 -0500)]
drm/nouveau/bios: fix offset calculation for BMPv1 bioses
The only BIOS on record that needs the 14 offset has a bios major
version 2 but BMP version 1.01. Another bunch of BIOSes that need the 18
offset have BMP version 2.01 or 5.01 or higher. So instead of looking at the
bios major version, look at the BMP version. BIOSes with BMP version 0
do not contain a detectable script, so always return 0 for them.
See https://bugs.freedesktop.org/show_bug.cgi?id=68835
Reported-by: Mauro Molinari <mauromol@tiscali.it>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
David S. Miller [Tue, 7 Jan 2014 23:38:03 +0000 (18:38 -0500)]
Merge branch 'master' of git://git./linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
The following patchset contains two patches:
* fix the IRC NAT helper which was broken when adding (incomplete) IPv6
support, from Daniel Borkmann.
* Refine the previous bugtrap that Jesper added on Dec 16th to catch
problems with the usage of the sequence adjustment extension in IPVS; it
may spam messages when a real bug is found.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne [Tue, 7 Jan 2014 20:51:36 +0000 (15:51 -0500)]
tipc: correctly unlink packets from deferred packet queue
When we pull a received packet from a link's 'deferred packets' queue
for processing, its 'next' pointer is not cleared, and still refers to
the next packet in that queue, if any. This is incorrect, but caused
no harm before commit 40ba3cdf542a469aaa9083fa041656e59b109b90 ("tipc:
message reassembly using fragment chain") was introduced. After that
commit, it may sometimes lead to the following oops:
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: tipc
CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W 3.13.0-rc2+ #6
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff880017af4880 ti: ffff880017aee000 task.ti: ffff880017aee000
RIP: 0010:[<ffffffff81710694>] [<ffffffff81710694>] skb_try_coalesce+0x44/0x3d0
RSP: 0018:ffff880016603a78 EFLAGS: 00010212
RAX: 6b6b6b6bd6d6d6d6 RBX: ffff880013106ac0 RCX: ffff880016603ad0
RDX: ffff880016603ad7 RSI: ffff88001223ed00 RDI: ffff880013106ac0
RBP: ffff880016603ab8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88001223ed00
R13: ffff880016603ad0 R14: 000000000000058c R15: ffff880012297650
FS: 0000000000000000(0000) GS:ffff880016600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000805b000 CR3: 0000000011f5d000 CR4: 00000000000006e0
Stack:
ffff880016603a88 ffffffff810a38ed ffff880016603aa8 ffff88001223ed00
0000000000000001 ffff880012297648 ffff880016603b68 ffff880012297650
ffff880016603b08 ffffffffa0006c51 ffff880016603b08 00ffffffa00005fc
Call Trace:
<IRQ>
[<ffffffff810a38ed>] ? trace_hardirqs_on+0xd/0x10
[<ffffffffa0006c51>] tipc_link_recv_fragment+0xd1/0x1b0 [tipc]
[<ffffffffa0007214>] tipc_recv_msg+0x4e4/0x920 [tipc]
[<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
[<ffffffffa000177c>] tipc_l2_rcv_msg+0xcc/0x250 [tipc]
[<ffffffffa00016f0>] ? tipc_l2_rcv_msg+0x40/0x250 [tipc]
[<ffffffff8171e65b>] __netif_receive_skb_core+0x80b/0xd00
[<ffffffff8171df94>] ? __netif_receive_skb_core+0x144/0xd00
[<ffffffff8171eb76>] __netif_receive_skb+0x26/0x70
[<ffffffff8171ed6d>] netif_receive_skb+0x2d/0x200
[<ffffffff8171fe70>] napi_gro_receive+0xb0/0x130
[<ffffffff815647c2>] e1000_clean_rx_irq+0x2c2/0x530
[<ffffffff81565986>] e1000_clean+0x266/0x9c0
[<ffffffff81985f7b>] ? notifier_call_chain+0x2b/0x160
[<ffffffff8171f971>] net_rx_action+0x141/0x310
[<ffffffff81051c1b>] __do_softirq+0xeb/0x480
[<ffffffff819817bb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff810b8c42>] ? handle_fasteoi_irq+0x72/0x100
[<ffffffff81052346>] irq_exit+0x96/0xc0
[<ffffffff8198cbc3>] do_IRQ+0x63/0xe0
[<ffffffff81981def>] common_interrupt+0x6f/0x6f
<EOI>
This happens when the last fragment of a message has passed through the
receiving link's 'deferred packets' queue, and at least one other
packet was added to that queue while it was there. After the fragment
chain with the complete message has been successfully delivered to the
receiving socket, it is released. Since the 'next' pointer of the last
fragment in the released chain is now non-NULL, we get the crash shown
above.
We fix this by clearing the 'next' pointer of all received packets,
including those being pulled from the 'deferred' queue, before they
undergo any further processing.
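The idea of clearing the link while unlinking can be shown with a minimal userspace sketch. `struct pkt` and `queue_pull_head()` are hypothetical simplifications; the kernel code operates on struct sk_buff and is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical, simplified packet; the kernel uses struct sk_buff. */
struct pkt {
	int id;
	struct pkt *next;
};

/*
 * Remove and return the head of a singly linked queue.
 * The stale 'next' link is cleared so the packet no longer
 * references whatever is still queued behind it.
 */
static struct pkt *queue_pull_head(struct pkt **head)
{
	struct pkt *p = *head;

	if (p == NULL)
		return NULL;
	*head = p->next;
	p->next = NULL;
	return p;
}
```

A packet returned by this helper can be appended to another chain and later released without dragging along packets that still sit in the original queue.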
Fixes: 40ba3cdf542a4 ("tipc: message reassembly using fragment chain")
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li RongQing [Tue, 7 Jan 2014 07:39:43 +0000 (15:39 +0800)]
ipv6: pcpu_tstats.syncp should be initialised in ip6_vti.c
initialise pcpu_tstats.syncp to kill the calltrace
[ 11.973950] Call Trace:
[ 11.973950] [<819bbaff>] dump_stack+0x48/0x60
[ 11.973950] [<81078dcf>] __lock_acquire.isra.22+0x1bf/0xc10
[ 11.973950] [<81079fa7>] lock_acquire+0x77/0xa0
[ 11.973950] [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[ 11.973950] [<8183862d>] ip_tunnel_get_stats64+0x6d/0x230
[ 11.973950] [<817ca7ab>] ? dev_get_stats+0xcb/0x130
[ 11.973950] [<811cf8c1>] ? __nla_reserve+0x21/0xd0
[ 11.973950] [<817ca7ab>] dev_get_stats+0xcb/0x130
[ 11.973950] [<817d5409>] rtnl_fill_ifinfo+0x569/0xe20
[ 11.973950] [<810352e0>] ? kvm_clock_read+0x20/0x30
[ 11.973950] [<81008e38>] ? sched_clock+0x8/0x10
[ 11.973950] [<8106ba45>] ? sched_clock_local+0x25/0x170
[ 11.973950] [<810da6bd>] ? __kmalloc+0x3d/0x90
[ 11.973950] [<817b8c10>] ? __kmalloc_reserve.isra.41+0x20/0x70
[ 11.973950] [<810da81a>] ? slob_alloc_node+0x2a/0x60
[ 11.973950] [<817b919a>] ? __alloc_skb+0x6a/0x2b0
[ 11.973950] [<817d8795>] rtmsg_ifinfo+0x65/0xe0
[ 11.973950] [<817cbd31>] register_netdevice+0x531/0x5a0
[ 11.973950] [<81892b87>] ? ip6_tnl_get_cap+0x27/0x90
[ 11.973950] [<817cbdb6>] register_netdev+0x16/0x30
[ 11.973950] [<81f574a6>] vti6_init_net+0x1c4/0x1d4
[ 11.973950] [<81f573af>] ? vti6_init_net+0xcd/0x1d4
[ 11.973950] [<817c16df>] ops_init.constprop.11+0x17f/0x1c0
[ 11.973950] [<817c1779>] register_pernet_operations.isra.9+0x59/0x90
[ 11.973950] [<817c18d1>] register_pernet_device+0x21/0x60
[ 11.973950] [<81f574b6>] ? vti6_init_net+0x1d4/0x1d4
[ 11.973950] [<81f574c7>] vti6_tunnel_init+0x11/0x68
[ 11.973950] [<81f572a1>] ? mip6_init+0x73/0xb4
[ 11.973950] [<81f0cba4>] do_one_initcall+0xbb/0x15b
[ 11.973950] [<811a00d8>] ? sha_transform+0x528/0x1150
[ 11.973950] [<81f0c544>] ? repair_env_string+0x12/0x51
[ 11.973950] [<8105c30d>] ? parse_args+0x2ad/0x440
[ 11.973950] [<810546be>] ? __usermodehelper_set_disable_depth+0x3e/0x50
[ 11.973950] [<81f0cd27>] kernel_init_freeable+0xe3/0x182
[ 11.973950] [<81f0c532>] ? do_early_param+0x7a/0x7a
[ 11.973950] [<819b5b1b>] kernel_init+0xb/0x100
[ 11.973950] [<819cebf7>] ret_from_kernel_thread+0x1b/0x28
[ 11.973950] [<819b5b10>] ? rest_init+0xc0/0xc0
Before 469bdcefdc ("ipv6: fix the use of pcpu_tstats in ip6_vti.c"),
pcpu_tstats.syncp was not used to protect the 64-bit elements of
pcpu_tstats, so this call trace did not appear.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Vetter [Tue, 7 Jan 2014 08:24:31 +0000 (09:24 +0100)]
MAINTAINERS: Updates for drm/i915
Jani for co-maintainer!
Jani has been a really active bug-scrubber in the past few months.
I've asked him whether he wants to do this in a more official capacity
and he agreed. I've already chatted with Dave and Jesse and they
support this.
Note that everyone can't now just relax because "Jani will do all the
bug scrubbing" - au contraire expect more nagging and poking now that
we have more bandwidth.
Longer-term the plan is to share more of the maintainer duties, but we
need to fix up the infrastructure a bit first (like moving the git
repo to a common location).
While at it also add the newly set-up patchwork instance.
Cc: Dave Airlie <airlied@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Bob Gleitsmann [Sun, 5 Jan 2014 22:59:07 +0000 (08:59 +1000)]
drm/nouveau: return offset of allocated notifier
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ilia Mirkin [Mon, 6 Jan 2014 01:07:02 +0000 (20:07 -0500)]
drm/nouveau/bios: make jump conditional
This fixes a hang in VBIOS scripts of the form "condition; jump".
The jump used to always be executed, while now it will only be
executed if the condition is true.
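The behaviour change can be sketched as a tiny interpreter step in userspace C. `struct script`, `JUMP_LEN` and `op_jump()` are hypothetical and only illustrate the "condition; jump" pattern; they are not nouveau's VBIOS init code, and the opcode length of 3 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interpreter state; not nouveau's data structures. */
struct script {
	size_t offset;	/* current position in the script */
	bool execute;	/* result of the most recent condition */
};

/* Size of the jump opcode plus its operand in this sketch (assumed). */
#define JUMP_LEN 3

/* Take the jump only when the preceding condition held. */
static void op_jump(struct script *s, size_t target)
{
	if (s->execute)
		s->offset = target;
	else
		s->offset += JUMP_LEN;	/* fall through to the next opcode */
}
```

An unconditional version would always assign `target`, which loops forever when a script relies on a false condition to leave a backward jump.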
See https://bugs.freedesktop.org/show_bug.cgi?id=72943
Reported-by: Darcy Brás da Silva <dardevelin@cidadecool.com>
Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Cc: stable@vger.kernel.org
Sid Boyce [Sun, 5 Jan 2014 23:12:05 +0000 (09:12 +1000)]
drm/nvce/mc: fix msi rearm on GF114
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Kelly Doran [Fri, 20 Dec 2013 17:07:26 +0000 (11:07 -0600)]
drm/nvc0/gr: fix mthd data submission
If the initial data element is 0, it will never be written, even
though the value from the previous method may be there.
Signed-off-by: Kelly Doran <kel.p.doran@gmail.com>
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ben Skeggs [Tue, 26 Nov 2013 23:46:56 +0000 (09:46 +1000)]
drm/nouveau: populate master subdev pointer only when fully constructed
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Linus Torvalds [Tue, 7 Jan 2014 00:22:42 +0000 (08:22 +0800)]
Merge tag 'ext4_for_linus_stable' of git://git./linux/kernel/git/tytso/ext4
Pull ext4 bugfix from Ted Ts'o:
"Fix a regression introduced in v3.13-rc6"
* tag 'ext4_for_linus_stable' of http://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix bigalloc regression
Linus Torvalds [Tue, 7 Jan 2014 00:16:28 +0000 (08:16 +0800)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"I'm hoping this is the very last batch of networking fixes for 3.13,
here goes nothing:
1) Fix crashes in VLAN's header_ops passthru.
2) Bridge multicast code needs to use BH spinlocks to prevent
deadlocks with timers. From Curt Brune.
3) ipv6 tunnels lack proper synchronization when updating percpu
statistics. From Li RongQing.
4) Fixes to bnx2x driver from Yaniv Rosner, Dmitry Kravkov and Michal
Kalderon.
5) Avoid undefined operator evaluation order in llc code, from Daniel
Borkmann.
6) Error paths in various GSO offload paths do not unwind properly,
in particular they must undo any modifications they have made to
the SKB. From Wei-Chun Chao.
7) Fix RX refill races during restore in virtio-net, from Jason Wang.
8) Fix SKB use after free in LLC code, from Daniel Borkmann.
9) Missing unlock and OOPS in netpoll code when VLAN tag handling
fails.
10) Fix vxlan device attachment wrt ipv6, from Fan Du.
11) Don't allow creating infiniband links to non-infiniband devices,
from Hangbin Liu.
12) Revert FEC phy reset active low change, it breaks things. From
Fabio Estevam.
13) Fix header pointer handling in 6lowpan header building code, from
Daniel Borkmann.
14) Fix RSS handling in be2net driver, from Vasundhara Volam.
15) Fix modem port indexing in HSO driver, from Dan Williams"
* http://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
bridge: use spin_lock_bh() in br_multicast_set_hash_max
ipv6: don't install anycast address for /128 addresses on routers
hso: fix handling of modem port SERIAL_STATE notifications
isdn: Drop big endian cpp checks from telespci and hfc_pci drivers
be2net: fix max_evt_qs calculation for BE3 in SR-IOV config
be2net: increase the timeout value for loopback-test FW cmd
be2net: disable RSS when number of RXQs is reduced to 1 via set-channels
xen-netback: Include header for vmalloc
net: 6lowpan: fix lowpan_header_create non-compression memcpy call
fec: Revert "fec: Do not assume that PHY reset is active low"
bnx2x: fix VLAN configuration for VFs.
bnx2x: fix AFEX memory overflow
bnx2x: Clean before update RSS arrives
bnx2x: Correct number of MSI-X vectors for VFs
bnx2x: limit number of interrupt vectors for 57711
qlcnic: Fix bug in Tx completion path
infiniband: make sure the src net is infiniband when create new link
{vxlan, inet6} Mark vxlan_dev flags with VXLAN_F_IPV6 properly
cxgb4: allow large buffer size to have page size
netpoll: Fix missing TXQ unlock and OOPS.
...
Rafael J. Wysocki [Mon, 6 Jan 2014 21:49:08 +0000 (22:49 +0100)]
Merge branches 'acpi-battery' and 'pm-cpufreq'
* acpi-battery:
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
* pm-cpufreq:
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
Curt Brune [Mon, 6 Jan 2014 19:00:32 +0000 (11:00 -0800)]
bridge: use spin_lock_bh() in br_multicast_set_hash_max
br_multicast_set_hash_max() is called from process context in
net/bridge/br_sysfs_br.c by the sysfs store_hash_max() function.
br_multicast_set_hash_max() calls spin_lock(&br->multicast_lock),
which can deadlock the CPU if a softirq that also tries to take the
same lock interrupts br_multicast_set_hash_max() while the lock is
held. This can happen quite easily when any of the bridge multicast
timers expires, since the timer handlers try to take the same lock.
The fix here is to use spin_lock_bh(), preventing softirqs from
executing on this CPU while the lock is held.
Steps to reproduce:
1. Create a bridge with several interfaces (I used 4).
2. Set the "multicast query interval" to a low number, like 2.
3. Enable the bridge as a multicast querier.
4. Repeatedly set the bridge hash_max parameter via sysfs.
# brctl addbr br0
# brctl addif br0 eth1 eth2 eth3 eth4
# brctl setmcqi br0 2
# brctl setmcquerier br0 1
# while true ; do echo 4096 > /sys/class/net/br0/bridge/hash_max; done
Signed-off-by: Curt Brune <curt@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hannes Frederic Sowa [Mon, 6 Jan 2014 16:53:14 +0000 (17:53 +0100)]
ipv6: don't install anycast address for /128 addresses on routers
It does not make sense to create an anycast address for an /128-prefix.
Suppress it.
As 32019e651c6fce ("ipv6: Do not leave router anycast address for /127
prefixes.") shows we also may not leave them, because we could accidentally
remove an anycast address the user has allocated or got added via another
prefix.
Cc: François-Xavier Le Bail <fx.lebail@yahoo.com>
Cc: Thomas Haller <thaller@redhat.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Williams [Mon, 6 Jan 2014 16:07:29 +0000 (10:07 -0600)]
hso: fix handling of modem port SERIAL_STATE notifications
The existing serial state notification handling expected older Option
devices, having a hardcoded assumption that the Modem port was always
USB interface #2. That isn't true for devices from the past few years.
hso_serial_state_notification is a local cache of a USB Communications
Interface Class SERIAL_STATE notification from the device, and the
USB CDC specification (section 6.3, table 67 "Class-Specific Notifications")
defines wIndex as the USB interface the event applies to. For hso
devices this will always be the Modem port, as the Modem port is the
only port which is set up to receive them by the driver.
So instead of always expecting USB interface #2, validate the
notification with the actual USB interface number of the Modem port.
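The difference between the two checks can be sketched in userspace C. `struct serial_state_hdr`, `valid_hardcoded()` and `valid_for_port()` are hypothetical names introduced for illustration and are not the hso driver's code; the sketch also ignores the little-endian wire encoding of wIndex.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a SERIAL_STATE notification header. */
struct serial_state_hdr {
	uint16_t wIndex;	/* USB interface the event applies to */
};

/* Old behaviour per the text: the Modem port is assumed to be interface 2. */
static bool valid_hardcoded(const struct serial_state_hdr *n)
{
	return n->wIndex == 2;
}

/* New behaviour per the text: compare with the Modem port's real interface. */
static bool valid_for_port(const struct serial_state_hdr *n, uint16_t modem_if)
{
	return n->wIndex == modem_if;
}
```

For a device whose Modem port sits on, say, interface 4, the hardcoded check rejects every notification while the per-port check accepts them.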
Signed-off-by: Dan Williams <dcbw@redhat.com>
Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lan Tianyu [Mon, 6 Jan 2014 14:50:37 +0000 (22:50 +0800)]
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
The AML method _BIX of NEC LZ750/LS returns a broken package which
skips the first member "Revision" (ACPI 5.0, Table 10-234).
Add a quirk for this machine to skip the member "Revision" when parsing
the package returned by _BIX.
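One way to picture such a quirk is a parser that starts filling at the second field when the package is known to be short by one leading member. The field list, `parse_bix()` and its `broken` flag are hypothetical and heavily shortened; this is a sketch of the idea, not the ACPI battery driver's code.

```c
#include <stddef.h>

/* Hypothetical, shortened field list; the real _BIX package has more members. */
enum { F_REVISION, F_POWER_UNIT, F_DESIGN_CAPACITY, F_COUNT };

/*
 * Copy package elements into fields[]. When 'broken' is set, the package
 * is assumed to lack the leading Revision member, so filling starts at
 * the second field. Returns the number of fields filled.
 */
static size_t parse_bix(const int *pkg, size_t pkg_len, int broken,
			int fields[F_COUNT])
{
	size_t first = broken ? F_POWER_UNIT : F_REVISION;
	size_t i;

	for (i = 0; i < pkg_len && first + i < F_COUNT; i++)
		fields[first + i] = pkg[i];
	return i;
}
```

Without the quirk, every value from the short package would land one field too early, so the design capacity would be read from the wrong slot.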
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=67351
Reported-and-tested-by: Francisco Castro <fcr@adinet.com.uy>
Cc: 3.8+ <stable@vger.kernel.org> # 3.8+
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dirk Brandewie [Mon, 6 Jan 2014 18:59:16 +0000 (10:59 -0800)]
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
KVM environments do not support APERF/MPERF MSRs. intel_pstate cannot
operate without these registers.
The previous validity checks in intel_pstate_msrs_not_valid() are
insufficient in nested KVMs.
References: https://bugzilla.redhat.com/show_bug.cgi?id=1046317
Signed-off-by: Dirk Brandewie <dirk.j.brandewie@intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Guenter Roeck [Mon, 6 Jan 2014 04:31:39 +0000 (20:31 -0800)]
isdn: Drop big endian cpp checks from telespci and hfc_pci drivers
With arm:allmodconfig, building the Teles PCI driver fails with
telespci.c:294:2: error: #error "not running on big endian machines now"
Similarly, building the driver for HFC PCI-Bus cards fails with
hfc_pci.c:1647:2: error: #error "not running on big endian machines now"
Remove the big endian cpp check from both drivers to fix the build errors.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Mon, 6 Jan 2014 19:20:07 +0000 (14:20 -0500)]
Merge branch 'for-john' of git://git./linux/kernel/git/iwlwifi/iwlwifi-fixes
John W. Linville [Mon, 6 Jan 2014 19:19:18 +0000 (14:19 -0500)]
Merge branch 'for-john' of git://git./linux/kernel/git/jberg/mac80211
Eric Whitney [Mon, 6 Jan 2014 19:00:23 +0000 (14:00 -0500)]
ext4: fix bigalloc regression
Commit f5a44db5d2 introduced a regression on filesystems created with
the bigalloc feature (cluster size > blocksize). It causes xfstests
generic/006 and /013 to fail with an unexpected JBD2 failure and
transaction abort that leaves the test file system in a read only state.
Other xfstests run on bigalloc file systems are likely to fail as well.
The cause is the accidental use of a cluster mask where a cluster
offset was needed in ext4_ext_map_blocks().
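To make the distinction concrete, here is a small sketch of the two operations, assuming the cluster ratio (blocks per cluster) is a power of two. The helper names are hypothetical and are not the ext4 macros; they only show that a mask and an offset keep opposite halves of the block number.

```c
#include <stdint.h>

/* cluster_ratio = blocks per cluster; assumed to be a power of two. */

/* Cluster mask: clears the low bits, giving the first block of the cluster. */
static uint32_t lblk_cluster_start(uint32_t lblk, uint32_t cluster_ratio)
{
    return lblk & ~(cluster_ratio - 1);
}

/* Cluster offset: keeps only the low bits, giving the position in the cluster. */
static uint32_t lblk_cluster_offset(uint32_t lblk, uint32_t cluster_ratio)
{
    return lblk & (cluster_ratio - 1);
}
```

For logical block 35 with 16 blocks per cluster, the first helper yields 32 and the second yields 3, so using one where the other is expected produces a wrong block position.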
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
David S. Miller [Mon, 6 Jan 2014 18:09:26 +0000 (13:09 -0500)]
Merge branch 'be2net'
Sathya Perla says:
====================
be2net: patch set
Pls apply the following bug fixes to the 'net' tree. Thanks.
Suresh Reddy (2):
  be2net: increase the timeout value for loopback-test FW cmd
  be2net: fix max_evt_qs calculation for BE3 in SR-IOV config
Vasundhara Volam (1):
  be2net: disable RSS when number of RXQs is reduced to 1 via set-channels
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Mon, 6 Jan 2014 07:32:25 +0000 (13:02 +0530)]
be2net: fix max_evt_qs calculation for BE3 in SR-IOV config
The driver wrongly assumes 16 EQs/vectors are available for each BE3 PF.
When SR-IOV is enabled, a BE3 PF can support only a max of 8 EQs.
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Mon, 6 Jan 2014 07:32:24 +0000 (13:02 +0530)]
be2net: increase the timeout value for loopback-test FW cmd
The loopback test FW cmd may need up to 15 seconds to complete on
certain PHYs. This patch also fixes the name of the completion variable
used to synchronize FW cmd completions, as it is no longer used by the
flashing cmd alone.
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Mon, 6 Jan 2014 07:32:23 +0000 (13:02 +0530)]
be2net: disable RSS when number of RXQs is reduced to 1 via set-channels
When *only* the default RXQ is used, the RSS policy must be disabled so
that all IP and non-IP traffic is placed into the default RXQ. If not,
IP traffic is dropped.
Also, issue the RSS_CONFIG cmd only if FW advertises RSS capability for
the interface.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer [Sat, 4 Jan 2014 13:10:43 +0000 (14:10 +0100)]
netfilter: only warn once on wrong seqadj usage
Avoid potentially spamming the kernel log with WARN splash messages
when catching wrong usage of seqadj, by simply using WARN_ONCE.
This is a followup to commit db12cf274353 (netfilter: WARN about
wrong usage of sequence number adjustments)
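The warn-once idea can be sketched in userspace C as follows. This is a stand-in, not the kernel's WARN_ONCE(): the warn_once() macro, the warnings_printed counter, and check_seqadj() are hypothetical names used only for illustration. Each expansion of the macro carries its own static flag, so a given call site reports at most once.

```c
#include <stdio.h>

static int warnings_printed;  /* only here so the sketch is observable */

/*
 * Report msg the first time cond is true at this call site, then stay
 * silent. The static flag lives inside the expanded block, so every
 * place the macro is used gets its own flag.
 */
#define warn_once(cond, msg)                              \
    do {                                                  \
        static int warned_;                               \
        if ((cond) && !warned_) {                         \
            warned_ = 1;                                  \
            warnings_printed++;                           \
            fprintf(stderr, "WARNING: %s\n", (msg));      \
        }                                                 \
    } while (0)

/* Hypothetical caller: returns 1 if the usage was wrong, 0 otherwise. */
static int check_seqadj(const void *seqadj)
{
    int bad = (seqadj == NULL);

    warn_once(bad, "missing seqadj extension");
    return bad;
}
```

Calling check_seqadj(NULL) repeatedly still returns 1 every time, but only the first call writes to the log.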
Suggested-by: Flavio Leitner <fbl@redhat.com>
Suggested-by: Daniel Borkmann <dborkman@redhat.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Daniel Borkmann [Tue, 31 Dec 2013 15:28:39 +0000 (16:28 +0100)]
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
Commit 5901b6be885e attempted to introduce IPv6 support into
IRC NAT helper. By doing so, the following code seemed to be removed
by accident:
  ip = ntohl(exp->master->tuplehash[IP_CT_DIR_REPLY].tuple.dst.u3.ip);
  sprintf(buffer, "%u %u", ip, port);
  pr_debug("nf_nat_irc: inserting '%s' == %pI4, port %u\n", buffer, &ip, port);
As a result, buffer[] was left uninitialized and contained some stack
value. When we call nf_nat_mangle_tcp_packet(), we call strlen(buffer)
on exactly this uninitialized buffer. If we are unlucky and the skb has
enough tailroom, we overwrite packet contents with, and thereby leak,
values that sit on our stack, and send that out to the receiver.
Since the rather informal DCC spec [1] does not seem to specify
IPv6 support right now, we log such occurrences so that admins can
act accordingly, and drop the packet. I've looked into the XChat source,
and IPv6 is not supported there: addresses are stored in a u32 and
printed via a %u format string.
Therefore, restore the old IPv4 behaviour and use snprintf(). The
IRC helper does not support IPv6 for now. With this, we can safely use
strlen(buffer) in nf_nat_mangle_tcp_packet() and prevent a buffer
overflow. Also simplify some code as we now have the ct variable anyway.
[1] http://www.irchelp.org/irchelp/rfc/ctcpspec.html
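A minimal userspace sketch of the bounded formatting step, assuming the address is already in host byte order (as after ntohl() in the code quoted above). format_dcc_addr() is a hypothetical helper, not a function from the patch; it shows how snprintf() guarantees a NUL-terminated buffer so that a later strlen() is safe.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a DCC "address port" pair: the IPv4 address as one unsigned
 * 32-bit number in host byte order, then the port.
 * Returns the string length, or -1 if buf was too small. snprintf()
 * still NUL-terminates buf in that case whenever len > 0.
 */
static int format_dcc_addr(char *buf, size_t len, uint32_t ip_host_order,
                           uint16_t port)
{
    int n = snprintf(buf, len, "%u %u",
                     (unsigned int)ip_host_order, (unsigned int)port);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

A caller would treat -1 as a reason to drop the packet rather than pass a truncated string on.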
Fixes: 5901b6be885e ("netfilter: nf_nat: support IPv6 in IRC NAT helper")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Harald Welte <laforge@gnumonks.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>