Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: /debug/sched_features
provide a text based interface to the scheduler features; this saves the
'user' from setting bits using decimal arithmetic.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Sat, 19 Apr 2008 07:25:58 +0000 (09:25 +0200)]
sched: add SCHED_FEAT_DEADLINE
unused at the moment.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: debug: show a weight tree
Print a tree of weights.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: fair: weight calculations
In order to level the hierarchy, we need to calculate load based on the
root view. That is, each task's load is in the same unit.
A
/ \
B 1
/ \
2 3
To compute 1's load we do:
weight(1)
--------------
rq_weight(A)
To compute 2's load we do:
weight(2) weight(B)
------------ * -----------
rq_weight(B) rw_weight(A)
This yields load fractions in comparable units.
The consequence is that it changes virtual time. We used to have:
time_{i}
vtime_{i} = ------------
weight_{i}
vtime = \Sum vtime_{i} = time / rq_weight.
But with the new way of load calculation we get that vtime equals time.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: fair-group: de-couple load-balancing from the rb-trees
De-couple load-balancing from the rb-trees, so that I can change their
organization.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: fair-group scheduling vs latency
Currently FAIR_GROUP sched grows the scheduler latency outside of
sysctl_sched_latency, invert this so it stays within.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: rt-group: optimize dequeue_rt_stack
Now that the group hierarchy can have an arbitrary depth the O(n^2) nature
of RT task dequeues will really hurt. Optimize this by providing space to
store the tree path, so we can walk it the other way.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: debug: add some debug code to handle the full hierarchy
Add some extra debug output so we can get a better overview of the
full hierarchy.
We print the cgroup path after each cfs_rq, so we can see what group
we're looking at.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: fair-group: SMP-nice for group scheduling
Implement SMP nice support for the full group hierarchy.
On each load-balance action, compile a sched_domain wide view of the full
task_group tree. We compute the domain wide view when walking down the
hierarchy, and readjust the weights when walking back up.
After collecting and readjusting the domain wide view, we try to balance the
tasks within the task_groups. The current approach is a naively balance each
task group until we've moved the targeted amount of load.
Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP
paper.
XXX: there will be some numerical issues due to the limited nature of
SCHED_LOAD_SCALE wrt to representing a task_groups influence on the
total weight. When the tree is deep enough, or the task weight small
enough, we'll run out of bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Abhishek Chandra <chandra@cs.umn.edu>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hidetoshi Seto [Tue, 15 Apr 2008 05:04:23 +0000 (14:04 +0900)]
sched, cpuset: customize sched domains, core
[rebased for sched-devel/latest]
- Add a new cpuset file, having levels:
sched_relax_domain_level
- Modify partition_sched_domains() and build_sched_domains()
to take attributes parameter passed from cpuset.
- Fill newidle_idx for node domains which currently unused but
might be required if sched_relax_domain_level become higher.
- We can change the default level by boot option 'relax_domain_level='.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hidetoshi Seto [Tue, 15 Apr 2008 05:03:17 +0000 (14:03 +0900)]
sched, cpuset: customize sched domains, docs
This patch introduces new feature of cpuset - sched domain customization.
This version provides a per-cpuset file 'sched_relax_domain_level' that
enable us to change the searching range of scheduler, which used to limit
how many cpus the scheduler searches at some schedule events, such as
wakening task and running out of runqueue.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: prepatory code movement
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: rt: multi level group constraints
multi level rt constraints
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: task_group hierarchy
Add the full parent<->child relation thing into task_groups as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:45:00 +0000 (19:45 +0200)]
sched: fix the task_group hierarchy for UID grouping
UID grouping doesn't actually have a task_group representing the root of
the task_group tree. Add one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dhaval Giani [Sat, 19 Apr 2008 17:44:59 +0000 (19:44 +0200)]
sched: allow the group scheduler to have multiple levels
This patch makes the group scheduler multi hierarchy aware.
[a.p.zijlstra@chello.nl: rt-parts and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dhaval Giani [Sat, 19 Apr 2008 17:44:59 +0000 (19:44 +0200)]
sched: mix tasks and groups
This patch allows tasks and groups to exist in the same cfs_rq. With this
change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
fairness model where M tasks and N groups exist at the cfs_rq level.
[a.p.zijlstra@chello.nl: rt bits and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Tue, 25 Mar 2008 12:51:45 +0000 (13:51 +0100)]
sched: fix checks
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Wed, 19 Mar 2008 10:43:36 +0000 (11:43 +0100)]
sched: old sleeper bonus
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Wed, 26 Mar 2008 21:23:49 +0000 (14:23 -0700)]
sched: add new set_cpus_allowed_ptr function
Add a new function that accepts a pointer to the "newly allowed cpus"
cpumask argument.
int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
The current set_cpus_allowed() function is modified to use the above
but this does not result in an ABI change. And with some compiler
optimization help, it may not introduce any additional overhead.
Additionally, to enforce the read only nature of the new_mask arg, the
"const" property is migrated to sub-functions called by set_cpus_allowed.
This silences compiler warnings.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Wed, 26 Mar 2008 21:23:48 +0000 (14:23 -0700)]
init: move setup of nr_cpu_ids to as early as possible
Move the setting of nr_cpu_ids from sched_init() to start_kernel()
so that it's available as early as possible.
Note that an arch has the option of setting it even earlier if need be,
but it should not result in a different value than the setup_nr_cpu_ids()
function.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 15 Apr 2008 23:35:52 +0000 (16:35 -0700)]
sched: remove another cpumask_t variable from stack
* Remove another cpumask_t variable from stack that was missed in the
last kernel_sched_c updates.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 8 Apr 2008 18:43:04 +0000 (11:43 -0700)]
cpumask: add show cpu map functions
* Add cpu_sysdev_class functions to display the following maps
with cpulist_scnprintf().
cpu_online_map
cpu_present_map
cpu_possible_map
* Small change to include/linux/sysdev.h to allow the attribute
name and label to be different (to avoid collision with the
"attr_online" entry for bringing cpus on- and off-line.)
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 8 Apr 2008 18:43:03 +0000 (11:43 -0700)]
cpumask: use new cpus_scnprintf function
* Cleaned up references to cpumask_scnprintf() and added new
cpulist_scnprintf() interfaces where appropriate.
* Fix some small bugs (or code efficiency improvments) for various uses
of cpumask_scnprintf.
* Clean up some checkpatch errors.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 8 Apr 2008 18:43:02 +0000 (11:43 -0700)]
x86: modify show_shared_cpu_map in intel_cacheinfo
* Removed kmalloc (or local array) in show_shared_cpu_map().
* Added show_shared_cpu_list() function.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:01 +0000 (18:11 -0700)]
x86: convert cpumask_of_cpu macro to allocated array
* Here is a simple patch to use an allocated array of cpumasks to
represent cpumask_of_cpu() instead of constructing one on the stack.
It's based on the Kconfig option "HAVE_CPUMASK_OF_CPU_MAP" which is
currently only set for x86_64 SMP. Otherwise the the existing
cpumask_of_cpu() is used but has been changed to produce an lvalue
so a pointer to it can be used.
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:02 +0000 (18:11 -0700)]
cpumask: add CPU_MASK_ALL_PTR macro
* Add a static cpumask_t variable "CPU_MASK_ALL_PTR" to use as
a pointer reference to CPU_MASK_ALL. This reduces where possible
the instances where CPU_MASK_ALL allocates and fills a large
array on the stack. Used only if NR_CPUS > BITS_PER_LONG.
* Change init/main.c to use new set_cpus_allowed_ptr().
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:11 +0000 (18:11 -0700)]
cpumask: reduce stack usage in SD_x_INIT initializers
* Remove empty cpumask_t (and all non-zero/non-null) variables
in SD_*_INIT macros. Use memset(0) to clear. Also, don't
inline the initializer functions to save on stack space in
build_sched_domains().
* Merge change to include/linux/topology.h that uses the new
node_to_cpumask_ptr function in the nr_cpus_node macro into
this patch.
Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:10 +0000 (18:11 -0700)]
nodemask: use new node_to_cpumask_ptr function
* Use new node_to_cpumask_ptr. This creates a pointer to the
cpumask for a given node. This definition is in mm patch:
asm-generic-add-node_to_cpumask_ptr-macro.patch
* Use new set_cpus_allowed_ptr function.
Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
[x86/latest]: x86: add cpus_scnprintf function
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:08 +0000 (18:11 -0700)]
generic: reduce stack pressure in sched_affinity
* Modify sched_affinity functions to pass cpumask_t variables by reference
instead of by value.
* Use new set_cpus_allowed_ptr function.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: Paul Jackson <pj@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:07 +0000 (18:11 -0700)]
cpuset: modify cpuset_set_cpus_allowed to use cpumask pointer
* Modify cpuset_cpus_allowed to return the currently allowed cpuset
via a pointer argument instead of as the function return value.
* Use new set_cpus_allowed_ptr function.
* Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:06 +0000 (18:11 -0700)]
generic: use new set_cpus_allowed_ptr function
* Use new set_cpus_allowed_ptr() function added by previous patch,
which instead of passing the "newly allowed cpus" cpumask_t arg
by value, pass it by pointer:
-int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
+int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
* Modify CPU_MASK_ALL
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:05 +0000 (18:11 -0700)]
x86: use new set_cpus_allowed_ptr function
* Use new set_cpus_allowed_ptr() function added by previous patch,
which instead of passing the "newly allowed cpus" cpumask_t arg
by value, pass it by pointer:
-int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
+int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
* Cleanup uses of CPU_MASK_ALL.
* Collapse other NR_CPUS changes to arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
Use pointers to cpumask_t arguments whenever possible.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: Len Brown <len.brown@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:04 +0000 (18:11 -0700)]
sched: remove fixed NR_CPUS sized arrays in kernel_sched_c
* Change fixed size arrays to per_cpu variables or dynamically allocated
arrays in sched_init() and sched_init_smp().
(1) static struct sched_entity *init_sched_entity_p[NR_CPUS];
(1) static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
(1) static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
(1) static struct rt_rq *init_rt_rq_p[NR_CPUS];
static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];
(1) - these arrays are allocated via alloc_bootmem_low()
* Change sched_domain_debug_one() to use cpulist_scnprintf instead of
cpumask_scnprintf. This reduces the output buffer required and improves
readability when large NR_CPU count machines arrive.
* In sched_create_group() we allocate new arrays based on nr_cpu_ids.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:12 +0000 (18:11 -0700)]
cpumask: Cleanup more uses of CPU_MASK and NODE_MASK
* Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE,
NODE_MASK_ALL to reduce stack requirements for large NR_CPUS
and MAXNODES counts.
* In some cases, the cpumask variable was initialized but then overwritten
with another value. This is the case for changes like this:
- cpumask_t oldmask = CPU_MASK_ALL;
+ cpumask_t oldmask;
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Sat, 5 Apr 2008 01:11:09 +0000 (18:11 -0700)]
numa: move large array from stack to _initdata section
* Move large array "struct bootnode nodes" from stack to _initdata
section to reduce amount of stack space required.
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Mon, 31 Mar 2008 15:41:55 +0000 (08:41 -0700)]
asm-generic: add node_to_cpumask_ptr macro
Create a simple macro to always return a pointer to the node_to_cpumask(node)
value. This relies on compiler optimization to remove the extra indirection:
#define node_to_cpumask_ptr(v, node) \
cpumask_t _##v = node_to_cpumask(node), *v = &_##v
For those systems with a large cpumask size, then a true pointer
to the array element can be used:
#define node_to_cpumask_ptr(v, node) \
cpumask_t *v = &(node_to_cpumask_map[node])
A node_to_cpumask_ptr_next() macro is provided to access another
node_to_cpumask value.
The other change is to always include asm-generic/topology.h moving the
ifdef CONFIG_NUMA to this same file.
Note: there are no references to either of these new macros in this patch,
only the definition.
Based on 2.6.25-rc5-mm1
# alpha
Cc: Richard Henderson <rth@twiddle.net>
# fujitsu
Cc: David Howells <dhowells@redhat.com>
# ia64
Cc: Tony Luck <tony.luck@intel.com>
# powerpc
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
# sparc
Cc: David S. Miller <davem@davemloft.net>
Cc: William L. Irwin <wli@holomorphy.com>
# x86
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 25 Mar 2008 22:06:59 +0000 (15:06 -0700)]
x86: oprofile: remove NR_CPUS arrays in arch/x86/oprofile/nmi_int.c
Change the following arrays sized by NR_CPUS to be PERCPU variables:
static struct op_msrs cpu_msrs[NR_CPUS];
static unsigned long saved_lvtpc[NR_CPUS];
Also some minor complaints from checkpatch.pl fixed.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
All changes were transparent except for:
static void nmi_shutdown(void)
{
+ struct op_msrs *msrs = &__get_cpu_var(cpu_msrs);
nmi_enabled = 0;
on_each_cpu(nmi_cpu_shutdown, NULL, 0, 1);
unregister_die_notifier(&profile_exceptions_nb);
- model->shutdown(cpu_msrs);
+ model->shutdown(msrs);
free_msrs();
}
The existing code passed a reference to cpu 0's instance of struct op_msrs
to model->shutdown, whilst the other functions are passed a reference to
<this cpu's> instance of a struct op_msrs. This seemed to be a bug to me
even though as long as cpu 0 and <this cpu> are of the same type it would
have the same effect...?
Cc: Philippe Elie <phil.el@wanadoo.fr>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 25 Mar 2008 22:06:56 +0000 (15:06 -0700)]
x86: reduce memory and stack usage in intel_cacheinfo
* Change the following static arrays sized by NR_CPUS to
per_cpu data variables:
_cpuid4_info *cpuid4_info[NR_CPUS];
_index_kobject *index_kobject[NR_CPUS];
kobject * cache_kobject[NR_CPUS];
* Remove the local NR_CPUS array with a kmalloc'd region in
show_shared_cpu_map().
Also some minor complaints from checkpatch.pl fixed.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mike Travis [Tue, 25 Mar 2008 22:06:55 +0000 (15:06 -0700)]
cpumask: add cpumask_scnprintf_len function
Add a new function cpumask_scnprintf_len() to return the number of
characters needed to display "len" cpumask bits. The current method
of allocating NR_CPUS bytes is incorrect as what's really needed is
9 characters per 32-bit word of cpumask bits (8 hex digits plus the
seperator [','] or the terminating NULL.) This function provides the
caller the means to allocate the correct string length.
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gregory Haskins [Tue, 12 Feb 2008 18:30:05 +0000 (13:30 -0500)]
sched: fix cpus_allowed settings
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dhaval Giani [Fri, 29 Feb 2008 04:32:44 +0000 (10:02 +0530)]
sched: allow cpuacct stats to be reset
Currently the schedstats implementation does not allow the statistics
to be reset. This patch aims to allow that.
echo 0 > cpuacct.usage
resets the usage. Any other value is not allowed and returns -EINVAL.
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dhaval Giani [Fri, 29 Feb 2008 04:32:43 +0000 (10:02 +0530)]
sched: cleanup cpuacct variable names
Change the variable names to the common convention for the cpuacct
subsystem.
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Olof Johansson [Tue, 4 Mar 2008 23:23:25 +0000 (15:23 -0800)]
tasklets: execute tasklets in the same order they were queued
I noticed this when looking at an openswan issue. Openswan (ab?)uses the
tasklet API to defer processing of packets in some situations, with one
packet per tasklet_action(). I started noticing sequences of
backwards-ordered sequence numbers coming over the wire, since new tasklets
are always queued at the head of the list but processed sequentially.
Convert it to instead append new entries to the tail of the list. As an
extra bonus, the splicing code in takeover_tasklets() no longer has to
iterate over the list.
Signed-off-by: Olof Johansson <olof@lixom.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:44:58 +0000 (19:44 +0200)]
sched: rt-group: smp balancing
Currently the rt group scheduling does a per cpu runtime limit, however
the rt load balancer makes no guarantees about an equal spread of real-
time tasks, just that at any one time, the highest priority tasks run.
Solve this by making the runtime limit a global property by borrowing
excessive runtime from the other cpus once the local limit runs out.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:44:57 +0000 (19:44 +0200)]
sched: rt-group: synchonised bandwidth period
Various SMP balancing algorithms require that the bandwidth period
run in sync.
Possible improvements are moving the rt_bandwidth thing into root_domain
and keeping a span per rt_bandwidth which marks throttled cpus.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Wed, 27 Feb 2008 13:05:10 +0000 (14:05 +0100)]
time: add ns_to_ktime()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Mon, 18 Feb 2008 12:39:37 +0000 (13:39 +0100)]
sched: fix regression with sched yield
Balbir Singh reported:
> 1:mon> t
> [
c0000000e7677da0]
c000000000067de0 .sys_sched_yield+0x6c/0xbc
> [
c0000000e7677e30]
c000000000008748 syscall_exit+0x0/0x40
> --- Exception: c01 (System Call) at
00000400001d09e4
> SP (
4000664cb10) is in userspace
> 1:mon> r
> cpu 0x1: Vector: 300 (Data Access) at [
c0000000e7677aa0]
> pc:
c000000000068e50: .yield_task_fair+0x94/0xc4
> lr:
c000000000067de0: .sys_sched_yield+0x6c/0xbc
the check that should have avoided that is:
/*
* Are we the only task in the tree?
*/
if (unlikely(rq->load.weight == curr->se.load.weight))
return;
But I guess that overlooks rt tasks, they also increase the load.
So I guess something like this ought to fix it..
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dmitry Adamushko [Sun, 17 Feb 2008 21:34:07 +0000 (22:34 +0100)]
latencytop: optimize LT_BACKTRACEDEPTH loops a bit
There is no need to loop any longer when 'same == 0'.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Fri, 14 Mar 2008 15:09:59 +0000 (16:09 +0100)]
sched: remove sysctl_sched_batch_wakeup_granularity
it's unused.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Wed, 19 Mar 2008 00:37:10 +0000 (01:37 +0100)]
sched: reenable sync wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Mon, 17 Mar 2008 08:36:53 +0000 (09:36 +0100)]
sched: cache hot buddy
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Wed, 19 Mar 2008 00:39:19 +0000 (01:39 +0100)]
sched: feat affine wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Sun, 16 Mar 2008 19:03:22 +0000 (20:03 +0100)]
sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off
turn off sync wakeups by default. They are not needed anymore - the
buddy logic should be smart enough to keep the system from
overscheduling.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Sat, 19 Apr 2008 17:44:57 +0000 (19:44 +0200)]
sched: fix wakeup granularity for buddies
The wakeup buddy logic didn't use the same wakeup granularity logic as the
wakeup preemption did, this might cause the ->next buddy to be selected past
the point where we would have preempted had the task been a single running
instance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Guillaume Chazarain [Sat, 19 Apr 2008 17:44:57 +0000 (19:44 +0200)]
sched: fix rq->clock overflows detection with CONFIG_NO_HZ
When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
We check that the number of skipped ticks matches the clock jump seen in
__update_rq_clock().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reynes Philippe [Mon, 17 Mar 2008 23:19:05 +0000 (16:19 -0700)]
sched: sched.c needs tick.h
kernel/sched.c:506: erreur: implicit declaration of function tick_get_tick_sched
kernel/sched.c:506: erreur: invalid type argument of ->
kernel/sched.c:506: erreur: NOHZ_MODE_INACTIVE undeclared (first use in this function)
kernel/sched.c:506: erreur: (Each undeclared identifier is reported only once
kernel/sched.c:506: erreur: for each function it appears in.)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Thu, 28 Feb 2008 20:00:21 +0000 (21:00 +0100)]
sched: make cpu_clock() globally synchronous
Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.
Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Mon, 14 Apr 2008 06:53:32 +0000 (08:53 +0200)]
sched: re-do "sched: fix fair sleepers"
re-apply:
| commit
e22ecef1d2658ba54ed7d3fdb5d60829fb434c23
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Fri Mar 14 22:16:08 2008 +0100
|
| sched: fix fair sleepers
|
| Fair sleepers need to scale their latency target down by runqueue
| weight. Otherwise busy systems will gain ever larger sleep bonus.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus Torvalds [Sat, 19 Apr 2008 01:18:30 +0000 (18:18 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
security: fix up documentation for security_module_enable
Security: Introduce security= boot parameter
Audit: Final renamings and cleanup
SELinux: use new audit hooks, remove redundant exports
Audit: internally use the new LSM audit hooks
LSM/Audit: Introduce generic Audit LSM hooks
SELinux: remove redundant exports
Netlink: Use generic LSM hook
Audit: use new LSM hooks instead of SELinux exports
SELinux: setup new inode/ipc getsecid hooks
LSM: Introduce inode_getsecid and ipc_getsecid hooks
Linus Torvalds [Sat, 19 Apr 2008 01:02:35 +0000 (18:02 -0700)]
Merge git://git./linux/kernel/git/davem/net-2.6.26
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26: (1090 commits)
[NET]: Fix and allocate less memory for ->priv'less netdevices
[IPV6]: Fix dangling references on error in fib6_add().
[NETLABEL]: Fix NULL deref in netlbl_unlabel_staticlist_gen() if ifindex not found
[PKT_SCHED]: Fix datalen check in tcf_simp_init().
[INET]: Uninline the __inet_inherit_port call.
[INET]: Drop the inet_inherit_port() call.
SCTP: Initialize partial_bytes_acked to 0, when all of the data is acked.
[netdrvr] forcedeth: internal simplifications; changelog removal
phylib: factor out get_phy_id from within get_phy_device
PHY: add BCM5464 support to broadcom PHY driver
cxgb3: Fix __must_check warning with dev_dbg.
tc35815: Statistics cleanup
natsemi: fix MMIO for PPC 44x platforms
[TIPC]: Cleanup of TIPC reference table code
[TIPC]: Optimized initialization of TIPC reference table
[TIPC]: Remove inlining of reference table locking routines
e1000: convert uint16_t style integers to u16
ixgb: convert uint16_t style integers to u16
sb1000.c: make const arrays static
sb1000.c: stop inlining largish static functions
...
James Morris [Fri, 7 Mar 2008 01:23:49 +0000 (12:23 +1100)]
security: fix up documentation for security_module_enable
security_module_enable() can only be called during kernel init.
Signed-off-by: James Morris <jmorris@namei.org>
Ahmed S. Darwish [Thu, 6 Mar 2008 16:09:10 +0000 (18:09 +0200)]
Security: Introduce security= boot parameter
Add the security= boot parameter. This is done to avoid LSM
registration clashes in case of more than one bult-in module.
User can choose a security module to enable at boot. If no
security= boot parameter is specified, only the first LSM
asking for registration will be loaded. An invalid security
module name will be treated as if no module has been chosen.
LSM modules must check now if they are allowed to register
by calling security_module_enable(ops) first. Modify SELinux
and SMACK to do so.
Do not let SMACK register smackfs if it was not chosen on
boot. Smackfs assumes that smack hooks are registered and
the initial task security setup (swapper->security) is done.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Ahmed S. Darwish [Fri, 18 Apr 2008 23:59:43 +0000 (09:59 +1000)]
Audit: Final renamings and cleanup
Rename the se_str and se_rule audit fields elements to
lsm_str and lsm_rule to avoid confusion.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Ahmed S. Darwish [Sat, 1 Mar 2008 20:03:14 +0000 (22:03 +0200)]
SELinux: use new audit hooks, remove redundant exports
Setup the new Audit LSM hooks for SELinux.
Remove the now redundant exported SELinux Audit interface.
Audit: Export 'audit_krule' and 'audit_field' to the public
since their internals are needed by the implementation of the
new LSM hook 'audit_rule_known'.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Ahmed S. Darwish [Sat, 1 Mar 2008 20:01:11 +0000 (22:01 +0200)]
Audit: internally use the new LSM audit hooks
Convert Audit to use the new LSM Audit hooks instead of
the exported SELinux interface.
Basically, use:
security_audit_rule_init
secuirty_audit_rule_free
security_audit_rule_known
security_audit_rule_match
instad of (respectively) :
selinux_audit_rule_init
selinux_audit_rule_free
audit_rule_has_selinux
selinux_audit_rule_match
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Ahmed S. Darwish [Sat, 1 Mar 2008 20:00:05 +0000 (22:00 +0200)]
LSM/Audit: Introduce generic Audit LSM hooks
Introduce a generic Audit interface for security modules
by adding the following new LSM hooks:
audit_rule_init(field, op, rulestr, lsmrule)
audit_rule_known(krule)
audit_rule_match(secid, field, op, rule, actx)
audit_rule_free(rule)
Those hooks are only available if CONFIG_AUDIT is enabled.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Ahmed S. Darwish [Sat, 1 Mar 2008 19:58:32 +0000 (21:58 +0200)]
SELinux: remove redundant exports
Remove the following exported SELinux interfaces:
selinux_get_inode_sid(inode, sid)
selinux_get_ipc_sid(ipcp, sid)
selinux_get_task_sid(tsk, sid)
selinux_sid_to_string(sid, ctx, len)
They can be substitued with the following generic equivalents
respectively:
new LSM hook, inode_getsecid(inode, secid)
new LSM hook, ipc_getsecid*(ipcp, secid)
LSM hook, task_getsecid(tsk, secid)
LSM hook, sid_to_secctx(sid, ctx, len)
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Ahmed S. Darwish [Sat, 1 Mar 2008 19:56:22 +0000 (21:56 +0200)]
Netlink: Use generic LSM hook
Don't use SELinux exported selinux_get_task_sid symbol.
Use the generic LSM equivalent instead.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Ahmed S. Darwish [Sat, 1 Mar 2008 19:54:38 +0000 (21:54 +0200)]
Audit: use new LSM hooks instead of SELinux exports
Stop using the following exported SELinux interfaces:
selinux_get_inode_sid(inode, sid)
selinux_get_ipc_sid(ipcp, sid)
selinux_get_task_sid(tsk, sid)
selinux_sid_to_string(sid, ctx, len)
kfree(ctx)
and use following generic LSM equivalents respectively:
security_inode_getsecid(inode, secid)
security_ipc_getsecid*(ipcp, secid)
security_task_getsecid(tsk, secid)
security_sid_to_secctx(sid, ctx, len)
security_release_secctx(ctx, len)
Call security_release_secctx only if security_secid_to_secctx
succeeded.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Ahmed S. Darwish [Sat, 1 Mar 2008 19:52:30 +0000 (21:52 +0200)]
SELinux: setup new inode/ipc getsecid hooks
Setup the new inode_getsecid and ipc_getsecid() LSM hooks
for SELinux.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Ahmed S. Darwish [Sat, 1 Mar 2008 19:51:09 +0000 (21:51 +0200)]
LSM: Introduce inode_getsecid and ipc_getsecid hooks
Introduce inode_getsecid(inode, secid) and ipc_getsecid(ipcp, secid)
LSM hooks. These hooks will be used instead of similar exported
SELinux interfaces.
Let {inode,ipc,task}_getsecid hooks set the secid to 0 by default
if CONFIG_SECURITY is not defined or if the hook is set to
NULL (dummy). This is done to notify the caller that no valid
secid exists.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Alexey Dobriyan [Fri, 18 Apr 2008 22:43:32 +0000 (15:43 -0700)]
[NET]: Fix and allocate less memory for ->priv'less netdevices
This patch effectively reverts commit
d0498d9ae1a5cebac363e38907266d5cd2eedf89
aka "[NET]: Do not allocate unneeded memory for dev->priv alignment."
It was found to be buggy because of final unconditional += NETDEV_ALIGN_CONST
removal.
For example, for sizeof(struct net_device) being 2048 bytes, "alloc_size"
was also 2048 bytes, but allocator with debugging options turned on started
giving out !32-byte aligned memory resulting in redzones overwrites.
Patch does small optimization in ->priv'less case: bumping size to next
32-byte boundary was always done to ensure ->priv will also be aligned.
But, no ->priv, no need to do that.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingo Molnar [Fri, 18 Apr 2008 19:32:22 +0000 (21:32 +0200)]
x86 PAT: fix mmap() of holes
do not return a -EINVAL when mmap()-ing PCI holes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Linus Torvalds [Fri, 18 Apr 2008 18:25:31 +0000 (11:25 -0700)]
Merge git://git./linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (137 commits)
[SCSI] iscsi: bidi support for iscsi_tcp
[SCSI] iscsi: bidi support at the generic libiscsi level
[SCSI] iscsi: extended cdb support
[SCSI] zfcp: Fix error handling for blocked unit for send FCP command
[SCSI] zfcp: Remove zfcp_erp_wait from slave destory handler to fix deadlock
[SCSI] zfcp: fix 31 bit compile warnings
[SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
[SCSI] bsg: remove minor in struct bsg_device
[SCSI] bsg: use better helper list functions
[SCSI] bsg: replace kobject_get with blk_get_queue
[SCSI] bsg: takes a ref to struct device in fops->open
[SCSI] qla1280: remove version check
[SCSI] libsas: fix endianness bug in sas_ata
[SCSI] zfcp: fix compiler warning caused by poking inside new semaphore (linux-next)
[SCSI] aacraid: Do not describe check_reset parameter with its value
[SCSI] aacraid: Fix down_interruptible() to check the return value
[SCSI] sun3_scsi_vme: add MODULE_LICENSE
[SCSI] st: rename flush_write_buffer()
[SCSI] tgt: use KMEM_CACHE macro
[SCSI] initio: fix big endian problems for auto request sense
...
Linus Torvalds [Fri, 18 Apr 2008 18:24:29 +0000 (11:24 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/ieee1394/linux1394-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: (43 commits)
firewire: cleanups
firewire: fix synchronization of gap counts
firewire: wait until PHY configuration packet was transmitted (fix bus reset loop)
firewire: remove unused struct member
firewire: use bitwise and to get reg in handle_registers
firewire: replace more hex values with defined csr constants
firewire: reread config ROM when device reset the bus
firewire: replace static ROM cache by allocated cache
firewire: fw-ohci: work around generation bug in TI controllers (fix AV/C and more)
firewire: fw-ohci: extend logging of bus generations and node ID
firewire: fw-ohci: conditionally log busReset interrupts
firewire: fw-ohci: don't append to AT context when it's not active
firewire: fw-ohci: log regAccessFail events
firewire: fw-ohci: make sure HCControl register LPS bit is set
firewire: fw-ohci: missing PPC PMac feature calls in failure path
firewire: fw-ohci: untangle a mixed unsigned/signed expression
firewire: debug interrupt events
firewire: fw-ohci: catch self_id_count == 0
firewire: fw-ohci: add self ID error check
firewire: fw-ohci: refactor probe, remove, suspend, resume
...
James Bottomley [Fri, 18 Apr 2008 18:18:48 +0000 (13:18 -0500)]
libata: fix boot panic with SATAPI devices on non-SFF HBAs
The kernel now panics reliably on boot if you have a SATAPI device
connected.
The problem was introduced by the libata merge trying to pull out all
the SFF code into a separate module. Unfortunately, if you're a satapi
device you usually need to call atapi_request_sense, which has a bare
invocation of a SFF callback which is NULL on non-SFF HBAs. Fix this by
making the call conditional.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 18 Apr 2008 17:15:22 +0000 (10:15 -0700)]
Merge branch 'upstream-linus' of git://git./linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (64 commits)
ocfs2/net: Add debug interface to o2net
ocfs2: Only build ocfs2/dlm with the o2cb stack module
ocfs2/cluster: Get rid of arguments to the timeout routines
ocfs2: Put tree in MAINTAINERS
ocfs2: Use BUG_ON
ocfs2: Convert ocfs2 over to unlocked_ioctl
ocfs2: Improve rename locking
fs/ocfs2/aops.c: test for IS_ERR rather than 0
ocfs2: Add inode stealing for ocfs2_reserve_new_inode
ocfs2: Add ac_alloc_slot in ocfs2_alloc_context
ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits
ocfs2: Enable cross extent block merge.
ocfs2: Add support for cross extent block
ocfs2: Move /sys/o2cb to /sys/fs/o2cb
sysfs: Allow removal of symlinks in the sysfs root
ocfs2: Reconnect after idle time out.
ocfs2/dlm: Cleanup lockres print
ocfs2/dlm: Fix lockname in lockres print function
ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c
ocfs2/dlm: Dumps the purgelist into a debugfs file
...
Linus Torvalds [Fri, 18 Apr 2008 17:02:46 +0000 (10:02 -0700)]
Merge git://git./linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (49 commits)
[GFS2] fix assertion in log_refund()
[GFS2] fix GFP_KERNEL misuses
[GFS2] test for IS_ERR rather than 0
[GFS2] Invalidate cache at correct point
[GFS2] fs/gfs2/recovery.c: suppress warnings
[GFS2] Faster gfs2_bitfit algorithm
[GFS2] Streamline quota lock/check for no-quota case
[GFS2] Remove drop of module ref where not needed
[GFS2] gfs2_adjust_quota has broken unstuffing code
[GFS2] possible null pointer dereference fixup
[GFS2] Need to ensure that sector_t is 64bits for GFS2
[GFS2] re-support special inode
[GFS2] remove gfs2_dev_iops
[GFS2] fix file_system_type leak on gfs2meta mount
[GFS2] Allow bmap to allocate extents
[GFS2] Fix a page lock / glock deadlock
[GFS2] proper extern for gfs2/locking/dlm/mount.c:gdlm_ops
[GFS2] gfs2/ops_file.c should #include "ops_inode.h"
[GFS2] be*_add_cpu conversion
[GFS2] Fix bug where we called drop_bh incorrectly
...
Harvey Harrison [Fri, 18 Apr 2008 16:54:38 +0000 (09:54 -0700)]
x86: kgdb build fix
TF_MASK is no longer defined, use X86_EFLAGS_TF.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Boaz Harrosh [Fri, 18 Apr 2008 15:11:53 +0000 (10:11 -0500)]
[SCSI] iscsi: bidi support for iscsi_tcp
access the right scsi_in() and/or scsi_out() side of things.
also for resid
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Reviewed-by: Pete Wyckoff <pw@osc.edu>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Boaz Harrosh [Fri, 18 Apr 2008 15:11:52 +0000 (10:11 -0500)]
[SCSI] iscsi: bidi support at the generic libiscsi level
- prepare the additional bidi_read rlength header.
- access the right scsi_in() and/or scsi_out() side of things.
also for resid.
- Handle BIDI underflow overflow from target
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Reviewed-by: Pete Wyckoff <pw@osc.edu>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Boaz Harrosh [Fri, 18 Apr 2008 15:11:51 +0000 (10:11 -0500)]
[SCSI] iscsi: extended cdb support
Support for extended CDBs in iscsi.
All we need is to check if command spills over 16 bytes then allocate
an iscsi-extended-header for the leftovers.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Reviewed-by: Pete Wyckoff <pw@osc.edu>
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Christof Schmitt [Fri, 18 Apr 2008 10:51:57 +0000 (12:51 +0200)]
[SCSI] zfcp: Fix error handling for blocked unit for send FCP command
In the case the unit is blocked, zfcp_unit_get has not been called
yet, so the error handling path should not call zfcp_unit_put.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Christof Schmitt [Fri, 18 Apr 2008 10:51:56 +0000 (12:51 +0200)]
[SCSI] zfcp: Remove zfcp_erp_wait from slave destory handler to fix deadlock
The testcase
# chchp -v 0 0.da && sleep 59 && chchp -v 1 0.da
results in this deadlock situation:
STACK TRACE FOR TASK: 0x7e9a2048 (zfcperp0.0.c613)
0 schedule+816 [0x356b3c]
1 schedule_timeout+172 [0x357340]
2 wait_for_common+192 [0x3565fc]
3 flush_cpu_workqueue+116 [0x52af0]
4 flush_workqueue+116 [0x533b8]
5 fc_remote_port_add+64 [0x1c83ec]
6 zfcp_erp_thread+4534 [0x26585a]
7 kernel_thread_starter+6 [0x195d2]
STACK TRACE FOR TASK: 0x7f8ec048 (fc_wq_0)
0 schedule+816 [0x356b3c]
1 zfcp_erp_wait+104 [0x264568]
2 zfcp_scsi_slave_destroy+64 [0x261b24]
3 __scsi_remove_device+154 [0x1c24ba]
4 scsi_remove_device+62 [0x1c2512]
5 __scsi_remove_target+198 [0x1c25ea]
6 __remove_child+58 [0x1c26d6]
7 device_for_each_child+66 [0x1ab566]
8 scsi_remove_target+98 [0x1c268a]
9 run_workqueue+200 [0x5272c]
10 worker_thread+146 [0x52882]
11 kthread+140 [0x58360]
12 kernel_thread_starter+6 [0x195d2]
Remove the zfcp_erp_wait call that is not required here to prevent the
deadlock situation.
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Martin Peschke [Fri, 18 Apr 2008 10:51:55 +0000 (12:51 +0200)]
[SCSI] zfcp: fix 31 bit compile warnings
drivers/s390/scsi/zfcp_aux.c: In function ‘zfcp_fsf_incoming_els_rscn’:
drivers/s390/scsi/zfcp_aux.c:1379: warning: cast from pointer to integer of
different size
drivers/s390/scsi/zfcp_aux.c: In function ‘zfcp_fsf_incoming_els_plogi’:
drivers/s390/scsi/zfcp_aux.c:1432: warning: cast from pointer to integer of
different size
drivers/s390/scsi/zfcp_aux.c: In function ‘zfcp_fsf_incoming_els_logo’:
drivers/s390/scsi/zfcp_aux.c:1457: warning: cast from pointer to integer of
different size
..
Just passing pointers rids us of these warnings and improves readability.
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
FUJITA Tomonori [Mon, 31 Mar 2008 01:03:42 +0000 (10:03 +0900)]
[SCSI] bsg: no need to set BSG_F_BLOCK bit in bsg_complete_all_commands
Before bsg_complete_all_commands is called, BSG_F_BLOCK bit is always
set.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
FUJITA Tomonori [Mon, 31 Mar 2008 01:03:41 +0000 (10:03 +0900)]
[SCSI] bsg: remove minor in struct bsg_device
minor in struct bsg_device is used as identifier to find the
corresponding struct bsg_device_class. However, request_queuse can be
used as identifier for that and the minor in struct bsg_device is
unnecessary.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
FUJITA Tomonori [Mon, 31 Mar 2008 01:03:40 +0000 (10:03 +0900)]
[SCSI] bsg: use better helper list functions
This replace hlist_for_each and list_entry with hlist_for_each_entry
and list_first_entry respectively.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
FUJITA Tomonori [Mon, 31 Mar 2008 01:03:39 +0000 (10:03 +0900)]
[SCSI] bsg: replace kobject_get with blk_get_queue
Both takes a ref to a queue. But blk_get_queue checks QUEUE_FLAG_DEAD
and is more appropriate interface here.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
FUJITA Tomonori [Mon, 31 Mar 2008 01:03:38 +0000 (10:03 +0900)]
[SCSI] bsg: takes a ref to struct device in fops->open
bsg_register_queue() takes a ref to struct device that a caller
passes. For example, bsg takes a ref to the sdev_gendev for scsi
devices. However, bsg doesn't inrease the refcount in fops->open. So
while an application opens a bsg device, the scsi device that the bsg
device holds can go away (bsg also takes a ref to a queue, but it
doesn't prevent the device from going away).
With this patch, bsg increases the refcount of struct device in
fops->open and decreases it in fops->release.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Linus Torvalds [Fri, 18 Apr 2008 16:44:55 +0000 (09:44 -0700)]
Merge branch 'release' of git://git./linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: (27 commits)
[IA64] kdump: Add crash_save_vmcoreinfo for INIT
[IA64] Fix NUMA configuration issue
[IA64] Itanium Spec updates
[IA64] Untangle sync_icache_dcache() page size determination
[IA64] arch/ia64/kernel/: use time_* macros
[IA64] remove redundant display of free swap space in show_mem()
[IA64] make IOMMU respect the segment boundary limits
[IA64] kprobes: kprobe-booster for ia64
[IA64] fix getpid and set_tid_address fast system calls for pid namespaces
[IA64] Replace explicit jiffies tests with time_* macros.
[IA64] use goto to jump out do/while_each_thread
[IA64] Fix unlock ordering in smp_callin
[IA64] pgd_offset() constfication.
[IA64] kdump: crash.c coding style fix
[IA64] kdump: add kdump_on_fatal_mca
[IA64] Minimize per_cpu reservations.
[IA64] Correct pernodesize calculation.
[IA64] Kernel parameter for max number of concurrent global TLB purges
[IA64] Multiple outstanding ptc.g instruction support
[IA64] Implement smp_call_function_mask for ia64
...
Sunil Mushran [Mon, 14 Apr 2008 17:46:19 +0000 (10:46 -0700)]
ocfs2/net: Add debug interface to o2net
This patch exposes o2net information via debugfs. The information includes
the list of sockets (sock_containers) as well as the list of outstanding
messages (send_tracking). Useful for o2dlm debugging.
(This patch is derived from an earlier one written by Zach Brown that
exposed the same information via /proc.)
[Mark: checkpatch fixes]
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Mark Fasheh [Fri, 4 Apr 2008 19:45:55 +0000 (12:45 -0700)]
ocfs2: Only build ocfs2/dlm with the o2cb stack module
fs/ocfs2/dlm/ocfs2_dlm.ko and fs/ocfs2/dlm/ocfs2_dlmfs.ko get built if
CONFIG_FS_OCFS2 is specified. This isn't quite how it should happen any more
- the "o2cb" dlm modules should only be built if CONFIG_FS_OCFS2_O2CB is
set, so update the dlm Makefile accordingly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Jeff Mahoney [Fri, 28 Mar 2008 23:44:13 +0000 (16:44 -0700)]
ocfs2/cluster: Get rid of arguments to the timeout routines
We keep seeing bug reports related to NULL pointer derefs in
o2net_set_nn_state(). When I originally wrote up the configurable timeout
patch, I had tried to plan for multiple clusters. This was silly.
The timeout routines all use o2nm_single_cluster so there's no point in
passing an argument at all. This patch removes the arguments and kills those
bugs dead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Joel Becker [Sun, 23 Mar 2008 05:08:07 +0000 (22:08 -0700)]
ocfs2: Put tree in MAINTAINERS
The ocfs2 MAINTAINERS entry should have the git tree URL.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Julia Lawall [Tue, 4 Mar 2008 23:21:05 +0000 (15:21 -0800)]
ocfs2: Use BUG_ON
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@ disable unlikely @ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)
@@ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Andi Kleen [Sun, 27 Jan 2008 02:17:17 +0000 (03:17 +0100)]
ocfs2: Convert ocfs2 over to unlocked_ioctl
As far as I can see there is nothing in ocfs2_ioctl that requires the BKL,
so use unlocked_ioctl
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Jan Kara [Thu, 21 Feb 2008 17:00:00 +0000 (18:00 +0100)]
ocfs2: Improve rename locking
ocfs2_rename() was being too aggressive with the rename lock - we only need
it for certain forms of directory rename.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Julia Lawall [Fri, 28 Mar 2008 21:43:10 +0000 (14:43 -0700)]
fs/ocfs2/aops.c: test for IS_ERR rather than 0
The function ocfs2_start_trans always returns either a valid pointer or a
value made with ERR_PTR, so its result should be tested with IS_ERR, not
with a test for 0.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>