git.openwrt.org Git - openwrt/staging/blogic.git/log

percpu: add __read_mostly to variables which are mostly read only

Most global variables in percpu allocator are initialized during boot
and read only from that point on. Add __read_mostly as per Rusty's
suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>

x86: add remapping percpu first chunk allocator

Impact: add better first percpu allocation for NUMA

On NUMA, embedding allocator can't be used as different units can't be
made to fall in the correct NUMA nodes.  To use large page mapping,
each unit needs to be remapped.  However, percpu areas are usually
much smaller than large page size and unused space hurts a lot as the
number of cpus grow.  This allocator remaps large pages for each chunk
but gives back unused part to the bootmem allocator making the large
pages mapped twice.

This adds slightly to the TLB pressure but is much better than using
4k mappings while still being NUMA-friendly.

Ingo suggested that this would be the correct approach for NUMA.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>

x86: add embedding percpu first chunk allocator

Impact: add better first percpu allocation for !NUMA

On !NUMA, we can simply allocate contiguous memory and use it for the
first chunk without mapping it into vmalloc area. As the memory area
is covered by the large page physical memory mapping, it allows the
dynamic perpcu allocator to not add any TLB overhead for the static
percpu area and whatever falls into the first chunk and the
implementation is very simple too.

Signed-off-by: Tejun Heo <tj@kernel.org>

x86: separate out setup_pcpu_4k() from setup_per_cpu_areas()

Impact: modularize percpu first chunk allocation

x86 is gonna have a few different strategies for the first chunk
allocation. Modularize it by separating out the current allocation
mechanism into pcpu_alloc_bootmem() and setup_pcpu_4k().

Signed-off-by: Tejun Heo <tj@kernel.org>

percpu: give more latitude to arch specific first chunk initialization

Impact: more latitude for first percpu chunk allocation

The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().

It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping.  This patch makes
the following changes to allow this.

* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
  to reserve in the first chunk for further dynamic allocation.

* Rename pcpu_setup_static() to pcpu_setup_first_chunk().

* Make pcpu_setup_first_chunk() much more flexible by fetching page
  pointer by callback and adding optional @unit_size, @free_size and
  @base_addr arguments which allow archs to selectively part of chunk
  initialization to their likings.

Signed-off-by: Tejun Heo <tj@kernel.org>

percpu: remove unit_size power-of-2 restriction

Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE

In dynamic percpu allocator, there is no reason the unit size should
be power of two. Remove the restriction.

As non-power-of-two unit size means that empty chunks fall into the
same slot index as lightly occupied chunks which is bad for reclaming.
Reserve an extra slot for empty chunks.

Signed-off-by: Tejun Heo <tj@kernel.org>

x86: update populate_extra_pte() and add populate_extra_pmd()

Impact: minor change to populate_extra_pte() and addition of pmd flavor

Update populate_extra_pte() to return pointer to the pte_t for the
specified address and add populate_extra_pmd() which only populates
till the pmd and returns pointer to the pmd entry for the address.

For 64bit, pud/pmd/pte fill functions are separated out from
set_pte_vaddr[_pud]() and used for set_pte_vaddr[_pud]() and
populate_extra_{pte|pmd}().

Signed-off-by: Tejun Heo <tj@kernel.org>

vmalloc: add @align to vm_area_register_early()

Impact: allow larger alignment for early vmalloc area allocation

Some early vmalloc users might want larger alignment, for example, for
custom large page mapping. Add @align to vm_area_register_early().
While at it, drop docbook comment on non-existent @size.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>

bootmem: reorder interface functions and add a missing one

Impact: cleanup and addition of missing interface wrapper

The interface functions in bootmem.h was ordered in not so orderly
manner. Reorder them such that

* functions allocating the same area group together -
ie. alloc_bootmem group and alloc_bootmem_low group.

* functions w/o node parameter come before the ones w/ node parameter.

* nopanic variants are immediately below their panicky counterparts.

While at it, add alloc_bootmem_pages_node_nopanic() which was missing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@saeurebad.de>

bootmem: clean up arch-specific bootmem wrapping

Impact: cleaner and consistent bootmem wrapping

By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
arch-specific wrappers for bootmem allocation.  However, this is done
a bit strangely in that only the high level convenience macros can be
changed while lower level, but still exported, interface functions
can't be wrapped.  This not only is messy but also leads to strange
situation where alloc_bootmem() does what the arch wants it to do but
the equivalent __alloc_bootmem() call doesn't although they should be
able to be used interchangeably.

This patch updates bootmem such that archs can override / wrap the
backend function - alloc_bootmem_core() instead of the highlevel
interface functions to allow simpler and consistent wrapping.  Also,
HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@saeurebad.de>

percpu: fix pcpu_chunk_struct_size

Impact: fix short allocation leading to memory corruption

While dropping rvalue wrapping macros around global parameters,
pcpu_chunk_struct_size was set incorrectly resulting in shorter page
pointer array. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>

percpu: clean up size usage

Andrew was concerned about the unit of variables named or have suffix
size. Every usage in percpu allocator is in bytes but make it super
clear by adding comments.

While at it, make pcpu_depopulate_chunk() take int @off and @size like
everyone else.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

x86: convert to the new dynamic percpu allocator

Impact: use new dynamic allocator, unified access to static/dynamic
percpu memory

Convert to the new dynamic percpu allocator.

* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA

Signed-off-by: Tejun Heo <tj@kernel.org>

percpu: implement new dynamic percpu allocator

Impact: new scalable dynamic percpu allocator which allows dynamic
        percpu areas to be accessed the same way as static ones

Implement scalable dynamic percpu allocator which can be used for both
static and dynamic percpu areas.  This will allow static and dynamic
areas to share faster direct access methods.  This feature is optional
and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
arch.  Please read comment on top of mm/percpu.c for details.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

vmalloc: add un/map_kernel_range_noflush()

Impact: two more public map/unmap functions

Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
These functions respectively map and unmap address range in kernel VM
area but doesn't do any vcache or tlb flushing. These will be used by
new percpu allocator.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

vmalloc: implement vm_area_register_early()

Impact: allow multiple early vm areas

There are places where kernel VM area needs to be allocated before
vmalloc is initialized.  This is done by allocating static vm_struct,
initializing several fields and linking it to vmlist and later vmalloc
initialization picking up these from vmlist.  This is currently done
manually and if there's more than one such areas, there's no defined
way to arbitrate who gets which address.

This patch implements vm_area_register_early(), which takes vm_area
struct with flags and size initialized, assigns address to it and puts
it on the vmlist.  This way, multiple early vm areas can determine
which addresses they should use.  The only current user - alpha mm
init - is converted to use it.

Signed-off-by: Tejun Heo <tj@kernel.org>

percpu: kill percpu_alloc() and friends

Impact: kill unused functions

percpu_alloc() and its friends never saw much action. It was supposed
to replace the cpu-mask unaware __alloc_percpu() but it never happened
and in fact __percpu_alloc_mask() itself never really grew proper
up/down handling interface either (no exported interface for
populate/depopulate).

percpu allocation is about to go through major reimplementation and
there's no reason to carry this unused interface around. Replace it
with __alloc_percpu() and free_percpu().

Signed-off-by: Tejun Heo <tj@kernel.org>

alloc_percpu: add align argument to __alloc_percpu.

This prepares for a real __alloc_percpu, by adding an alignment argument.
Only one place uses __alloc_percpu directly, and that's for a string.

tj: af_inet also uses __alloc_percpu(), update it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>

alloc_percpu: change percpu_ptr to per_cpu_ptr

Impact: cleanup

There are two allocated per-cpu accessor macros with almost identical
spelling. The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.

tj: kill percpu_ptr() and update UP too

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>

module: reorder module pcpu related functions

Impact: cleanup

Move percpu_modinit() upwards. This is to ease further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>

vmalloc: call flush_cache_vunmap() from unmap_kernel_range()

Impact: proper vcache flush on unmap_kernel_range()

flush_cache_vunmap() should be called before pages are unmapped. Add
a call to it in unmap_kernel_range().

Signed-off-by: Tejun Heo <tj@kernel.org>

x86: use percpu data for 4k hardirq and softirq stacks

Impact: economize memory for large NR_CPUS

percpu data is setup earlier than irq, we can use percpu data
to economize memory.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

x86: UV: fix header struct usage

Impact: Fixes warning

Fix uv.h struct usage:

arch/x86/include/asm/uv/uv.h:16: warning: 'struct mm_struct' declared inside parameter list
arch/x86/include/asm/uv/uv.h:16: warning: its scope is only this definition or declaration, which is probably not what you want

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

x86: merge sys_rt_sigreturn between 32 and 64 bits

Impact: cleanup

With the recent changes in the 32-bit code to make system calls which
use struct pt_regs take a pointer, sys_rt_sigreturn() have become
identical between 32 and 64 bits, and both are empty wrappers around
do_rt_sigreturn(). Remove both wrappers and rename both to
sys_rt_sigreturn().

Cc: Brian Gerst <brgerst@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

x86: use regparm(3) for passed-in pt_regs pointer

Some syscalls need to access the pt_regs structure, either to copy
user register state or to modifiy it. This patch adds stubs to load
the address of the pt_regs struct into the %eax register, and changes
the syscalls to take the pointer as an argument instead of relying on
the assumption that the pt_regs structure overlaps the function
arguments.

Drop the use of regparm(1) due to concern about gcc bugs, and to move
in the direction of the eventual removal of regparm(0) for asmlinkage.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

SGI IA64 UV: fix ia64 build error in the linux-next tree

Fix the ia64 build error that occurs in the linux-next tree by introducing
an ia64 version of uv.h.

Additionally, clean up the usage of is_uv_system().

Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: drop -fno-stack-protector annotations after pt_regs fixes

Now that no functions rely on struct pt_regs being passed by value,
various "no stack protector" annotations can be dropped.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: pass in pt_regs pointer for syscalls that need it

Some syscalls need to access the pt_regs structure, either to copy
user register state or to modifiy it. This patch adds stubs to load
the address of the pt_regs struct into the %eax register, and changes
the syscalls to regparm(1) to receive the pt_regs pointer as the
first argument.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: use pt_regs pointer in do_device_not_available()

The generic exception handler (error_code) passes in the pt_regs
pointer and the error code (unused in this case). The commit
"x86: fix math_emu register frame access" changed this to pass by
value, which doesn't work correctly with stack protector enabled.
Change it back to use the pt_regs pointer.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

stackprotector: fix multi-word cross-builds

Stackprotector builds were failing if CROSS_COMPILER was more than
a single world (such as when distcc was used) - because the check
scripts used $1 instead of $*.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: fix x86_32 stack protector bugs

Impact: fix x86_32 stack protector

Brian Gerst found out that %gs was being initialized to stack_canary
instead of stack_canary - 20, which basically gave the same canary
value for all threads. Fixing this also exposed the following bugs.

* cpu_idle() didn't call boot_init_stack_canary()

* stack canary switching in switch_to() was being done too late making
the initial run of a new thread use the old stack canary value.

Fix all of them and while at it update comment in cpu_idle() about
calling boot_init_stack_canary().

Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: implement x86_32 stack protector

Impact: stack protector for x86_32

Implement stack protector for x86_32.  GDT entry 28 is used for it.
It's set to point to stack_canary-20 and have the length of 24 bytes.
CONFIG_CC_STACKPROTECTOR turns off CONFIG_X86_32_LAZY_GS and sets %gs
to the stack canary segment on entry.  As %gs is otherwise unused by
the kernel, the canary can be anywhere.  It's defined as a percpu
variable.

x86_32 exception handlers take register frame on stack directly as
struct pt_regs.  With -fstack-protector turned on, gcc copies the
whole structure after the stack canary and (of course) doesn't copy
back on return thus losing all changed.  For now, -fno-stack-protector
is added to all files which contain those functions.  We definitely
need something better.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: make lazy %gs optional on x86_32

Impact: pt_regs changed, lazy gs handling made optional, add slight
        overhead to SAVE_ALL, simplifies error_code path a bit

On x86_32, %gs hasn't been used by kernel and handled lazily.  pt_regs
doesn't have place for it and gs is saved/loaded only when necessary.
In preparation for stack protector support, this patch makes lazy %gs
handling optional by doing the followings.

* Add CONFIG_X86_32_LAZY_GS and place for gs in pt_regs.

* Save and restore %gs along with other registers in entry_32.S unless
  LAZY_GS.  Note that this unfortunately adds "pushl $0" on SAVE_ALL
  even when LAZY_GS.  However, it adds no overhead to common exit path
  and simplifies entry path with error code.

* Define different user_gs accessors depending on LAZY_GS and add
  lazy_save_gs() and lazy_load_gs() which are noop if !LAZY_GS.  The
  lazy_*_gs() ops are used to save, load and clear %gs lazily.

* Define ELF_CORE_COPY_KERNEL_REGS() which always read %gs directly.

xen and lguest changes need to be verified.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: add %gs accessors for x86_32

Impact: cleanup

On x86_32, %gs is handled lazily.  It's not saved and restored on
kernel entry/exit but only when necessary which usually is during task
switch but there are few other places.  Currently, it's done by
calling savesegment() and loadsegment() explicitly.  Define
get_user_gs(), set_user_gs() and task_user_gs() and use them instead.

While at it, clean up register access macros in signal.c.

This cleans up code a bit and will help future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: use asm .macro instead of cpp #define in entry_32.S

Impact: cleanup

Use .macro instead of cpp #define where approriate. This cleans up
code and will ease future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: no stack protector for vdso

Impact: avoid crash on vsyscall

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

stackprotector: update make rules

Impact: no default -fno-stack-protector if stackp is enabled, cleanup

Stackprotector make rules had the following problems.

* cc support test and warning are scattered across makefile and
kernel/panic.c.

* -fno-stack-protector was always added regardless of configuration.

Update such that cc support test and warning are contained in makefile
and -fno-stack-protector is added iff stackp is turned off. While at
it, prepare for 32bit support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: stackprotector.h misc update

Impact: misc udpate

* wrap content with CONFIG_CC_STACK_PROTECTOR so that other arch files
can include it directly

* add missing includes

This will help future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

elf: add ELF_CORE_COPY_KERNEL_REGS()

ELF core dump is used for both user land core dump and kernel crash
dump. Depending on architecture, register might need to be accessed
differently for userland and kernel. Allow architectures to define
ELF_CORE_COPY_KERNEL_REGS() and use different operation for kernel
register dump.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Merge branch 'x86/urgent' into core/percpu

Conflicts:
arch/x86/kernel/acpi/boot.c

Merge branch 'x86/uaccess' into core/percpu

x86: fix math_emu register frame access

do_device_not_available() is the handler for #NM and it declares that
it takes a unsigned long and calls math_emu(), which takes a long
argument and surprisingly expects the stack frame starting at the zero
argument would match struct math_emu_info, which isn't true regardless
of configuration in the current code.

This patch makes do_device_not_available() take struct pt_regs like
other exception handlers and initialize struct math_emu_info with
pointer to it and pass pointer to the math_emu_info to math_emulate()
like normal C functions do. This way, unless gcc makes a copy of
struct pt_regs in do_device_not_available(), the register frame is
correctly accessed regardless of kernel configuration or compiler
used.

This doesn't fix all math_emu problems but it at least gets it
somewhat working.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Merge commit 'v2.6.29-rc4' into core/percpu

Conflicts:
arch/x86/mach-voyager/voyager_smp.c
arch/x86/mm/fault.c

x86: math_emu info cleanup

Impact: cleanup

* Come on, struct info? s/struct info/struct math_emu_info/

* Use struct pt_regs and kernel_vm86_regs instead of defining its own
register frame structure.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: include correct %gs in a.out core dump

Impact: dump the correct %gs into a.out core dump

aout_dump_thread() read %gs but didn't include it in core dump. Fix
it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86, vmi: put a missing paravirt_release_pmd in pgd_dtor

Commit 6194ba6ff6ccf8d5c54c857600843c67aa82c407 ("x86: don't special-case
pmd allocations as much") made changes to the way we handle pmd allocations,
and while doing that it dropped a call to paravirt_release_pd on the
pgd page from the pgd_dtor code path.

As a result of this missing release, the hypervisor is now unaware of the
pgd page being freed, and as a result it ends up tracking this page as a
page table page.

After this the guest may start using the same page for other purposes, and
depending on what use the page is put to, it may result in various performance
and/or functional issues ( hangs, reboots).

Since this release is only required for VMI, I now release the pgd page from
the (vmi)_pgd_free hook.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>

x86: find nr_irqs_gsi with mp_ioapic_routing

Impact: find right nr_irqs_gsi on some systems.

One test-system has gap between gsi's:

[    0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 4, version 0, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: IOAPIC (id[0x05] address[0xfeafd000] gsi_base[48])
[    0.000000] IOAPIC[1]: apic_id 5, version 0, address 0xfeafd000, GSI 48-54
[    0.000000] ACPI: IOAPIC (id[0x06] address[0xfeafc000] gsi_base[56])
[    0.000000] IOAPIC[2]: apic_id 6, version 0, address 0xfeafc000, GSI 56-62
...
[    0.000000] nr_irqs_gsi: 38

So nr_irqs_gsi is not right. some irq for MSI will overwrite with io_apic.

need to get that with acpi_probe_gsi when acpi io_apic is used

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: add clflush before monitor for Intel 7400 series

For Intel 7400 series CPUs, the recommendation is to use a clflush on the
monitored address just before monitor and mwait pair [1].

This clflush makes sure that there are no false wakeups from mwait when the
monitored address was recently written to.

[1] "MONITOR/MWAIT Recommendations for Intel Xeon Processor 7400 series"
section in specification update document of 7400 series
http://download.intel.com/design/xeon/specupdt/32033601.pdf

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: fix abuse of per_cpu_offset

Impact: bug fix

Don't use per_cpu_offset() to determine if it valid to access a
per-cpu variable for a given cpu number. It is not a valid assumption
on x86-64 anymore. Use cpu_possible() instead.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: use linker to offset symbols by __per_cpu_load

Impact: cleanup and bug fix

Use the linker to create symbols for certain per-cpu variables
that are offset by __per_cpu_load. This allows the removal of
the runtime fixup of the GDT pointer, which fixes a bug with
resume reported by Jiri Slaby.

Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

percpu: make PER_CPU_BASE_SECTION overridable by arches

Impact: bug fix

IA-64 needs to put percpu data in the seperate section even on UP.
Fixes regression caused by "percpu: refactor percpu.h"

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Linux 2.6.29-rc4

Merge git://git./linux/kernel/git/arjan/linux-2.6-async-update

* git://git.kernel.org/pub/scm/linux/kernel/git/arjan/linux-2.6-async-update:
  async: use list_move_tail
  async: Rename _special -> _domain for clarity.
  async: Add some documentation.
  async: Handle kthread_run() return codes.
  async: Fix running list handling.

radeonfb: Fix resume from D3Cold on some platforms

For historical reason, this driver used its own saving/restoring
of the PCI config space, and used the state of it on resume as
an indication as to whether it needed to re-POST the chip or not.

This methods breaks with the later core changes since the core will
have restored things for us.

This patch fixes it by removing that custom code, using standard
core methods to save/restore state, and testing for the need to
re-POST by comparing the content of a few key PLL registers.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

aty128fb: Properly save PCI state before changing PCI PM level

This fixes aty128fb to properly save the PCI config space -before- it
potentially switches the PM state of the chip. This avoids a
warning with the new PM core and is the right thing to do anyway.

I also replaced the hand-coded switch to D2 with a call to the
genericc pci_set_power_state() and removed the code that switches it
back to D0 since the generic code is doing that for us nowadays.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

atyfb: Properly save PCI state before changing PCI PM level

This fixes atyfb to properly save the PCI config space -before- it
potentially switches the PM state of the chip. This avoids a
warning with the new PM core and is the right thing to do anyway.

I also slightly cleaned up the code that checks whether we are
running on a PowerMac to do a runtime check instead of a compile
check only, and replaced a deprecated number with the proper
symbolic constant.

Finally, I removed the useless switch to D0 from resume since
the core does it for us.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

async: use list_move_tail

list.h provides a dedicated primitive for
"list_del followed by list_add_tail"... list_move_tail.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

async: Rename _special -> _domain for clarity.

Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

async: Add some documentation.

Add some kerneldoc to the async interface.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

async: Handle kthread_run() return codes.

If we fail to create the manager thread, fall back to non-fastboot.
If we fail to create an async thread, try again after waiting for
a bit.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

async: Fix running list handling.

async_schedule() should pass in async_running as the running
list, and run_one_entry() should put the entry to be run on
the provided running list instead of always on the generic one.

Reported-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

Merge branch 'for-linus' of git://git./linux/kernel/git/jbarnes/pci-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
  PCI PM: make the PM core more careful with drivers using the new PM framework
  PCI PM: Read power state from device after trying to change it on resume
  PCI PM: Do not disable and enable bridges during suspend-resume
  PCI: PCIe portdrv: Simplify suspend and resume
  PCI PM: Fix saving of device state in pci_legacy_suspend
  PCI PM: Check if the state has been saved before trying to restore it
  PCI PM: Fix handling of devices without drivers
  PCI: return error on failure to read PCI ROMs
  PCI: properly clean up ASPM link state on device remove

module: remove over-zealous check in __module_get()

Impact: fix spurious BUG_ON() triggered under load

module_refcount() isn't reliable outside stop_machine(), as demonstrated
by Karsten Keil <kkeil@suse.de>, networking can trigger it under load
(an inc on one cpu and dec on another while module_refcount() is tallying
can give false results, for example).

Almost noone should be using __module_get, but that's another issue.

Cc: Karsten Keil <kkeil@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'release' of git://git./linux/kernel/git/lenb/linux-acpi-2.6

* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (30 commits)
  ACPI: Kconfig text - Fix the ACPI_CONTAINER module name according to the real module name.
  eeepc-laptop: fix oops when changing backlight brightness during eeepc-laptop init
  ACPICA: Fix table entry truncation calculation
  ACPI: Enable bit 11 in _PDC to advertise hw coord
  ACPI: struct device - replace bus_id with dev_name(), dev_set_name()
  ACPI: add missing KERN_* constants to printks
  ACPI: dock: Don't eval _STA on every show_docked sysfs read
  ACPI: disable ACPI cleanly when bad RSDP found
  ACPI: delete CPU_IDLE=n code
  ACPI: cpufreq: Remove deprecated /proc/acpi/processor/../performance proc entries
  ACPI: make some IO ports off-limits to AML
  ACPICA: add debug dump of BIOS _OSI strings
  ACPI: proc_dir_entry 'video/VGA' already registered
  ACPI: Skip the first two elements in the _BCL package
  ACPI: remove BM_RLD access from idle entry path
  ACPI: remove locking from PM1x_STS register reads
  eeepc-laptop: use netlink interface
  eeepc-laptop: Implement rfkill hotplugging in eeepc-laptop
  eeepc-laptop: Check return values from rfkill_register
  eeepc-laptop: Add support for extended hotkeys
  ...

Merge branches 'release', 'asus', 'bugzilla-12450', 'cpuidle', 'debug', 'ec', 'misc', 'printk' and 'processor' into release

ACPI: Kconfig text - Fix the ACPI_CONTAINER module name according to the real module name.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>

eeepc-laptop: fix oops when changing backlight brightness during eeepc-laptop init

I got the following oops while changing the backlight brightness during
startup. When it happens, it prevents use of the hotkeys, Fn-Fx, and the
lid button.

It's a clear use-before-init, as I verified by testing with an
appropriately-placed "else printk".

BUG: unable to handle kernel NULL pointer dereference at 00000000
*pde = 00000000
Oops: 0002 [#1] PREEMPT SMP
Pid: 160, comm: kacpi_notify Not tainted (2.6.28.1-eee901 #4) 901
EIP: 0060:[<c0264e68>] [<c0264e68>] eeepc_hotk_notify+26/da
EFLAGS: 00010246 CPU: 1
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: 00000009 EBX: 00000000 ECX: 00000009 EDX: f70dbf64
ESI: 00000029 EDI: f7335188 EBP: c02112c9 ESP: f70dbf80
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
f70731e0 f73acd50 c02164ac f7335180 f70aa040 c02112e6 f733518c c012b62f
f70aa044 f70aa040 c012bdba f70aa04c 00000000 c012be6e 00000000 f70bdf80
c012e198 f70dbfc4 f70dbfc4 f70aa040 c012bdba 00000000 c012e0c9 c012e091
Call Trace:
[<c02164ac>] ? acpi_ev_notify_dispatch+4c/55
[<c02112e6>] ? acpi_os_execute_deferred+1d/25
[<c012b62f>] ? run_workqueue+71/f1
[<c012bdba>] ? worker_thread+0/bf
[<c012be6e>] ? worker_thread+b4/bf
[<c012e198>] ? autoremove_wake_function+0/2b
[<c012bdba>] ? worker_thread+0/bf
[<c012e0c9>] ? kthread+38/5f
[<c012e091>] ? kthread+0/5f
[<c0103abf>] ? kernel_thread_helper+7/10
Code: 00 00 00 00 c3 83 3d 60 5c 50 c0 00 56 89 d6 53 0f 84 c4 00 00 00 8d 42
e0 83 f8 0f 77 0f 8b 1d 68 5c 50 c0 89 d8 e8 a9 fa ff ff <89> 03 8b 1d 60 5c
50 c0 89 f2 83 e2 7f 0f b7 4c 53 10 8d 41 01

Signed-off-by: Darren Salt <linux@youmustbejoking.demon.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPICA: Fix table entry truncation calculation

During early boot, ACPI RSDT/XSDT table entries are gathered into the
'initial_tables[]' array.  This array is currently statically defined (see
./drivers/acpi/tables.c).  When there are more table entries than can be
held in the 'initial_tables[]' array, the message "Truncating N table
entries!" is output.  As currently implemented, this message will always
erroneously calculate N as 0.

This patch fixes the calculation that determines how many table entries
will be missing (truncated).

This modification may be used under either the GPL or the BSD-style
license used for Intel ACPI CA code.

Signed-off-by: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: Enable bit 11 in _PDC to advertise hw coord

Bit 11 in intel PDC definitions is meant for OS capability to handle
hardware coordination of P-states. In Linux we have always supported
hwardware coordination of P-states. Just let the BIOSes know that we
support it, by setting this bit.

Some BIOSes use this bit to choose between hardware or software coordination
and without this change below, BIOSes switch to software coordination, which
is not very optimal in terms of power consumption and extra wakeups from idle.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: struct device - replace bus_id with dev_name(), dev_set_name()

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: add missing KERN_* constants to printks

According to kerneljanitors todo list all printk calls (beginning
a new line) should have an according KERN_* constant.
Those are the missing peaces here for the acpi subsystem.

Signed-off-by: Frank Seidel <frank@f-seidel.de>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: dock: Don't eval _STA on every show_docked sysfs read

Some devices trigger a DEVICE_CHECK on every evalutation of _STA. This
can also be seen in commit 8b59560a3baf2e7c24e0fb92ea5d09eca92805db
(ACPI: dock: avoid check _STA method). If an undock is processed, the
dock driver sends a uevent and userspace might read the show_docked
property in sysfs. This causes an evaluation of _STA of the particular
device which causes the dock driver to immediately dock again.

In any case, evaluation of _STA (show_docked) does not necessarily mean
that we are docked, so check with the internal device structure.

http://bugzilla.kernel.org/show_bug.cgi?id=12360

Signed-off-by: Holger Macht <hmacht@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>

Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/security-testing-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
CRED: Fix SUID exec regression

Merge git://git./linux/kernel/git/mason/btrfs-unstable

* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (37 commits)
  Btrfs: Make sure dir is non-null before doing S_ISGID checks
  Btrfs: Fix memory leak in cache_drop_leaf_ref
  Btrfs: don't return congestion in write_cache_pages as often
  Btrfs: Only prep for btree deletion balances when nodes are mostly empty
  Btrfs: fix btrfs_unlock_up_safe to walk the entire path
  Btrfs: change btrfs_del_leaf to drop locks earlier
  Btrfs: Change btrfs_truncate_inode_items to stop when it hits the inode
  Btrfs: Don't try to compress pages past i_size
  Btrfs: join the transaction in __btrfs_setxattr
  Btrfs: Handle SGID bit when creating inodes
  Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
  Btrfs: Change btree locking to use explicit blocking points
  Btrfs: hash_lock is no longer needed
  Btrfs: disable leak debugging checks in extent_io.c
  Btrfs: sort references by byte number during btrfs_inc_ref
  Btrfs: async threads should try harder to find work
  Btrfs: selinux support
  Btrfs: make btrfs acls selectable
  Btrfs: Catch missed bios in the async bio submission thread
  Btrfs: fix readdir on 32 bit machines
  ...

eCryptfs: Regression in unencrypted filename symlinks

The addition of filename encryption caused a regression in unencrypted
filename symlink support. ecryptfs_copy_filename() is used when dealing
with unencrypted filenames and it reported that the new, copied filename
was a character longer than it should have been.

This caused the return value of readlink() to count the NULL byte of the
symlink target. Most applications don't care about the extra NULL byte,
but a version control system (bzr) helped in discovering the bug.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'x86/fixes' of git://git./linux/kernel/git/frob/linux-2.6-roland

* 'x86/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
x86-64: fix int $0x80 -ENOSYS return

x86-64: fix int $0x80 -ENOSYS return

One of my past fixes to this code introduced a different new bug.
When using 32-bit "int $0x80" entry for a bogus syscall number,
the return value is not correctly set to -ENOSYS.  This only happens
when neither syscall-audit nor syscall tracing is enabled (i.e., never
seen if auditd ever started).  Test program:

/* gcc -o int80-badsys -m32 -g int80-badsys.c
   Run on x86-64 kernel.
   Note to reproduce the bug you need auditd never to have started.  */

#include <errno.h>
#include <stdio.h>

int
main (void)
{
  long res;
  asm ("int $0x80" : "=a" (res) : "0" (99999));
  printf ("bad syscall returns %ld\n", res);
  return res != -ENOSYS;
}

The fix makes the int $0x80 path match the sysenter and syscall paths.

Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Roland McGrath <roland@redhat.com>

Merge branch 'to-linus' of git://git./linux/kernel/git/frob/linux-2.6-roland

* 'to-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/frob/linux-2.6-roland:
elf core dump: fix get_user use

elf core dump: fix get_user use

The elf_core_dump() code does its work with set_fs(KERNEL_DS) in force,
so vma_dump_size() needs to switch back with set_fs(USER_DS) to safely
use get_user() for a normal user-space address.

Checking for VM_READ optimizes out the case where get_user() would fail
anyway. The vm_file check here was already superfluous given the control
flow earlier in the function, so that is a cleanup/optimization unrelated
to other changes but an obvious and trivial one.

Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Roland McGrath <roland@redhat.com>

CRED: Fix SUID exec regression

The patch:

commit a6f76f23d297f70e2a6b3ec607f7aeeea9e37e8d
CRED: Make execve() take advantage of copy-on-write credentials

moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called.  This means that LSM_UNSAFE_SHARE is now
calculated incorrectly.  This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made.  All of which are true for threads created by the
pthread library.

However, since we wish to make the security calculation before irrevocably
damaging the process so that we can return it an error code in the case where
we decide we want to reject the exec request on this basis, we have to make the
determination before calling de_thread().

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us.  These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

We do have to be careful because CLONE_THREAD does not imply FS or FILES.

We _assume_ that there will be no extra references to these structs held by the
threads we're going to kill.

This can be tested with the attached pair of programs.  Build the two programs
using the Makefile supplied, and run ./test1 as a non-root user.  If
successful, you should see something like:

[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=0 suid=0
SUCCESS - Correct effective user ID

and if unsuccessful, something like:

[dhowells@andromeda tmp]$ ./test1
--TEST1--
uid=4043, euid=4043 suid=4043
exec ./test2
--TEST2--
uid=4043, euid=4043 suid=4043
ERROR - Incorrect effective user ID!

The non-root user ID you see will depend on the user you run as.

[test1.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static void *thread_func(void *arg)
{
while (1) {}
}

int main(int argc, char **argv)
{
pthread_t tid;
uid_t uid, euid, suid;

printf("--TEST1--\n");
getresuid(&uid, &euid, &suid);
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

if (pthread_create(&tid, NULL, thread_func, NULL) < 0) {
perror("pthread_create");
exit(1);
}

printf("exec ./test2\n");
execlp("./test2", "test2", NULL);
perror("./test2");
_exit(1);
}

[test2.c]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
uid_t uid, euid, suid;

getresuid(&uid, &euid, &suid);
printf("--TEST2--\n");
printf("uid=%d, euid=%d suid=%d\n", uid, euid, suid);

if (euid != 0) {
fprintf(stderr, "ERROR - Incorrect effective user ID!\n");
exit(1);
}
printf("SUCCESS - Correct effective user ID\n");
exit(0);
}

[Makefile]
CFLAGS = -D_GNU_SOURCE -Wall -Werror -Wunused
all: test1 test2

test1: test1.c
gcc $(CFLAGS) -o test1 test1.c -lpthread

test2: test2.c
gcc $(CFLAGS) -o test2 test2.c
sudo chown root.root test2
sudo chmod +s test2

Reported-by: David Smith <dsmith@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: David Smith <dsmith@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>

vfs: Don't call attach_nobh_buffers() with an empty list

This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>

nobh_write_end() could call attach_nobh_buffers() with head == NULL.
This would result in a trap when attach_nobh_buffers() attempted to
access bh->b_this_page.

This can be illustrated by running the writev01 testcase from LTP on jfs.

This error was introduced by commit 5b41e74a "vfs: fix data leak in
nobh_write_end()". That patch did not take into account that if
PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
buffers will be allocated for the page. In that case, we won't have to
worry about a failed write leaving unitialized data in the page.

Of course, head != NULL implies !page_has_buffers(page), so no need to
test both.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Dmitri Monakhov <dmonakhov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git./linux/kernel/git/tiwai/sound-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: hda - Add missing COEF initialization for ALC887
  ALSA: hda - Add missing initialization for ALC272
  sound: usb-audio: handle wMaxPacketSize for FIXED_ENDPOINT devices
  ALSA: hda - Fix misc workqueue issues
  ALSA: hda - Add quirk for FSC Amilo Xi2550

ACPI: disable ACPI cleanly when bad RSDP found

When ACPI is disabled in the BIOS of this VIA C3 box,
it invalidates the RSDP, which Linux notices:

ACPI Error (tbxfroot-0218): A valid RSDP was not found [20080926]

Bug Linux neglected to disable ACPI at that stage,
and later scribbled on smp_found_config:

ACPI: No APIC-table, disabling MPS

But this box doesn't run well in legacy PIC mode,
it needed IOAPIC mode to perform correctly:

http://lkml.org/lkml/2009/2/5/39

So exit ACPI mode cleanly when we first detect
that it is hopeless.

Signed-off-by: Len Brown <len.brown@intel.com>

ACPI: delete CPU_IDLE=n code

CPU_IDLE=y has been default for ACPI=y since Nov-2007,
and has shipped in many distributions since then.

Here we delete the CPU_IDLE=n ACPI idle code, since
nobody should be using it, and we don't want to
maintain two versions.

Signed-off-by: Len Brown <len.brown@intel.com>

Merge branch 'for-linus' of git://git./linux/kernel/git/ieee1394/linux1394-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
ieee1394: dv1394: move deprecation message from module init to file open
firewire: core: Remove card from list of cards when enable fails

Add Sascha Hauer to .mailmap

This fixes the shortlog attribution e.g. for 106757b38fff

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

add another mailmap entry for Uwe Kleine-König

I created commit 7971db5a4b4176ad5df590fce07a962c643a2740 on a machine
where I forgot to set user.name and user.email before. The default
values were not optimal.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fork.c: fix NULL pointer dereference when nr_threads == threads-max

I happened to forked lots of processes, and hit NULL pointer dereference.
It is because in copy_process() after checking max_threads, 0 is returned
but not -EAGAIN.

The bug is introduced by "CRED: Detach the credentials from task_struct"
(commit f1752eec6145c97163dbce62d17cf5d928e28a27).

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Btrfs: Make sure dir is non-null before doing S_ISGID checks

The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md:
  md: Ensure an md array never has too many devices.
  md: Fix a bug in linear.c causing which_dev() to return the wrong device.
  md: Allow read error in a single drive raid1 to be passed up.

ieee1394: dv1394: move deprecation message from module init to file open

On many Linux installations, the dv1394 driver will be auto-loaded
whenever an AV/C device (e.g. camcorder or audio device) is plugged in.
An irritating message would then appear in the kernel log.

Defer this message to until a dv1394 character device file is actually
used by a program. Also include the program name in the message and
update the message slightly.

Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>

Merge branch 'fix/usb-audio' into for-linus

Merge branch 'fix/hda' into for-linus

ALSA: hda - Add missing COEF initialization for ALC887

Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: hda - Add missing initialization for ALC272

ALC272 needs EAPD for speaker outputs as well as other similar ALC
codecs.

Signed-off-by: Takashi Iwai <tiwai@suse.de>

sound: usb-audio: handle wMaxPacketSize for FIXED_ENDPOINT devices

For audio devices that do not have proper audio descriptors (e.g.,
Edirol UA-20), we use hardcoded parameters from our quirks list.
However, we must still read the maximum packet size from the standard
endpoint descriptor; otherwise, we might use packets that are too big
and therefore rejected by the USB core.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>

md: Ensure an md array never has too many devices.

Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>

md: Fix a bug in linear.c causing which_dev() to return the wrong device.

ab5bd5cbc8d4b868378d062eed3d4240930fbb86 introduced the following
bug in linear software raid for large arrays on 32 bit machines:

which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.

Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.

The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.

As reported by Simon Kirby, this leads to errors such as

linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464

Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.

Cc: stable@kernel.org
Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>

md: Allow read error in a single drive raid1 to be passed up.

If a raid1 only has a single working device and gets a read error,
we choose to simply return that error up to the filesystem (or whatever)
rather than failing the whole array.

However the codes doesn't quite do that. We attempt a readbalance
which allocates the same drive, so we retry the read - indefinitely.

Instead: If read_balance in the error case chooses the same drive that just
failed, treat it as a failure and don't retry.

Signed-off-by: NeilBrown <neilb@suse.de>

prevent kprobes from catching spurious page faults

Prevent kprobes from catching spurious faults which will cause infinite
recursive page-fault and memory corruption by stack overflow.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>