openwrt/staging/blogic.git
19 years ago[PATCH] msync: check pte dirty earlier
Abhijit Karmarkar [Wed, 22 Jun 2005 00:15:13 +0000 (17:15 -0700)]
[PATCH] msync: check pte dirty earlier

It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.

sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.

But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address.  So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.

Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] can_share_swap_page: use page_mapcount
Hugh Dickins [Wed, 22 Jun 2005 00:15:12 +0000 (17:15 -0700)]
[PATCH] can_share_swap_page: use page_mapcount

Remember that ironic get_user_pages race?  when the raised page_count on a
page swapped out led do_wp_page to decide that it had to copy on write, so
substituted a different page into userspace.  2.6.7 onwards have Andrea's
solution, where try_to_unmap_one backs out if it finds page_count raised.

Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
and was found a few months ago to hang an intensive page migration test.  A
year ago I was hesitant to engage page_mapcount, now it seems the right fix.

So remove the page_count hack from try_to_unmap_one; and use activate_page in
unuse_mm when dropping lock, to replace its secondary effect of helping
swapoff to make progress in that case.

Simplify can_share_swap_page (now called only on anonymous pages) to check
page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
keep sum on the high side, and correct when can_share_swap_page called.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] do_wp_page: cannot share file page
Hugh Dickins [Wed, 22 Jun 2005 00:15:11 +0000 (17:15 -0700)]
[PATCH] do_wp_page: cannot share file page

A small optimization to do_wp_page's check for whether to avoid copy by
reusing the page already mapped.  It can never share a cached file page,
nor can it share a reserved page (often the empty zero page), so it's a
waste of time to lock and unlock in those cases.  Which nowadays can both
be neatly excluded by a preliminary PageAnon test.

Christoph has reported that a preliminary page_count test proved valuable
for scalability here, but PageAnon covers more common cases all at once.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] get_user_pages: kill get_page_map
Hugh Dickins [Wed, 22 Jun 2005 00:15:10 +0000 (17:15 -0700)]
[PATCH] get_user_pages: kill get_page_map

Since its birth, get_user_pages has been calling a misguided get_page_map
function.  follow_page has already returned NULL if the pfn is invalid, we
cannot reach an invalid pfn from a validated struct page.

Remove get_page_map, and the messy rewind in get_user_pages to cope with
its failure.  Oh, and could we please call that "struct page *page" like
everywhere else, instead of "struct page *map"?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] rme96xx: fix PageReserved range
Hugh Dickins [Wed, 22 Jun 2005 00:15:09 +0000 (17:15 -0700)]
[PATCH] rme96xx: fix PageReserved range

rme96xx busmaster_malloc miscalculates and fails to set PageReserved on any
page of char *buf; but busmaster_free does it right, so do the same (I
don't have the card, just noticed this while sifting for rmap BUGs).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] bad_page: clear reclaim and slab
Hugh Dickins [Wed, 22 Jun 2005 00:15:08 +0000 (17:15 -0700)]
[PATCH] bad_page: clear reclaim and slab

Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] dup_mmap: update comment on new vma
Hugh Dickins [Wed, 22 Jun 2005 00:15:08 +0000 (17:15 -0700)]
[PATCH] dup_mmap: update comment on new vma

Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
came in, try_to_unmap_one knows the vma without needing find_vma.  But add
a comment to note that here vma is inserted without mmap_sem.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] mbind: check_range use standard ptwalk
Hugh Dickins [Wed, 22 Jun 2005 00:15:07 +0000 (17:15 -0700)]
[PATCH] mbind: check_range use standard ptwalk

Strict mbind's check for currently mapped pages being on node has been
using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
replace that by a standard four-level page table walk like others in mm.
Since mmap_sem is held for writing, page_table_lock can be taken at the
inner level to limit latency.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] mbind: fix verify_pages pte_page
Hugh Dickins [Wed, 22 Jun 2005 00:15:06 +0000 (17:15 -0700)]
[PATCH] mbind: fix verify_pages pte_page

Strict mbind's check that pages already mapped are on right node has been
using pte_page without checking if pfn_valid, and without page_table_lock
to prevent spurious failures when try_to_unmap_one intervenes between the
pte_present and the pte_page.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] ia64: pfn_to_nid() implementation
Bob Picco [Wed, 22 Jun 2005 00:15:05 +0000 (17:15 -0700)]
[PATCH] ia64: pfn_to_nid() implementation

pfn_to_nid is undefined.  We haven't had this interface on ia64.  The
sys_mbind patches need it.

Oh, the paddr_to_nid call could fail when DISCONTIG+NUMA is configured
because there isn't any ACPI SRAT NUMA information.

Signed-off-by: Bob Picco <bob.picco@hp.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] shmem: restore superblock info
Hugh Dickins [Wed, 22 Jun 2005 00:15:04 +0000 (17:15 -0700)]
[PATCH] shmem: restore superblock info

To improve shmem scalability, we allowed tmpfs instances which don't need
their blocks or inodes limited not to count them, and not to allocate any
sbinfo.  Which was okay when the only use for the sbinfo was accounting
blocks and inodes; but since then a couple of unrelated projects extending
tmpfs want to store other data in the sbinfo.  Whether either extension
reaches mainline is beside the point: I'm guilty of a bad design decision,
and should restore sbinfo to make any such future extensions easier.

So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
inodes.  Brent Casavant verified (many months ago) that this does not
perceptibly impact the scalability (since the unlimited sbinfo cacheline is
repeatedly accessed but only once dirtied).

And merge shmem_set_size into its sole caller shmem_remount_fs.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] SN2 XPC build patches
Jes Sorensen [Wed, 22 Jun 2005 00:15:03 +0000 (17:15 -0700)]
[PATCH] SN2 XPC build patches

This patch contains the bits to make the XPC code use the uncached
allocator rather than calling into the mspec driver.  It also includes the
mspec.h header which is required to build the XPC modules.

Signed-off-by: Jes Sorensen <jes@wildopensource.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] ia64 uncached alloc
Jes Sorensen [Wed, 22 Jun 2005 00:15:02 +0000 (17:15 -0700)]
[PATCH] ia64 uncached alloc

This patch contains the ia64 uncached page allocator and the generic
allocator (genalloc).  The uncached allocator was formerly part of the SN2
mspec driver but there are several other users of it so it has been split
off from the driver.

The generic allocator can be used by device driver to manage special memory
etc.  The generic allocator is based on the allocator from the sym53c8xx_2
driver.

Various users on ia64 needs uncached memory.  The SGI SN architecture requires
it for inter-partition communication between partitions within a large NUMA
cluster.  The specific user for this is the XPC code.  Another application is
large MPI style applications which use it for synchronization, on SN this can
be done using special 'fetchop' operations but it also benefits non SN
hardware which may use regular uncached memory for this purpose.  Performance
of doing this through uncached vs cached memory is pretty substantial.  This
is handled by the mspec driver which I will push out in a seperate patch.

Rather than creating a specific allocator for just uncached memory I came up
with genalloc which is a generic purpose allocator that can be used by device
drivers and other subsystems as they please.  For instance to handle onboard
device memory.  It was derived from the sym53c7xx_2 driver's allocator which
is also an example of a potential user (I am refraining from modifying sym2
right now as it seems to have been under fairly heavy development recently).

On ia64 memory has various properties within a granule, ie.  it isn't safe to
access memory as uncached within the same granule as currently has memory
accessed in cached mode.  The regular system therefore doesn't utilize memory
in the lower granules which is mixed in with device PAL code etc.  The
uncached driver walks the EFI memmap and pulls out the spill uncached pages
and sticks them into the uncached pool.  Only after these chunks have been
utilized, will it start converting regular cached memory into uncached memory.
Hence the reason for the EFI related code additions.

Signed-off-by: Jes Sorensen <jes@wildopensource.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] Reduce size of huge boot per_cpu_pageset
Christoph Lameter [Wed, 22 Jun 2005 00:15:00 +0000 (17:15 -0700)]
[PATCH] Reduce size of huge boot per_cpu_pageset

Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset.  Use one specially
configured pageset per cpu for all zones and nodes during bootup.

- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
  in free_hot_cold_page (otherwise we would have to hold a page
  in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
  for the boot pageset. A boot pageset is not necessary for cpu hotplug.

Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] Periodically drain non local pagesets
Christoph Lameter [Wed, 22 Jun 2005 00:14:57 +0000 (17:14 -0700)]
[PATCH] Periodically drain non local pagesets

The pageset array can potentially acquire a huge amount of memory on large
NUMA systems.  F.e.  on a system with 512 processors and 256 nodes there
will be 256*512 pagesets.  If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets.  The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets.  Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.

The slab allocator flushes its per cpu caches every 2 seconds.  The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.

The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] add OOM debug
Janet Morgan [Wed, 22 Jun 2005 00:14:56 +0000 (17:14 -0700)]
[PATCH] add OOM debug

This patch provides more debug info when the system is OOM.  It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.

Thanks to Dave Jones for coming up with the idea.

Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] __read_page_state(): pass unsigned long instead of unsigned
Benjamin LaHaise [Wed, 22 Jun 2005 00:14:55 +0000 (17:14 -0700)]
[PATCH] __read_page_state(): pass unsigned long instead of unsigned

By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] __mod_page_state(): pass unsigned long instead of unsigned
Benjamin LaHaise [Wed, 22 Jun 2005 00:14:54 +0000 (17:14 -0700)]
[PATCH] __mod_page_state(): pass unsigned long instead of unsigned

By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument.  This saves 1 instruction on x86-64.

Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] vm: try_to_free_pages unused argument
Darren Hart [Wed, 22 Jun 2005 00:14:53 +0000 (17:14 -0700)]
[PATCH] vm: try_to_free_pages unused argument

try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0.  The following patch removes the argument and updates all the
calls to try_to_free_pages.

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] mm: remove PG_highmem
Badari Pulavarty [Wed, 22 Jun 2005 00:14:52 +0000 (17:14 -0700)]
[PATCH] mm: remove PG_highmem

Remove PG_highmem, to save a page flag.  Use is_highmem() instead.  It'll
generate a little more code, but we don't use PageHigheMem() in many places.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] mmap topdown fix for large stack limit, large allocation
Chris Wright [Wed, 22 Jun 2005 00:14:52 +0000 (17:14 -0700)]
[PATCH] mmap topdown fix for large stack limit, large allocation

The topdown changes in 2.6.12-rc1 can cause large allocations with large
stack limit to fail, despite there being space available.  The
mmap_base-len is only valid when len >= mmap_base.  However, nothing in
topdown allocator checks this.  It's only (now) caught at higher level,
which will cause allocation to simply fail.  The following change restores
the fallback to bottom-up path, which will allow large allocations with
large stack limit to potentially still succeed.

Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] Avoiding mmap fragmentation
Wolfgang Wander [Wed, 22 Jun 2005 00:14:49 +0000 (17:14 -0700)]
[PATCH] Avoiding mmap fragmentation

Ingo recently introduced a great speedup for allocating new mmaps using the
free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
causes huge performance increases in thread creation.

The downside of this patch is that it does lead to fragmentation in the
mmap-ed areas (visible via /proc/self/maps), such that some applications
that work fine under 2.4 kernels quickly run out of memory on any 2.6
kernel.

The problem is twofold:

  1) the free_area_cache is used to continue a search for memory where
     the last search ended.  Before the change new areas were always
     searched from the base address on.

     So now new small areas are cluttering holes of all sizes
     throughout the whole mmap-able region whereas before small holes
     tended to close holes near the base leaving holes far from the base
     large and available for larger requests.

  2) the free_area_cache also is set to the location of the last
     munmap-ed area so in scenarios where we allocate e.g.  five regions of
     1K each, then free regions 4 2 3 in this order the next request for 1K
     will be placed in the position of the old region 3, whereas before we
     appended it to the still active region 1, placing it at the location
     of the old region 2.  Before we had 1 free region of 2K, now we only
     get two free regions of 1K -> fragmentation.

The patch addresses thes issues by introducing yet another cache descriptor
cached_hole_size that contains the largest known hole size below the
current free_area_cache.  If a new request comes in the size is compared
against the cached_hole_size and if the request can be filled with a hole
below free_area_cache the search is started from the base instead.

The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
(earlier posted) leakme.c test program terminates after 50000+ iterations
with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
(as expected) with thread creation, Ingo's test_str02 with 20000 threads
requires 0.7s system time.

Taking out Ingo's patch (un-patch available per request) by basically
deleting all mentions of free_area_cache from the kernel and starting the
search for new memory always at the respective bases we observe: leakme
terminates successfully with 11 distinctive hardly fragmented areas in
/proc/self/maps but thread creating is gringdingly slow: 30+s(!) system
time for Ingo's test_str02 with 20000 threads.

Now - drumroll ;-) the appended patch works fine with leakme: it ends with
only 7 distinct areas in /proc/self/maps and also thread creation seems
sufficiently fast with 0.71s for 20000 threads.

Signed-off-by: Wolfgang Wander <wwc@rentec.com>
Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] node local per-cpu-pages
Christoph Lameter [Wed, 22 Jun 2005 00:14:47 +0000 (17:14 -0700)]
[PATCH] node local per-cpu-pages

This patch modifies the way pagesets in struct zone are managed.

Each zone has a per-cpu array of pagesets.  So any particular CPU has some
memory in each zone structure which belongs to itself.  Even if that CPU is
not local to that zone.

So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone.  This means that the operations to manage pages on remote zone can be
done with information available locally.

We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.

AIM7 benchmark on a 32 CPU SGI Altix

w/o patches:
Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      484.68  100       484.6769     12.01      1.97   Fri Mar 25 11:01:42 2005
  100    27140.46   89       271.4046     21.44    148.71   Fri Mar 25 11:02:04 2005
  200    30792.02   82       153.9601     37.80    296.72   Fri Mar 25 11:02:42 2005
  300    32209.27   81       107.3642     54.21    451.34   Fri Mar 25 11:03:37 2005
  400    34962.83   78        87.4071     66.59    588.97   Fri Mar 25 11:04:44 2005
  500    31676.92   75        63.3538     91.87    742.71   Fri Mar 25 11:06:16 2005
  600    36032.69   73        60.0545     96.91    885.44   Fri Mar 25 11:07:54 2005
  700    35540.43   77        50.7720    114.63   1024.28   Fri Mar 25 11:09:49 2005
  800    33906.70   74        42.3834    137.32   1181.65   Fri Mar 25 11:12:06 2005
  900    34120.67   73        37.9119    153.51   1325.26   Fri Mar 25 11:14:41 2005
 1000    34802.37   74        34.8024    167.23   1465.26   Fri Mar 25 11:17:28 2005

with slab API changes and pageset patch:

Tasks    jobs/min  jti  jobs/min/task      real       cpu
    1      485.00  100       485.0000     12.00      1.96   Fri Mar 25 11:46:18 2005
  100    28000.96   89       280.0096     20.79    150.45   Fri Mar 25 11:46:39 2005
  200    32285.80   79       161.4290     36.05    293.37   Fri Mar 25 11:47:16 2005
  300    40424.15   84       134.7472     43.19    438.42   Fri Mar 25 11:47:59 2005
  400    39155.01   79        97.8875     59.46    590.05   Fri Mar 25 11:48:59 2005
  500    37881.25   82        75.7625     76.82    730.19   Fri Mar 25 11:50:16 2005
  600    39083.14   78        65.1386     89.35    872.79   Fri Mar 25 11:51:46 2005
  700    38627.83   77        55.1826    105.47   1022.46   Fri Mar 25 11:53:32 2005
  800    39631.94   78        49.5399    117.48   1169.94   Fri Mar 25 11:55:30 2005
  900    36903.70   79        41.0041    141.94   1310.78   Fri Mar 25 11:57:53 2005
 1000    36201.23   77        36.2012    160.77   1458.31   Fri Mar 25 12:00:34 2005

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] Hugepage consolidation
David Gibson [Wed, 22 Jun 2005 00:14:44 +0000 (17:14 -0700)]
[PATCH] Hugepage consolidation

A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This patch
attempts to consolidate a lot of the code across the arch's, putting the
combined version in mm/hugetlb.c.  There are a couple of uglyish hacks in
order to covert all the hugepage archs, but the result is a very large
reduction in the total amount of code.  It also means things like hugepage
lazy allocation could be implemented in one place, instead of six.

Tested, at least a little, on ppc64, i386 and x86_64.

Notes:
- this patch changes the meaning of set_huge_pte() to be more
  analagous to set_pte()
- does SH4 need s special huge_ptep_get_and_clear()??

Acked-by: William Lee Irwin <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] VM: rate limit early reclaim
Martin Hicks [Wed, 22 Jun 2005 00:14:43 +0000 (17:14 -0700)]
[PATCH] VM: rate limit early reclaim

When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory.  This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] VM: add __GFP_NORECLAIM
Martin Hicks [Wed, 22 Jun 2005 00:14:42 +0000 (17:14 -0700)]
[PATCH] VM: add __GFP_NORECLAIM

When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.

This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts.  The example that is implemented here is for page
cache pages.  We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] VM: early zone reclaim
Martin Hicks [Wed, 22 Jun 2005 00:14:41 +0000 (17:14 -0700)]
[PATCH] VM: early zone reclaim

This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

wall  user   sys   %cpu  ctx sw.  sleeps
----  ----   ---   ----   ------  ------
No patch 1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] VM: add may_swap flag to scan_control
Martin Hicks [Wed, 22 Jun 2005 00:14:40 +0000 (17:14 -0700)]
[PATCH] VM: add may_swap flag to scan_control

Here's the next round of these patches.  These are totally different in
an attempt to meet the "simpler" request after the last patches.  For
reference the earlier threads are:

http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2

This set of patches replaces my other vm- patches that are currently in
-mm.  So they're against 2.6.12-rc5-mm1 about half way through the -mm
patchset.

As I said already this patch is a lot simpler.  The reclaim is turned on
or off on a per-zone basis using a syscall.  I haven't tested the x86
syscall, so it might be wrong.  It uses the existing reclaim/pageout
code with the small addition of a may_swap flag to scan_control
(patch 1/4).

I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
types can be flagged to never cause reclaim.  This was a deficiency
that was in all of my earlier patch sets.  Previously, doing a big
buffered read would fill one zone with page cache and then start to
reclaim from that same zone, leaving the other zones untouched.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

wall  user   sys   %cpu  ctx sw.  sleeps
----  ----   ---   ----   ------  ------
No patch 1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

This patch:

This adds an extra switch to the scan_control struct.  It simply lets the
reclaim code know if its allowed to swap pages out.

This was required for a simple per-zone reclaimer.  Without this addition
pages would be swapped out as soon as a zone ran out of memory and the early
reclaim kicked in.

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] mm: add /proc/zoneinfo
Nikita Danilov [Wed, 22 Jun 2005 00:14:38 +0000 (17:14 -0700)]
[PATCH] mm: add /proc/zoneinfo

Add /proc/zoneinfo file to display information about memory zones.  Useful
to analyze VM behaviour.

Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] madvise: merge the maps
Prasanna Meda [Wed, 22 Jun 2005 00:14:37 +0000 (17:14 -0700)]
[PATCH] madvise: merge the maps

This attempts to merge back the split maps.  This code is mostly copied
from Chrisw's mlock merging from post 2.6.11 trees.  The only difference is
in munmapped_error handling.  Also passed prev to willneed/dontneed,
eventhogh they do not handle it now, since I felt it will be cleaner,
instead of handling prev in madvise_vma in some cases and in subfunction in
some cases.

Signed-off-by: Prasanna Meda <pmeda@akamai.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] madvise: do not split the maps
Prasanna Meda [Wed, 22 Jun 2005 00:14:36 +0000 (17:14 -0700)]
[PATCH] madvise: do not split the maps

This attempts to avoid splittings when it is not needed, that is when
vm_flags are same as new flags.  The idea is from the <2.6.11 mlock_fixup
and others.  This will provide base for the next madvise merging patch.

Signed-off-by: Prasanna Meda <pmeda@akamai.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] vmscan: notice slab shrinking
akpm@osdl.org [Wed, 22 Jun 2005 00:14:35 +0000 (17:14 -0700)]
[PATCH] vmscan: notice slab shrinking

Fix a problem identified by Andrea Arcangeli <andrea@suse.de>

kswapd will set a zone into all_unreclaimable state if it sees that we're not
successfully reclaiming LRU pages.  But that fails to notice that we're
successfully reclaiming slab obects, so we can set all_unreclaimable too soon.

So change shrink_slab() to return a success indication if it actually
reclaimed some objects, and don't assume that the zone is all_unreclaimable if
that is true.  This means that we won't enter all_unreclaimable state if we
are successfully freeing slab objects but we're not yet actually freeing slab
pages, due to internal fragmentation.

(hm, this has a shortcoming.  We could be successfully freeing ZONE_NORMAL
slab objects while being really oom on ZONE_DMA.  If that happens then kswapd
might burn a lot of CPU.  But given that there might be some slab objects in
ZONE_DMA, perhaps that is appropriate.)

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] smp_processor_id() cleanup
Ingo Molnar [Wed, 22 Jun 2005 00:14:34 +0000 (17:14 -0700)]
[PATCH] smp_processor_id() cleanup

This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.

The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.

Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.

In the new code, there are two externally visible symbols:

 - smp_processor_id(): debug variant.

 - raw_smp_processor_id(): nondebug variant. Replaces all existing
   uses of _smp_processor_id() and __smp_processor_id(). Defined
   by every SMP architecture in include/asm-*/smp.h.

There is one new internal symbol, dependent on DEBUG_PREEMPT:

 - debug_smp_processor_id(): internal debug variant, mapped to
                             smp_processor_id().

Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file.  All related comments got updated and/or
clarified.

I have build/boot tested the following 8 .config combinations on x86:

 {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
architectures are untested, but should work just fine.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes
Suresh Siddha [Wed, 22 Jun 2005 00:14:32 +0000 (17:14 -0700)]
[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes

Appended patch will setup compatibility mode TASK_SIZE properly.  This will
fix atleast three known bugs that can be encountered while running
compatibility mode apps.

a) A malicious 32bit app can have an elf section at 0xffffe000.  During
   exec of this app, we will have a memory leak as insert_vm_struct() is
   not checking for return value in syscall32_setup_pages() and thus not
   freeing the vma allocated for the vsyscall page.  And instead of exec
   failing (as it has addresses > TASK_SIZE), we were allowing it to
   succeed previously.

b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area
   may return addresses beyond 32bits, ultimately causing corruption
   because of wrap-around and resulting in SEGFAULT, instead of returning
   ENOMEM.

c) 32bit app doing this below mmap will now fail.

  mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE,
MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0);

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] coverity: idr_get_new_above_int() overrun fix
Zaur Kambarov [Wed, 22 Jun 2005 00:14:31 +0000 (17:14 -0700)]
[PATCH] coverity: idr_get_new_above_int() overrun fix

This patch fixes overrun of array pa:
92    struct idr_layer *pa[MAX_LEVEL];

in

98    l = idp->layers;
99    pa[l--] = NULL;

by passing idp->layers, set in
202   idp->layers = layers;
to function  sub_alloc in
203   v = sub_alloc(idp, ptr, &id);

Signed-off-by: Zaur Kambarov <zkambarov@coverity.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] coverity: ipmi: avoid overrun of ipmi_interfaces[]
Zaur Kambarov [Wed, 22 Jun 2005 00:14:30 +0000 (17:14 -0700)]
[PATCH] coverity: ipmi: avoid overrun of ipmi_interfaces[]

Fix overrun of static array "ipmi_interfaces" of size 4 at position 4 with
index variable "if_num".

Definitions involved:
297   #define MAX_IPMI_INTERFACES 4
298   static ipmi_smi_t ipmi_interfaces[MAX_IPMI_INTERFACES];

Signed-off-by: Zaur Kambarov <zkambarov@coverity.com>
Cc: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] megaraid build fix
bobl [Wed, 22 Jun 2005 00:14:29 +0000 (17:14 -0700)]
[PATCH] megaraid build fix

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[PATCH] arm: irqs_disabled() type fix
Andrew Morton [Wed, 22 Jun 2005 00:14:28 +0000 (17:14 -0700)]
[PATCH] arm: irqs_disabled() type fix

kernel/sched.c: In function `__might_sleep':
kernel/sched.c:5461: warning: int format, long unsigned int arg (arg 3)

We expect irqs_disabled() to return an int (poor man's bool).

Acked-by: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years agoMerge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6
Linus Torvalds [Wed, 22 Jun 2005 01:19:10 +0000 (18:19 -0700)]
Merge /pub/scm/linux/kernel/git/davem/sparc-2.6

19 years ago[SPARC64]: Add prefetch support.
David S. Miller [Tue, 21 Jun 2005 23:20:28 +0000 (16:20 -0700)]
[SPARC64]: Add prefetch support.

The implementation is optimal for UltraSPARC-III and later.
It will work, however suboptimally, on UltraSPARC-II and
be treated as a NOP on UltraSPARC-I.

It is not worth code patching this thing as the highest cost
is the code space, and code patching cannot eliminate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
19 years agoMerge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Linus Torvalds [Tue, 21 Jun 2005 22:45:19 +0000 (15:45 -0700)]
Merge /pub/scm/linux/kernel/git/davem/net-2.6

19 years ago[PATCH] devfs: remove devfs from Kconfig preventing it from being built
Greg KH [Tue, 21 Jun 2005 22:24:19 +0000 (15:24 -0700)]
[PATCH] devfs: remove devfs from Kconfig preventing it from being built

Here's a much smaller patch to simply disable devfs from the build.  If
this goes well, and there are no complaints for a few weeks, I'll resend
my big "devfs-die-die-die" series of patches that rip the whole thing
out of the kernel tree.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
19 years ago[SPARC64]: Fix cmsg length checks in Solaris emulation layer.
David S. Miller [Tue, 21 Jun 2005 22:39:22 +0000 (15:39 -0700)]
[SPARC64]: Fix cmsg length checks in Solaris emulation layer.

Signed-off-by: David S. Miller <davem@davemloft.net>
19 years agoMerge 'for-linus' branch of rsync://rsync.kernel.org/pub/scm/linux/kernel/git/shaggy...
Linus Torvalds [Tue, 21 Jun 2005 21:49:35 +0000 (14:49 -0700)]
Merge 'for-linus' branch of /linux/kernel/git/shaggy/jfs-2.6

19 years ago[IPV4]: Fix fib_trie.c's args to fib_dump_info().
David S. Miller [Tue, 21 Jun 2005 21:43:28 +0000 (14:43 -0700)]
[IPV4]: Fix fib_trie.c's args to fib_dump_info().

Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Fix ip6t_LOG sit tunnel logging
Patrick McHardy [Tue, 21 Jun 2005 21:07:13 +0000 (14:07 -0700)]
[NETFILTER]: Fix ip6t_LOG sit tunnel logging

Sit tunnel logging is currently broken:

MAC=01:23:45:67:89:ab->01:23:45:47:89:ac TUNNEL=123.123.  0.123-> 12.123.  6.123

Apart from the broken IP address, MAC addresses are printed differently
for sit tunnels than for everything else.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Drop conntrack reference in ip_call_ra_chain()/ip_mr_input()
Patrick McHardy [Tue, 21 Jun 2005 21:06:24 +0000 (14:06 -0700)]
[NETFILTER]: Drop conntrack reference in ip_call_ra_chain()/ip_mr_input()

Drop reference before handing the packets to raw_rcv()

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Check TCP checksum in ipt_REJECT
Patrick McHardy [Tue, 21 Jun 2005 21:03:46 +0000 (14:03 -0700)]
[NETFILTER]: Check TCP checksum in ipt_REJECT

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Avoid unncessary checksum validation in UDP connection tracking
Keir Fraser [Tue, 21 Jun 2005 21:03:23 +0000 (14:03 -0700)]
[NETFILTER]: Avoid unncessary checksum validation in UDP connection tracking

Signed-off-by: Keir Fraser <Keir.Fraser@xl.cam.ac.uk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Missing owner-field initialization in ip6table_raw
Patrick McHardy [Tue, 21 Jun 2005 21:03:01 +0000 (14:03 -0700)]
[NETFILTER]: Missing owner-field initialization in ip6table_raw

I missed this one when fixing up iptable_raw.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: expectation timeouts are compulsory
Phil Oester [Tue, 21 Jun 2005 21:02:42 +0000 (14:02 -0700)]
[NETFILTER]: expectation timeouts are compulsory

Since expectation timeouts were made compulsory [1], there is no need to
check for them in ip_conntrack_expect_insert.

[1] https://lists.netfilter.org/pipermail/netfilter-devel/2005-January/018143.html

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Restore netfilter assumptions in IPv6 multicast
Patrick McHardy [Tue, 21 Jun 2005 21:02:15 +0000 (14:02 -0700)]
[NETFILTER]: Restore netfilter assumptions in IPv6 multicast

Netfilter assumes that skb->data == skb->nh.ipv6h

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Kill nf_debug
Patrick McHardy [Tue, 21 Jun 2005 21:01:57 +0000 (14:01 -0700)]
[NETFILTER]: Kill nf_debug

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETFILTER]: Kill lockhelp.h
Patrick McHardy [Tue, 21 Jun 2005 21:01:30 +0000 (14:01 -0700)]
[NETFILTER]: Kill lockhelp.h

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[IPV6]: multicast join and misc
David L Stevens [Tue, 21 Jun 2005 20:58:25 +0000 (13:58 -0700)]
[IPV6]: multicast join and misc

Here is a simplified version of the patch to fix a bug in IPv6
multicasting. It:

1) adds existence check & EADDRINUSE error for regular joins
2) adds an exception for EADDRINUSE in the source-specific multicast
        join (where a prior join is ok)
3) adds a missing/needed read_lock on sock_mc_list; would've raced
        with destroying the socket on interface down without
4) adds a "leave group" in the (INCLUDE, empty) source filter case.
        This frees unneeded socket buffer memory, but also prevents
        an inappropriate interaction among the 8 socket options that
        mess with this. Some would fail as if in the group when you
        aren't really.

Item #4 had a locking bug in the last version of this patch; rather than
removing the idev->lock read lock only, I've simplified it to remove
all lock state in the path and treat it as a direct "leave group" call for
the (INCLUDE,empty) case it covers. Tested on an MP machine. :-)

Much thanks to HoerdtMickael <hoerdt@clarinet.u-strasbg.fr> who
reported the original bug.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[IPV6]: V6 route events reported with wrong netlink PID and seq number
Jamal Hadi Salim [Tue, 21 Jun 2005 20:51:04 +0000 (13:51 -0700)]
[IPV6]: V6 route events reported with wrong netlink PID and seq number

Essentially netlink at the moment always reports a pid and sequence of 0
always for v6 route activities.
To understand the repurcassions of this look at:
http://lists.quagga.net/pipermail/quagga-dev/2005-June/003507.html

While fixing this, i took the liberty to resolve the outstanding issue
of IPV6 routes inserted via ioctls to have the correct pids as well.

This patch tries to behave as close as possible to the v4 routes i.e
maintains whatever PID the socket issuing the command owns as opposed to
the process. That made the patch a little bulky.

I have tested against both netlink derived utility to add/del routes as
well as ioctl derived one. The Quagga folks have tested against quagga.
This fixes the problem and so far hasnt been detected to introduce any
new issues.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[IPV4]: Add LC-Trie FIB lookup algorithm.
Robert Olsson [Tue, 21 Jun 2005 19:43:18 +0000 (12:43 -0700)]
[IPV4]: Add LC-Trie FIB lookup algorithm.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
19 years ago[NETLINK]: netlink_callback structure needs 5 args not 4
Alexey Kuznetsov [Tue, 21 Jun 2005 19:38:48 +0000 (12:38 -0700)]
[NETLINK]: netlink_callback structure needs 5 args not 4

net/ipv4/tcp_diag.c uses up to ->args[4]

Signed-off-by: David S. Miller <davem@davemloft.net>
19 years agoMerge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
Linus Torvalds [Mon, 20 Jun 2005 23:00:33 +0000 (16:00 -0700)]
Merge /pub/scm/linux/kernel/git/gregkh/driver-2.6

19 years agoMerge rsync://rsync.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
Linus Torvalds [Mon, 20 Jun 2005 22:58:58 +0000 (15:58 -0700)]
Merge /pub/scm/linux/kernel/git/davem/net-2.6

19 years ago[PATCH] PCI: fix show_modalias() function due to attribute change
Greg Kroah-Hartman [Sun, 19 Jun 2005 10:21:43 +0000 (12:21 +0200)]
[PATCH] PCI: fix show_modalias() function due to attribute change

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] USB: fix show_modalias() function due to attribute change
Greg Kroah-Hartman [Sun, 19 Jun 2005 10:21:43 +0000 (12:21 +0200)]
[PATCH] USB: fix show_modalias() function due to attribute change

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] SYSFS: fix PAGE_SIZE check
Jon Smirl [Tue, 14 Jun 2005 13:54:54 +0000 (09:54 -0400)]
[PATCH] SYSFS: fix PAGE_SIZE check

Without this change I can't set an attribute exactly PAGE_SIZE in
length. There is no need for zero termination because the interface
uses lengths.

From: Jon Smirl <jonsmirl@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver core: Don't "lose" devices on suspend on failure
Benjamin Herrenschmidt [Tue, 31 May 2005 07:08:49 +0000 (17:08 +1000)]
[PATCH] Driver core: Don't "lose" devices on suspend on failure

I think we need this patch or we might "lose" devices to the dpm_irq_off
list if a failure occurs during the suspend process.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] sysfs-iattr: set inode attributes
Maneesh Soni [Tue, 31 May 2005 05:09:52 +0000 (10:39 +0530)]
[PATCH] sysfs-iattr: set inode attributes

o Following patch sets the attributes for newly allocated inodes for sysfs
  objects. If the object has non-default attributes, inode attributes are
  set as saved in sysfs_dirent->s_iattr, pointer to struct iattr.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] sysfs-iattr: add sysfs_setattr
Maneesh Soni [Tue, 31 May 2005 05:09:14 +0000 (10:39 +0530)]
[PATCH] sysfs-iattr: add sysfs_setattr

o This adds ->i_op->setattr VFS method for sysfs inodes. The changed
  attribues are saved in the persistent sysfs_dirent structure as a pointer
  to struct iattr. The struct iattr is allocated only for those sysfs_dirent's
  for which default attributes are getting changed. Thanks to Jon Smirl for
  this suggestion.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] sysfs-iattr: attach sysfs_dirent before new inode
Maneesh Soni [Tue, 31 May 2005 05:08:12 +0000 (10:38 +0530)]
[PATCH] sysfs-iattr: attach sysfs_dirent before new inode

o The following patch makes sure to attach sysfs_dirent to the dentry before
  allocation a new inode through sysfs_create(). This change is done as
  preparatory work for implementing ->i_op->setattr() functionality for
  sysfs objects.

Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] I2C: drivers/i2c/chips/adm1026.c: use dynamic sysfs callbacks
Yani Ioannou [Sun, 5 Jun 2005 08:51:46 +0000 (10:51 +0200)]
[PATCH] I2C: drivers/i2c/chips/adm1026.c: use dynamic sysfs callbacks

Finally (phew!) this patch demonstrates how to adapt the adm1026 to
take advantage of the new callbacks, and the i2c-sysfs.h defined
structure/macros. Most of the other sensor/hwmon drivers could be
updated in the same way. The odd few exceptions (bmcsensors for
example) however might be better off with their own custom attribute
structure.

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] I2C: add i2c sensor_device_attribute and macros
Yani Ioannou [Wed, 18 May 2005 02:59:05 +0000 (22:59 -0400)]
[PATCH] I2C: add i2c sensor_device_attribute and macros

This patch creates a new header with a potential standard i2c sensor
attribute type (which simply includes an int representing the sensor
number/index) and the associated macros, SENSOR_DEVICE_ATTR to define
a static attribute and to_sensor_dev_attr to get a
sensor_device_attribute reference from an embedded device_attribute
reference.

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
19 years ago[PATCH] Driver Core: include: update device attribute callbacks
Yani Ioannou [Tue, 17 May 2005 10:44:59 +0000 (06:44 -0400)]
[PATCH] Driver Core: include: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/usb/input/aiptek.c - drivers/zorro/zorro-sysfs.c: update...
Yani Ioannou [Tue, 17 May 2005 10:44:04 +0000 (06:44 -0400)]
[PATCH] Driver Core: drivers/usb/input/aiptek.c - drivers/zorro/zorro-sysfs.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/s390/net/qeth_sys.c - drivers/usb/gadget/pxa2xx_udc...
Yani Ioannou [Tue, 17 May 2005 10:43:37 +0000 (06:43 -0400)]
[PATCH] Driver Core: drivers/s390/net/qeth_sys.c - drivers/usb/gadget/pxa2xx_udc.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/char/raw3270.c - drivers/net/netiucv.c: update device...
Yani Ioannou [Tue, 17 May 2005 10:43:27 +0000 (06:43 -0400)]
[PATCH] Driver Core: drivers/char/raw3270.c - drivers/net/netiucv.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/i2c/chips/w83781d.c - drivers/s390/block/dcssblk.c:...
Yani Ioannou [Tue, 17 May 2005 10:42:58 +0000 (06:42 -0400)]
[PATCH] Driver Core: drivers/i2c/chips/w83781d.c - drivers/s390/block/dcssblk.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/i2c/chips/pc87360.c - w83627hf.c: update device attribut...
Yani Ioannou [Tue, 17 May 2005 10:42:25 +0000 (06:42 -0400)]
[PATCH] Driver Core: drivers/i2c/chips/pc87360.c - w83627hf.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/i2c/chips/lm77.c - max1619.c: update device attribute...
Yani Ioannou [Tue, 17 May 2005 10:42:03 +0000 (06:42 -0400)]
[PATCH] Driver Core: drivers/i2c/chips/lm77.c - max1619.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/i2c/chips/adm1031.c - lm75.c: update device attribute...
Yani Ioannou [Tue, 17 May 2005 10:41:35 +0000 (06:41 -0400)]
[PATCH] Driver Core: drivers/i2c/chips/adm1031.c - lm75.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: drivers/base - drivers/i2c/chips/adm1026.c: update device attrib...
Yani Ioannou [Tue, 17 May 2005 10:41:12 +0000 (06:41 -0400)]
[PATCH] Driver Core: drivers/base - drivers/i2c/chips/adm1026.c: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: arch: update device attribute callbacks
Yani Ioannou [Tue, 17 May 2005 10:40:51 +0000 (06:40 -0400)]
[PATCH] Driver Core: arch: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver core: Documentation: update device attribute callbacks
Yani Ioannou [Tue, 17 May 2005 10:40:28 +0000 (06:40 -0400)]
[PATCH] Driver core: Documentation: update device attribute callbacks

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver core: change device_attribute callbacks
Yani Ioannou [Tue, 17 May 2005 10:39:34 +0000 (06:39 -0400)]
[PATCH] Driver core: change device_attribute callbacks

This patch adds the device_attribute paramerter to the
device_attribute store and show sysfs callback functions, and passes a
reference to the attribute when the callbacks are called.

Signed-off-by: Yani Ioannou <yani.ioannou@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] driver core: fix error handling in bus_add_device
Hannes Reinecke [Wed, 18 May 2005 08:42:23 +0000 (10:42 +0200)]
[PATCH] driver core: fix error handling in bus_add_device

The error handling in bus_add_device() and device_attach() is simply
non-existing. This patch propagates any error from device_attach to
the upper layers to allow for a proper recovery.

From: Hannes Reinecke <hare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] libfs: add simple attribute files
Arnd Bergmann [Wed, 18 May 2005 12:40:59 +0000 (14:40 +0200)]
[PATCH] libfs: add simple attribute files

Based on the discussion about spufs attributes, this is my suggestion
for a more generic attribute file support that can be used by both
debugfs and spufs.

Simple attribute files behave similarly to sequential files from
a kernel programmers perspective in that a standard set of file
operations is provided and only an open operation needs to
be written that registers file specific get() and set() functions.

These operations are defined as

void foo_set(void *data, u64 val); and
u64 foo_get(void *data);

where data is the inode->u.generic_ip pointer of the file and the
operations just need to make send of that pointer. The infrastructure
makes sure this works correctly with concurrent access and partial
read calls.

A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify
using the attributes.

This patch already contains the changes for debugfs to use attributes
for its internal file operations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: driver model doc update
David Brownell [Tue, 17 May 2005 00:19:55 +0000 (17:19 -0700)]
[PATCH] Driver Core: driver model doc update

This updates some driver data documentation:

 - removes references to some fields that haven't been there for a
   long time now, e.g. pre-kobject or even older;

 - giving more information about the probe() method;

 - adding an example of how platform_data is used

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver core: unregister_node() for hotplug use
Keiichiro Tokunaga [Sun, 8 May 2005 12:28:53 +0000 (21:28 +0900)]
[PATCH] Driver core: unregister_node() for hotplug use

This adds a generic function 'unregister_node()'.
It is used to remove objects of a node going away
for hotplug.  All the devices on the node must be
unregistered before calling this function.

Signed-off-by: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff -puN drivers/base/node.c~numa_hp_base drivers/base/node.c

19 years ago[PATCH] usbcore: Don't call device_release_driver recursively
Alan Stern [Fri, 6 May 2005 19:41:08 +0000 (15:41 -0400)]
[PATCH] usbcore: Don't call device_release_driver recursively

This patch fixes usb_driver_release_interface() to make it avoid calling
device_release_driver() recursively, i.e., when invoked from within the
disconnect routine for the same device.  The patch applies to your
"driver" tree.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] driver core: Fix races in driver_detach()
Alan Stern [Fri, 6 May 2005 19:38:33 +0000 (15:38 -0400)]
[PATCH] driver core: Fix races in driver_detach()

This patch is intended for your "driver" tree.  It fixes several subtle
races in driver_detach() and device_release_driver() in the driver-model
core.

The major change is to use klist_remove() rather than klist_del() when
taking a device off its driver's list.  There's no other way to guarantee
that the list pointers will be updated before some other driver binds to
the device.  For this to work driver_detach() can't use a klist iterator,
so the loop over the devices must be written out in full.  In addition the
patch protects against the possibility that, when a driver and a device
are unregistered at the same time, one may be unloaded from memory before
the other is finished using it.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] sn: fixes due to driver core changes
Patrick Mochel [Fri, 29 Apr 2005 00:11:52 +0000 (17:11 -0700)]
[PATCH] sn: fixes due to driver core changes

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] usb: klist_node_attached() fix
Patrick Mochel [Mon, 20 Jun 2005 22:15:28 +0000 (15:15 -0700)]
[PATCH] usb: klist_node_attached() fix

The original code looks like this:

        /* if interface was already added, bind now; else let
         * the future device_add() bind it, bypassing probe()
         */
        if (!list_empty (&dev->bus_list))
                device_bind_driver(dev);

IOW, it's checking to see if the device is attached to the bus or not
and binding the driver if it is. It's checking the device's bus list,
which will only appear empty when the device has been initialized, but
not added. It depends way too much on the driver model internals, but it
seems to be the only way to do the weird crap they want to do with
interfaces.

When I converted it to use klists, I accidentally inverted the logic,
which led to bad things happening. This patch returns the check to its
orginal value.

From: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Index: gregkh-2.6/drivers/usb/core/usb.c
===================================================================

19 years ago[PATCH] Fix typo in scdrv_init()
Jason Uhlenkott [Wed, 30 Mar 2005 21:19:54 +0000 (13:19 -0800)]
[PATCH] Fix typo in scdrv_init()

Fix a typo in scdrv_init() which was breaking the build for SGI sn2.

Signed-off-by: Jason Uhlenkott <jasonuhl@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Driver Core: fix bk-driver-core kills ppc64
Patrick Mochel [Wed, 6 Apr 2005 06:46:33 +0000 (23:46 -0700)]
[PATCH] Driver Core: fix bk-driver-core kills ppc64

There's no check to see if the device is already bound to a driver, which
could do bad things.  The first thing to go wrong is that it will try to match
a driver with a device already bound to one.  In some cases (it appears with
USB with drivers/usb/core/usb.c::usb_match_id()), some drivers will match a
device based on the class type, so it would be common (especially for HID
devices) to match a device that is already bound.

The fun comes when ->probe() is called, it fails, then
driver_probe_device() does this:

dev->driver = NULL;

Later on, that pointer could be be dereferenced without checking and cause
hell to break loose.

This problem could be nasty. It's very hardware dependent, since some
devices could have a different set of matching qualifiers than others.

Now, I don't quite see exactly where/how you were getting that crash.
You're dereferencing bad memory, but I'm not sure which pointer was bad
and where it came from, but it could have come from a couple of different
places.

The patch below will hopefully fix it all up for you. It's against
2.6.12-rc2-mm1, and does the following:

- Move logic to driver_probe_device() and comments uncommon returns:
  1 - If device is bound
  0 - If device not bound, and no error
  error - If there was an error.

- Move locking to caller of that function, since we want to lock a
  device for the entire time we're trying to bind it to a driver (to
  prevent against a driver being loaded at the same time).

- Update __device_attach() and __driver_attach() to do that locking.

- Check if device is already bound in __driver_attach()

- Update the converse device_release_driver() so it locks the device
  around all of the operations.

- Mark driver_probe_device() as static and remove export. It's an
  internal function, it should stay that way, and there are no other
  callers. If there is ever a need to export it, we can audit it as
  necessary.

Signed-off-by: Andrew Morton <akpm@osdl.org>
19 years ago[PATCH] Driver core: Fix up the driver and device iterators to be quieter
gregkh@suse.de [Thu, 31 Mar 2005 20:53:00 +0000 (12:53 -0800)]
[PATCH] Driver core: Fix up the driver and device iterators to be quieter

Also stops looping over the lists when a match is found.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de
19 years ago[PATCH] use device_for_each_child() to properly access child devices.
long [Tue, 29 Mar 2005 21:36:43 +0000 (13:36 -0800)]
[PATCH] use device_for_each_child() to properly access child devices.

On Friday, March 25, 2005 8:47 PM Greg KH wrote:
>Here's a fix for pci express.  For some reason I don't think they are
>using the driver model properly here, but I could be wrong...

Thanks for making the changes. However, changes in functions:
void pcie_port_device_remove(struct pci_dev *dev) and
static int remove_iter(struct device *dev, void *data)
are not correct. Please use the patch, which is based on kernel
2.6.12-rc1, below for a fix for these.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Use device_for_each_child() to unregister devices in nodemgr_remove_host_dev()
gregkh@suse.de [Fri, 25 Mar 2005 19:45:31 +0000 (11:45 -0800)]
[PATCH] Use device_for_each_child() to unregister devices in nodemgr_remove_host_dev()

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff -Nru a/drivers/ieee1394/nodemgr.c b/drivers/ieee1394/nodemgr.c

19 years ago[PATCH] USB: fix build warning in usb core as pointed out by Andrew.
gregkh@suse.de [Thu, 24 Mar 2005 08:44:28 +0000 (00:44 -0800)]
[PATCH] USB: fix build warning in usb core as pointed out by Andrew.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Index: gregkh-2.6/drivers/usb/core/usb.c
===================================================================

19 years ago[PATCH] driver core: change export symbol for driver_for_each_device()
gregkh@suse.de [Tue, 22 Mar 2005 20:17:13 +0000 (12:17 -0800)]
[PATCH] driver core: change export symbol for driver_for_each_device()

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Index: linux-2.6.12-rc2/drivers/base/driver.c
===================================================================

19 years ago[PATCH] Fix up bogus comment.
mochel@digitalimplant.org [Fri, 25 Mar 2005 04:08:04 +0000 (20:08 -0800)]
[PATCH] Fix up bogus comment.

Signed-off-by: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
diff -Nru a/drivers/base/driver.c b/drivers/base/driver.c

19 years ago[PATCH] Use a klist for device child lists.
mochel@digitalimplant.org [Fri, 25 Mar 2005 03:08:30 +0000 (19:08 -0800)]
[PATCH] Use a klist for device child lists.

- Use klist iterator in device_for_each_child(), making it safe to use for
  removing devices.
- Remove unused list_to_dev() function.
- Kills all usage of devices_subsys.rwsem.

Signed-off-by: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] use device_for_each_child() to properly access child devices.
gregkh@suse.de [Fri, 25 Mar 2005 23:52:00 +0000 (15:52 -0800)]
[PATCH] use device_for_each_child() to properly access child devices.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
19 years ago[PATCH] Use device_for_each_child() to unregister devices in scsi_remove_target().
mochel@digitalimplant.org [Fri, 25 Mar 2005 03:03:59 +0000 (19:03 -0800)]
[PATCH] Use device_for_each_child() to unregister devices in scsi_remove_target().

Signed-off-by: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Index: gregkh-2.6/drivers/scsi/scsi_sysfs.c
===================================================================