compat: backport radix-tree bit optimized iterator
This backports Konstantin's radix-tree bit-optimized iterator,
78c1d7848, added as of v3.4. It will be used by new drivers in
the future. The bit-optimized iterator relies on the stack
optimization introduced via e2bdb933, added as of v3.3.
Backporting the bit-optimized iterator to the old v3.2
radix_tree_node would require adjusting it; that is not
implemented here. For v3.3 kernels we can backport the
bit-optimized iterator given that radix_tree_node did not change
between v3.3 and v3.4; in fact it has not changed even up to
v3.10. The backport relies on the same helpers and inlines
present in v3.3-v3.10 to implement radix_tree_next_chunk().
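A quick illustration of what the new API looks like to its users
(the walk_tree() wrapper and the pr_info() payload are made up
for this sketch; the iterator macros themselves are the real
v3.4 API):

  #include <linux/kernel.h>
  #include <linux/radix-tree.h>
  #include <linux/rcupdate.h>

  static void walk_tree(struct radix_tree_root *root)
  {
          struct radix_tree_iter iter;
          void **slot;

          rcu_read_lock();
          radix_tree_for_each_slot(slot, root, &iter, 0) {
                  void *entry = radix_tree_deref_slot(slot);

                  if (!entry)
                          continue;
                  /* iter.index holds the index of the current slot */
                  pr_info("entry at index %lu\n", iter.index);
          }
          rcu_read_unlock();
  }

Every expansion of these macros eventually calls
radix_tree_next_chunk(), which is the one out-of-line helper
this backport has to provide.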
This was tested as of next-20130410.
The helper is gated behind the compat build option
CPTCFG_BACKPORT_BUILD_RADIX_HELPERS and is built only on v3.3
right now, unless someone really wants to backport 78c1d7848
support onto v3.2.
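For reference, a minimal sketch of the kernel version guard this
boils down to (the CPTCFG option above is real; the #if shown
here is only illustrative):

  #include <linux/version.h>

  #if (LINUX_VERSION_CODE >= KERNEL_VERSION(3,3,0)) && \
      (LINUX_VERSION_CODE < KERNEL_VERSION(3,4,0))
  /* Compile the radix_tree_next_chunk() backport only here:
   * v3.3 already has the radix_tree_node layout we rely on,
   * while v3.4+ ships the iterator upstream. */
  #endif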
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains 78c1d7848
v3.4-rc2~15^2~26
mcgrof@frijol ~/linux-stable (git::master)$ git describe --contains e2bdb933
v3.3-rc1~81^2~8
The ckmake report follows below, then the logs of the two
commits referenced above.
== ckmake-report.log ==
1 2.6.24 [ OK ]
2 2.6.25 [ OK ]
3 2.6.26 [ OK ]
4 2.6.27 [ OK ]
5 2.6.28 [ OK ]
6 2.6.29 [ OK ]
7 2.6.30 [ OK ]
8 2.6.31 [ OK ]
9 2.6.32 [ OK ]
10 2.6.33 [ OK ]
11 2.6.34 [ OK ]
12 2.6.35 [ OK ]
13 2.6.36 [ OK ]
14 2.6.37 [ OK ]
15 2.6.38 [ OK ]
16 2.6.39 [ OK ]
17 3.0.76 [ OK ]
18 3.1.10 [ OK ]
19 3.2.44 [ OK ]
20 3.3.8 [ OK ]
21 3.4.43 [ OK ]
22 3.5.7 [ OK ]
23 3.6.11 [ OK ]
24 3.7.10 [ OK ]
25 3.8.11 [ OK ]
26 3.9.0 [ OK ]
real 30m37.773s
user 809m37.644s
sys 126m30.806s
commit 78c1d78488a3c45685d993130c9f17102dc79a54
Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
Date: Wed Mar 28 14:42:53 2012 -0700
radix-tree: introduce bit-optimized iterator
A series of radix tree cleanups, and usage of them in the core pagecache
code.
Micro-benchmark:
lookup 14 slots (typical page-vector size)
in a radix tree where each <step>-th slot is filled and tagged
before/after - nsec per full scan through tree
* Intel Sandy Bridge i7-2620M 4Mb L3
New code always faster
* AMD Athlon 6000+ 2x1Mb L2, without L3
New code generally faster;
minor degradation (marked with "*") for huge sparse trees
* i386 on Sandy Bridge
New code faster for common cases: tagged and dense trees.
Some degradations for non-tagged lookup on sparse trees.
Ideally, an __ffs() analog for finding the first non-zero long
element in an array would help here; gcc sometimes cannot
optimize this loop correctly.
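The loop in question is essentially the linear scan below; the
helper name is invented for illustration and is not part of the
patch:

  /* hypothetical: index of the first non-zero long, or size if none */
  static inline unsigned long first_nonzero_long(const unsigned long *p,
                                                 unsigned long size)
  {
          unsigned long i;

          for (i = 0; i < size; i++)
                  if (p[i])
                          return i;
          return size;
  }

An architecture-optimized version of this, analogous to __ffs()
for a single word, is what is being wished for here.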
Numbers:
CPU: Intel Sandy Bridge i7-2620M 4Mb L3
radix-tree with 1024 slots:
tagged lookup
step 1 before 7156 after 3613
step 2 before 5399 after 2696
step 3 before 4779 after 1928
step 4 before 4456 after 1429
step 5 before 4292 after 1213
step 6 before 4183 after 1052
step 7 before 4157 after 951
step 8 before 4016 after 812
step 9 before 3952 after 851
step 10 before 3937 after 732
step 11 before 4023 after 709
step 12 before 3872 after 657
step 13 before 3892 after 633
step 14 before 3720 after 591
step 15 before 3879 after 578
step 16 before 3561 after 513
normal lookup
step 1 before 4266 after 3301
step 2 before 2695 after 2129
step 3 before 2083 after 1712
step 4 before 1801 after 1534
step 5 before 1628 after 1313
step 6 before 1551 after 1263
step 7 before 1475 after 1185
step 8 before 1432 after 1167
step 9 before 1373 after 1092
step 10 before 1339 after 1134
step 11 before 1292 after 1056
step 12 before 1319 after 1030
step 13 before 1276 after 1004
step 14 before 1256 after 987
step 15 before 1228 after 992
step 16 before 1247 after 999
radix-tree with 1024*1024*128 slots:
tagged lookup
[...]
commit e2bdb933ab8b7db71c318a4ddcf78a9fffd61ecb
Author: Hugh Dickins <hughd@google.com>
Date: Thu Jan 12 17:20:41 2012 -0800
radix_tree: take radix_tree_path off stack
Down, down in the deepest depths of GFP_NOIO page reclaim, we have
shrink_page_list() calling __remove_mapping() calling
__delete_from_swap_cache() or __delete_from_page_cache().
You would not expect those to need much stack, but in fact they call
radix_tree_delete(): which declares a 192-byte radix_tree_path array on
its stack (to record the node,offsets it visits when descending, in case
it needs to ascend to update them). And if any tag is still set [1],
that calls radix_tree_tag_clear(), which declares a further such
192-byte radix_tree_path array on the stack. (At least we have
interrupts disabled here, so won't then be pushing registers too.)
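The 192 bytes follow directly from the old on-stack path type;
roughly, assuming a 64-bit build and the v3.2-era definitions:

  struct radix_tree_path {
          struct radix_tree_node *node;   /* 8 bytes */
          int offset;                     /* 4 bytes, padded to 8 */
  };                                      /* 16 bytes per level */

  /* 64-bit index, 6-bit map shift: ceil(64 / 6) + 1 = 12 entries,
   * so 12 * 16 = 192 bytes for each such array. */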
That was probably a good choice when most users were 32-bit (array of
half the size), and adding fields to radix_tree_node would have bloated
it unnecessarily. But nowadays many are 64-bit, and each
radix_tree_node contains a struct rcu_head, which is only used when
freeing; whereas the radix_tree_path info is only used for updating the
tree (deleting, clearing tags or setting tags if tagged) when a lock
must be held, of no interest when accessing the tree locklessly.
So add a parent pointer to the radix_tree_node, in union with the
rcu_head, and remove all uses of the radix_tree_path. There would be
space in that union to save the offset when descending as before (we can
argue that a lock must already be held to exclude other users), but
recalculating it when ascending is both easy (a constant shift and a
constant mask) and uncommon, so it seems better just to do that.
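The resulting v3.3 node layout looks roughly like this (slightly
abbreviated), with the ascend recomputing the offset as
described:

  struct radix_tree_node {
          unsigned int    height;         /* height from the bottom */
          unsigned int    count;
          union {
                  struct radix_tree_node *parent; /* used while in tree */
                  struct rcu_head rcu_head;       /* used when freeing */
          };
          void __rcu      *slots[RADIX_TREE_MAP_SIZE];
          unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
  };

  /* ascending: recompute the parent offset with a constant
   * shift and mask instead of saving it on the way down */
  index >>= RADIX_TREE_MAP_SHIFT;
  offset = index & RADIX_TREE_MAP_MASK;
  node = node->parent;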
Two little optimizations: no need to decrement height when descending,
adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
tag on a node and its ancestors, it need not ascend from that node
again.
perf on the radix tree test harness reports radix_tree_insert() as 2%
slower (now having to set parent), but radix_tree_delete() 24% faster.
Surely that's an exaggeration from rtth's artificially low map shift 3,
but forcing it back to 6 still rates radix_tree_delete() 8% faster.
Signed-off-by: Luis R. Rodriguez <mcgrof@do-not-panic.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>