lib/list_sort: optimize number of calls to comparison function
CONFIG_RETPOLINE has severely degraded indirect function call
performance, so it's worth putting some effort into reducing the number
of times cmp() is called.
This patch avoids badly unbalanced merges on unlucky input sizes. It
slightly increases the code size, but saves an average of 0.2*n calls to
cmp().
x86-64 code size 739 -> 803 bytes (+64)
Unfortunately, there's not a lot of low-hanging fruit in a merge sort;
it already performs only n*log2(n) - K*n + O(1) compares. The leading
coefficient is already at the theoretical limit (log2(n!) corresponds to
K=1.4427), so we're fighting over the linear term, and the best
mergesort can do is K=1.2645, achieved when n is a power of 2.
The differences between mergesort variants appear when n is *not* a
power of 2; K is a function of the fractional part of log2(n). Top-down
mergesort does best of all, achieving a minimum K=1.2408, and an average
(over all sizes) K=1.248. However, that requires knowing the number of
entries to be sorted ahead of time, and making a full pass over the
input to count it conflicts with a second performance goal, which is
cache blocking.
Obviously, we have to read the entire list into L1 cache at some point,
and performance is best if it fits. But if it doesn't fit, each full
pass over the input causes a cache miss per element, which is
undesirable.
While textbooks explain bottom-up mergesort as a succession of merging
passes, practical implementations do merging in depth-first order: as
soon as two lists of the same size are available, they are merged. This
allows as many merge passes as possible to fit into L1; only the final
few merges force cache misses.
This cache-friendly depth-first merge order depends on us merging the
beginning of the input as much as possible before we've even seen the
end of the input (and thus know its size).
The simple eager merge pattern causes bad performance when n is just
over a power of 2. If n=1028, the final merge is between 1024- and
4-element lists, which is wasteful of comparisons. (This is actually
worse on average than n=1025, because a 1204:1 merge will, on average,
end after 512 compares, while 1024:4 will walk 4/5 of the list.)
Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
and has an average (over all sizes) K of around 1. (My experiments show
K=1.01, while theory predicts K=0.965.)
There are "worst-case optimal" variants of bottom-up mergesort which
avoid this bad performance, but the algorithms given in the literature,
such as queue-mergesort and boustrodephonic mergesort, depend on the
breadth-first multi-pass structure that we are trying to avoid.
This implementation is as eager as possible while ensuring that all
merge passes are at worst 1:2 unbalanced. This achieves the same
average K=1.207 as queue-mergesort, which is 0.2*n better then
bottom-up, and only 0.04*n behind top-down mergesort.
Specifically, defers merging two lists of size 2^k until it is known
that there are 2^k additional inputs following. This ensures that the
final uneven merges triggered by reaching the end of the input will be
at worst 2:1. This will avoid cache misses as long as 3*2^k elements
fit into the cache.
(I confess to being more than a little bit proud of how clean this code
turned out. It took a lot of thinking, but the resultant inner loop is
very simple and efficient.)
Refs:
Bottom-up Mergesort: A Detailed Analysis
Wolfgang Panny, Helmut Prodinger
Algorithmica 14(4):340--354, October 1995
https://doi.org/10.1007/
BF01294131
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260
The cost distribution of queue-mergesort, optimal mergesorts, and
power-of-two rules
Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
Journal of Algorithms 30(2); Pages 423--448, February 1999
https://doi.org/10.1006/jagm.1998.0986
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380
Queue-Mergesort
Mordecai J. Golin, Robert Sedgewick
Information Processing Letters, 48(5):253--259, 10 December 1993
https://doi.org/10.1016/0020-0190(93)90088-q
https://sci-hub.tw/10.1016/0020-0190(93)90088-Q
Feedback from Rasmus Villemoes <linux@rasmusvillemoes.dk>.
Link: http://lkml.kernel.org/r/fd560853cc4dca0d0f02184ffa888b4c1be89abc.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>