When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.
This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.
Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.
I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
1) optimal, as baseline;
2) degraded;
3) degraded with this patch.
Kernel version is 4.0-rc3.
Each individual test I only did once so there might be some variations,
but we just focus on big trend.
Sequential Read:
bs=(KiB) optimal(MiB/s) degraded(MiB/s) degraded-with-patch (MiB/s)
1024 1608 656 995
512 1624 710 956
256 1635 728 980
128 1636 771 983
64 1612 1119 1000
32 1580 1420 1004
16 1368 688 986
8 768 647 953
4 411 413 850
Random Read:
bs=(KiB) optimal(IOPS) degraded(IOPS) degraded-with-patch (IOPS)
1024 163 160 156
512 274 273 272
256 426 428 424
128 576 592 591
64 726 724 726
32 849 848 837
16 900 970 971
8 927 940 929
4 948 940 955
Some notes:
* In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
* In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
* In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.
If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.
Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
unsigned int chunk_sectors = mddev->chunk_sectors;
unsigned int bio_sectors = bvm->bi_size >> 9;
- if ((bvm->bi_rw & 1) == WRITE)
- return biovec->bv_len; /* always allow writes to be mergeable */
+ /*
+ * always allow writes to be mergeable, read as well if array
+ * is degraded as we'll go through stripe cache anyway.
+ */
+ if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
+ return biovec->bv_len;
if (mddev->new_chunk_sectors < mddev->chunk_sectors)
chunk_sectors = mddev->new_chunk_sectors;
md_write_start(mddev, bi);
- if (rw == READ &&
+ /*
+ * If array is degraded, better not do chunk aligned read because
+ * later we might have to read it again in order to reconstruct
+ * data on failed drives.
+ */
+ if (rw == READ && mddev->degraded == 0 &&
mddev->reshape_position == MaxSector &&
chunk_aligned_read(mddev,bi))
return;