
[md-6.13] md: remove bitmap file support

Message ID 20241107125911.311347-1-yukuai1@huaweicloud.com (mailing list archive)
State Changes Requested
Series [md-6.13] md: remove bitmap file support

Checks

Context Check Description
mdraidci/vmtest-md-6_13-PR success PR summary
mdraidci/vmtest-md-6_13-VM_Test-0 success Logs for per-patch-testing

Commit Message

Yu Kuai Nov. 7, 2024, 12:59 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

The bitmap file has been marked as deprecated for more than a year now;
let's remove it, so that the new bitmap won't need to handle this case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/md-bitmap.c | 269 +++++------------------------------------
 drivers/md/md-bitmap.h |   1 -
 drivers/md/md.c        | 195 ++++-------------------------
 drivers/md/md.h        |  53 ++++----
 drivers/md/raid5-ppl.c |   2 +-
 drivers/md/raid5.c     |   2 +-
 6 files changed, 79 insertions(+), 443 deletions(-)

Comments

Song Liu Nov. 7, 2024, 11:41 p.m. UTC | #1
On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The bitmap file has been marked as deprecated for more than a year now;
> let's remove it, so that the new bitmap won't need to handle this case.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

What happens when an old array with bitmap file boots into a kernel
without bitmap file support?

Thanks,
Song
Yu Kuai Nov. 8, 2024, 1:03 a.m. UTC | #2
Hi,

在 2024/11/08 7:41, Song Liu 写道:
> On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The bitmap file has been marked as deprecated for more than a year now;
>> let's remove it, so that the new bitmap won't need to handle this case.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> What happens when an old array with bitmap file boots into a kernel
> without bitmap file support?

If an mdadm with bitmap file support is used, the kernel will just ignore
the bitmap, the same as having no bitmap. Perhaps it's better to leave an
error message?

And if mdadm is updated, reassembly will fail.

Thanks,
Kuai

> 
> Thanks,
> Song
> .
>
Song Liu Nov. 8, 2024, 1:28 a.m. UTC | #3
On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/08 7:41, Song Liu 写道:
> > On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> The bitmap file has been marked as deprecated for more than a year now;
> >> let's remove it, so that the new bitmap won't need to handle this case.
> >>
> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >
> > What happens when an old array with bitmap file boots into a kernel
> > without bitmap file support?
>
> If an mdadm with bitmap file support is used, the kernel will just ignore
> the bitmap, the same as having no bitmap. Perhaps it's better to leave an
> error message?

Yes, we should print some error message before assembling the array.

> And if mdadm is updated, reassembly will fail.

I think we should ship this with 6.14 (not 6.13), so that we have
more time testing different combinations of old/new mdadm
and kernel. WDYT?

Thanks,
Song
Yu Kuai Nov. 8, 2024, 1:33 a.m. UTC | #4
Hi,

在 2024/11/08 9:28, Song Liu 写道:
> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2024/11/08 7:41, Song Liu 写道:
>>> On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> The bitmap file has been marked as deprecated for more than a year now;
>>>> let's remove it, so that the new bitmap won't need to handle this case.
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>
>>> What happens when an old array with bitmap file boots into a kernel
>>> without bitmap file support?
>>
>> If an mdadm with bitmap file support is used, the kernel will just ignore
>> the bitmap, the same as having no bitmap. Perhaps it's better to leave an
>> error message?
> 
> Yes, we should print some error message before assembling the array.

OK
> 
>> And if mdadm is updated, reassembly will fail.
> 
> I think we should ship this with 6.14 (not 6.13), so that we have
> more time testing different combinations of old/new mdadm
> and kernel. WDYT?

Agreed!

Thanks,
Kuai

> 
> Thanks,
> Song
> .
>
Dragan Milivojević Nov. 8, 2024, 5:15 a.m. UTC | #5
On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:

> I think we should ship this with 6.14 (not 6.13), so that we have
> more time testing different combinations of old/new mdadm
> and kernel. WDYT?

I'm not sure if the bitmap performance fixes are already included,
but if not, please include those too. The internal bitmap kills
performance, and the external bitmap was a workaround for that issue.
Yu Kuai Nov. 8, 2024, 6:07 a.m. UTC | #6
Hi,

在 2024/11/08 13:15, Dragan Milivojević 写道:
> On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:
> 
>> I think we should ship this with 6.14 (not 6.13), so that we have
>> more time testing different combinations of old/new mdadm
>> and kernel. WDYT?
> 
> I'm not sure if the bitmap performance fixes are already included,
> but if not, please include those too. The internal bitmap kills
> performance, and the external bitmap was a workaround for that issue.

I don't think an external bitmap can work around the performance
degradation problem, because the global lock for the bitmap is the one
to blame for this; it's the same for external or internal bitmaps.

Do you know whether anyone is using an external bitmap in the real
world? And are there performance numbers? If so, we'll have to consider
keeping it until the new lockless bitmap is ready.

Thanks,
Kuai

> .
>
Dragan Milivojević Nov. 8, 2024, 10:19 p.m. UTC | #7
On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:


> I don't think an external bitmap can work around the performance
> degradation problem, because the global lock for the bitmap is the one
> to blame for this; it's the same for external or internal bitmaps.

Not according to my tests:

5 disk RAID5, 64K chunk



Test                   BW         IOPS
bitmap internal 64M    700KiB/s   174
bitmap internal 128M   702KiB/s   175
bitmap internal 512M   1142KiB/s  285
bitmap internal 1024M  40.4MiB/s  10.3k
bitmap internal 2G     66.5MiB/s  17.0k
bitmap external 64M    67.8MiB/s  17.3k
bitmap external 1024M  76.5MiB/s  19.6k
bitmap none            80.6MiB/s  20.6k
Single disk 1K         54.1MiB/s  55.4k
Single disk 4K         269MiB/s   68.8k



Full test logs with system details at: pastebin.com/raw/TK4vWjQu


>
> Do you know whether anyone is using an external bitmap in the real
> world? And are there performance numbers? If so, we'll have to consider
> keeping it until the new lockless bitmap is ready.

Well I am and it's a royal pain but there isn't much of an alternative.
Yu Kuai Nov. 9, 2024, 1:43 a.m. UTC | #8
Hi,

在 2024/11/09 6:19, Dragan Milivojević 写道:
> On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:
> 
> 
>> I don't think an external bitmap can work around the performance
>> degradation problem, because the global lock for the bitmap is the one
>> to blame for this; it's the same for external or internal bitmaps.
> 
> Not according to my tests:
> 
> 5 disk RAID5, 64K chunk
> 
> 
> 
> Test                   BW         IOPS
> bitmap internal 64M    700KiB/s   174
> bitmap internal 128M   702KiB/s   175
> bitmap internal 512M   1142KiB/s  285
> bitmap internal 1024M  40.4MiB/s  10.3k
> bitmap internal 2G     66.5MiB/s  17.0k
> bitmap external 64M    67.8MiB/s  17.3k
> bitmap external 1024M  76.5MiB/s  19.6k

This is not what I expected. Can you give the test procedure in detail,
including the test machine, how the array was created, and the test scripts?

> bitmap none            80.6MiB/s  20.6k
> Single disk 1K         54.1MiB/s  55.4k
> Single disk 4K         269MiB/s   68.8k
> 
> 
> 
> Full test logs with system details at: pastebin.com/raw/TK4vWjQu
> 
> 
>>
>> Do you know whether anyone is using an external bitmap in the real
>> world? And are there performance numbers? If so, we'll have to consider
>> keeping it until the new lockless bitmap is ready.
> 
> Well I am and it's a royal pain but there isn't much of an alternative.

The bitmap file will be removed; the way it's implemented is
problematic. If you have plans to upgrade the kernel to v6.13+, I can
keep it for now, until the new lockless bitmap is ready.

Thanks,
Kuai

> .
>
Dragan Milivojević Nov. 9, 2024, 2:15 a.m. UTC | #9
On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> This is not what I expected. Can you give the test procedure in detail,
> including the test machine, how the array was created, and the test scripts?

The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest
is in the linked pastebin. Let me know if you need more info.

BTW, do you guys do performance tests? All of the RAID levels are
practically broken performance-wise; none of them scale. Looking forward
to seeing those patches from Shushu Yi included. Does anyone know when
those will ship?

> The bitmap file will be removed; the way it's implemented is
> problematic. If you have plans to upgrade the kernel to v6.13+, I can
> keep it for now, until the new lockless bitmap is ready.

I usually use distro kernels, so no such plan for now; I just thought it
would be useful to ship both at the same time, to soften the blow for
those using external bitmaps.
Yu Kuai Nov. 11, 2024, 2:04 a.m. UTC | #10
Hi,

在 2024/11/09 10:15, Dragan Milivojević 写道:
> On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> This is not what I expected. Can you give the test procedure in detail,
>> including the test machine, how the array was created, and the test scripts?
> 
> The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest
> is in the linked pastebin. Let me know if you need more info.

Yes, as I said, please show me how you create the array and your test
script. I must know what you are testing, like single-threaded or high
concurrency. For example, your result shows bitmap none close to bitmap
external, which is impossible in our previous results. I can only guess
that you're testing single-threaded.

BTW, it'll be great if you can provide some perf results of the internal
bitmap in your case; that will show us directly where the bottleneck is.

> 
> BTW, do you guys do performance tests? All of the RAID levels are

We do, but we never test external bitmap.

+CC Paul

Hi, do you have time to add the external bitmap to our tests?

Thanks,
Kuai
Dragan Milivojević Nov. 11, 2024, 11:59 a.m. UTC | #11
On 11/11/2024 03:04, Yu Kuai wrote:

> Yes, as I said, please show me how you create the array and your test
> script. I must know what you are testing, like single-threaded or high
> concurrency. For example, your result shows bitmap none close to bitmap
> external, which is impossible in our previous results. I can only guess
> that you're testing single-threaded.

All of that is included in the previously linked pastebin.
I will include the contents of that pastebin at the end of this email if that helps.
Every test includes the mdadm create line, disk settings, md settings, the fio
command line used, the results, and the typical iostat output during the test.
I hope that is sufficient.
  
> BTW, it'll be great if you can provide some perf results of the internal
> bitmap in your case; that will show us directly where the bottleneck is.

Not right now; this server is in production, and I'm not sure if I will be
able to get it to an idle state, or to find the time, due to other projects.

>> BTW do you guys do performance tests? All of the raid levels are
> 
> We do, but we never test external bitmap.

I wasn't referring to that, more to the fact that there is a huge difference in
performance between no bitmap and a bitmap, and that RAID (even "simple" levels
like 0) does not scale with real-world workloads.

The contents of that pastebin; hopefully my email client won't mess up the formatting:


5 disk RAID5, 64K chunk

Summary

Test                   BW         IOPS
bitmap internal 64M    700KiB/s   174
bitmap internal 128M   702KiB/s   175
bitmap internal 512M   1142KiB/s  285
bitmap internal 1024M  40.4MiB/s  10.3k
bitmap internal 2G     66.5MiB/s  17.0k
bitmap external 64M    67.8MiB/s  17.3k
bitmap external 1024M  76.5MiB/s  19.6k
bitmap none            80.6MiB/s  20.6k
Single disk 1K         54.1MiB/s  55.4k
Single disk 4K         269MiB/s   68.8k




AlmaLinux release 9.4 (Seafoam Ocelot)
5.14.0-427.20.1.el9_4


nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            1460A0F9TSTJ         Dell DC NVMe CD8 U.2 960GB               0x1        122.33  GB / 960.20  GB    512   B +  0 B   2.0.0
/dev/nvme1n1          /dev/ng1n1            S6WRNJ0WA04045P      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme2n1          /dev/ng2n1            S6WRNJ0WA04048B      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme3n1          /dev/ng3n1            S6WRNJ0W810396H      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme4n1          /dev/ng4n1            S6WRNJ0W808149N      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme5n1          /dev/ng5n1            S6WRNJ0WA04043Z      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme6n1          /dev/ng6n1            PHBT909504AH016N     INTEL MEMPEK1J016GAL                     0x1         14.40  GB /  14.40  GB    512   B +  0 B   K4110420
/dev/nvme7n1          /dev/ng7n1            S6WRNJ0WA04036R      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7
/dev/nvme8n1          /dev/ng8n1            S6WRNJ0WA04050H      Samsung SSD 980 PRO with Heatsink 2TB    0x1          0.00   B /   2.00  TB    512   B +  0 B   5B2QGXA7




bitmap internal 64M
================================================================
mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=4718: Sun Jun 30 02:18:30 2024
   write: IOPS=174, BW=700KiB/s (717kB/s)(41.0MiB/60005msec); 0 zone resets
     slat (usec): min=4, max=18062, avg=11.28, stdev=176.21
     clat (usec): min=46, max=13308, avg=5700.08, stdev=1194.59
      lat (usec): min=53, max=22717, avg=5711.36, stdev=1206.03
     clat percentiles (usec):
      |  1.00th=[   51],  5.00th=[ 5800], 10.00th=[ 5800], 20.00th=[ 5866],
      | 30.00th=[ 5866], 40.00th=[ 5866], 50.00th=[ 5866], 60.00th=[ 5932],
      | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5932], 95.00th=[ 5997],
      | 99.00th=[ 6194], 99.50th=[ 8586], 99.90th=[10290], 99.95th=[13042],
      | 99.99th=[13042]
    bw (  KiB/s): min=  608, max=  752, per=100.00%, avg=700.03, stdev=20.93, samples=119
    iops        : min=  152, max=  188, avg=175.01, stdev= 5.23, samples=119
   lat (usec)   : 50=0.68%, 100=3.23%
   lat (msec)   : 10=95.99%, 20=0.10%
   cpu          : usr=0.08%, sys=0.24%, ctx=10503, majf=0, minf=8
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,10499,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=700KiB/s (717kB/s), 700KiB/s-700KiB/s (717kB/s-717kB/s), io=41.0MiB (43.0MB), run=60005-60005msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.07    0.00    0.00   99.93

Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
md127    0.00   0.00   0.00    0.00   0.00     0.00      175.47  0.69   0.00    0.00   5.68     4.00      0.00    0.00     1.00    100.00
nvme1n1  69.40  0.27   0.00    0.00   0.01     4.00      237.80  0.93   0.00    0.00   3.55     4.00      168.47  0.81     0.98    95.59
nvme2n1  69.20  0.27   0.00    0.00   0.01     4.00      237.60  0.93   0.00    0.00   3.55     4.00      168.47  0.81     0.98    95.61
nvme3n1  72.20  0.28   0.00    0.00   0.01     4.00      240.60  0.94   0.00    0.00   3.51     4.00      168.47  0.83     0.98    95.29
nvme4n1  68.07  0.27   0.00    0.00   0.02     4.00      236.53  0.92   0.00    0.00   3.57     4.00      168.47  0.81     0.98    95.65
nvme5n1  72.07  0.28   0.00    0.00   0.02     4.00      240.53  0.94   0.00    0.00   3.52     4.00      168.47  0.83     0.99    95.31


mdadm -X /dev/nvme1n1
         Filename : /dev/nvme1n1
            Magic : 6d746962
          Version : 4
             UUID : 77fa1a1b:2f0dd646:adc85c8e:985513a8
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 64 MB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 29807 bits (chunks), 1517 dirty (5.1%)


bitmap internal 128M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=128M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6283: Sun Jun 30 02:49:06 2024
   write: IOPS=175, BW=702KiB/s (719kB/s)(41.1MiB/60002msec); 0 zone resets
     slat (usec): min=8, max=18200, avg=16.06, stdev=177.21
     clat (usec): min=61, max=20048, avg=5675.78, stdev=1968.88
      lat (usec): min=74, max=22975, avg=5691.84, stdev=1976.14
     clat percentiles (usec):
      |  1.00th=[   68],  5.00th=[   73], 10.00th=[ 5866], 20.00th=[ 5932],
      | 30.00th=[ 5932], 40.00th=[ 5932], 50.00th=[ 5932], 60.00th=[ 5997],
      | 70.00th=[ 5997], 80.00th=[ 5997], 90.00th=[ 5997], 95.00th=[ 6063],
      | 99.00th=[14615], 99.50th=[15008], 99.90th=[16188], 99.95th=[16319],
      | 99.99th=[16319]
    bw (  KiB/s): min=  384, max=  816, per=99.97%, avg=702.12, stdev=72.52, samples=119
    iops        : min=   96, max=  204, avg=175.53, stdev=18.13, samples=119
   lat (usec)   : 100=7.62%, 250=0.01%
   lat (msec)   : 10=90.80%, 20=1.56%, 50=0.01%
   cpu          : usr=0.11%, sys=0.34%, ctx=10539, majf=0, minf=8
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,10534,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=702KiB/s (719kB/s), 702KiB/s-702KiB/s (719kB/s-719kB/s), io=41.1MiB (43.1MB), run=60002-60002msec





avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.08    0.00    0.00   99.92

Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
md127    0.00   0.00   0.00    0.00   0.00     0.00      173.73  0.68   0.00    0.00   5.73     4.00      0.00    0.00     1.00    99.99
nvme1n1  65.87  0.26   0.00    0.00   0.01     4.00      226.07  0.65   0.00    0.00   3.60     2.94      160.20  0.81     0.94    92.46
nvme2n1  71.33  0.28   0.00    0.00   0.02     4.00      231.53  0.67   0.00    0.00   3.50     2.96      160.27  0.84     0.95    91.79
nvme3n1  68.60  0.27   0.00    0.00   0.02     4.00      228.80  0.66   0.00    0.00   3.68     2.95      160.27  0.93     0.99    94.37
nvme4n1  68.87  0.27   0.00    0.00   0.02     4.00      229.07  0.66   0.00    0.00   3.52     2.95      160.20  0.81     0.94    91.59
nvme5n1  72.80  0.28   0.00    0.00   0.02     4.00      233.00  0.68   0.00    0.00   3.53     2.97      160.27  0.87     0.96    92.29


mdadm -X /dev/nvme1n1
         Filename : /dev/nvme1n1
            Magic : 6d746962
          Version : 4
             UUID : 93fdcd4b:ae61a1f8:4d809242:2cd4a4c7
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 128 MB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 14904 bits (chunks), 1617 dirty (10.8%)





bitmap internal 512M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=512M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5


Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1232KiB/s][w=308 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6661: Sun Jun 30 02:58:11 2024
   write: IOPS=285, BW=1142KiB/s (1169kB/s)(66.9MiB/60006msec); 0 zone resets
     slat (usec): min=4, max=18130, avg=10.80, stdev=138.54
     clat (usec): min=42, max=13261, avg=3490.08, stdev=2945.95
      lat (usec): min=50, max=22827, avg=3500.88, stdev=2949.63
     clat percentiles (usec):
      |  1.00th=[   49],  5.00th=[   51], 10.00th=[   52], 20.00th=[   55],
      | 30.00th=[   58], 40.00th=[   72], 50.00th=[ 5866], 60.00th=[ 5932],
      | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5997], 95.00th=[ 5997],
      | 99.00th=[ 6128], 99.50th=[ 8586], 99.90th=[ 9896], 99.95th=[13042],
      | 99.99th=[13042]
    bw (  KiB/s): min=  600, max= 1648, per=99.68%, avg=1138.89, stdev=188.44, samples=119
    iops        : min=  150, max=  412, avg=284.72, stdev=47.11, samples=119
   lat (usec)   : 50=3.41%, 100=38.62%, 250=0.04%, 500=0.03%
   lat (msec)   : 10=57.83%, 20=0.07%
   cpu          : usr=0.09%, sys=0.40%, ctx=17130, majf=0, minf=9
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,17127,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=1142KiB/s (1169kB/s), 1142KiB/s-1142KiB/s (1169kB/s-1169kB/s), io=66.9MiB (70.2MB), run=60006-60006msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.10    0.00    0.00   99.90

Device   r/s     rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
md127    0.00    0.00   0.00    0.00   0.00     0.00      307.13  1.20   0.00    0.00   3.24     4.00      0.00    0.00     1.00    100.00
nvme1n1  120.47  0.47   0.00    0.00   0.01     4.00      286.07  0.63   0.00    0.00   3.03     2.26      165.60  0.99     1.03    96.58
nvme2n1  123.87  0.48   0.00    0.00   0.01     4.00      289.47  0.65   0.00    0.00   3.00     2.28      165.60  1.00     1.04    96.63
nvme3n1  120.87  0.47   0.00    0.00   0.01     4.00      286.47  0.63   0.00    0.00   3.02     2.27      165.60  1.00     1.03    96.39
nvme4n1  125.00  0.49   0.00    0.00   0.02     4.00      290.60  0.65   0.00    0.00   3.00     2.29      165.60  1.02     1.04    96.54
nvme5n1  124.07  0.48   0.00    0.00   0.02     4.00      289.67  0.65   0.00    0.00   3.01     2.28      165.60  1.03     1.04    96.59


mdadm -X /dev/nvme1n1
         Filename : /dev/nvme1n1
            Magic : 6d746962
          Version : 4
             UUID : 17eadc76:a367542a:feb6e24e:d650576c
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 512 MB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 3726 bits (chunks), 1977 dirty (53.1%)


bitmap internal 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=51.0MiB/s][w=13.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7120: Sun Jun 30 03:08:12 2024
   write: IOPS=10.3k, BW=40.4MiB/s (42.4MB/s)(2425MiB/60001msec); 0 zone resets
     slat (usec): min=6, max=18135, avg= 8.93, stdev=23.41
     clat (usec): min=3, max=10459, avg=86.97, stdev=342.95
      lat (usec): min=63, max=22927, avg=95.90, stdev=344.33
     clat percentiles (usec):
      |  1.00th=[   62],  5.00th=[   63], 10.00th=[   64], 20.00th=[   65],
      | 30.00th=[   65], 40.00th=[   66], 50.00th=[   67], 60.00th=[   67],
      | 70.00th=[   68], 80.00th=[   69], 90.00th=[   70], 95.00th=[   74],
      | 99.00th=[  133], 99.50th=[  155], 99.90th=[ 5997], 99.95th=[ 5997],
      | 99.99th=[ 6063]
    bw (  KiB/s): min=  616, max=52968, per=99.80%, avg=41305.95, stdev=20465.79, samples=119
    iops        : min=  154, max=13242, avg=10326.47, stdev=5116.44, samples=119
   lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=98.64%, 250=1.00%
   lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.33%, 20=0.01%
   cpu          : usr=1.89%, sys=12.74%, ctx=620837, majf=0, minf=170751
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,620822,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=2425MiB (2543MB), run=60001-60001msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    1.27    0.00    0.00   98.70

Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
md127    0.00     0.00   0.00    0.00   0.00     0.00      18216.93  71.16  0.00    0.00   0.05     4.00      0.00  0.00     0.88    100.00
nvme1n1  7256.20  28.34  0.00    0.00   0.01     4.00      7256.27   28.34  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.71
nvme2n1  7302.53  28.53  0.00    0.00   0.01     4.00      7302.53   28.53  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.73
nvme3n1  7278.47  28.43  0.00    0.00   0.01     4.00      7278.53   28.43  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.57
nvme4n1  7303.93  28.53  0.00    0.00   0.01     4.00      7303.93   28.53  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.74
nvme5n1  7292.67  28.49  0.00    0.00   0.02     4.00      7292.60   28.49  0.00    0.00   0.02     4.00      0.00  0.00     0.22    99.69



mdadm -X /dev/nvme1n1
         Filename : /dev/nvme1n1
            Magic : 6d746962
          Version : 4
             UUID : a0c7ad14:50689e41:e065a166:4935a186
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 1 GB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)


bitmap internal 2G
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=2G /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=74.7MiB/s][w=19.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7696: Sun Jun 30 03:30:40 2024
   write: IOPS=17.0k, BW=66.5MiB/s (69.8MB/s)(3993MiB/60001msec); 0 zone resets
     slat (usec): min=2, max=18094, avg= 4.79, stdev=17.94
     clat (usec): min=5, max=10352, avg=53.37, stdev=181.29
      lat (usec): min=41, max=22883, avg=58.16, stdev=182.72
     clat percentiles (usec):
      |  1.00th=[   43],  5.00th=[   44], 10.00th=[   45], 20.00th=[   46],
      | 30.00th=[   46], 40.00th=[   47], 50.00th=[   47], 60.00th=[   48],
      | 70.00th=[   48], 80.00th=[   49], 90.00th=[   50], 95.00th=[   52],
      | 99.00th=[   90], 99.50th=[  126], 99.90th=[  873], 99.95th=[ 5997],
      | 99.99th=[ 6063]
    bw (  KiB/s): min=  640, max=80168, per=99.91%, avg=68080.94, stdev=21547.29, samples=119
    iops        : min=  160, max=20042, avg=17020.24, stdev=5386.82, samples=119
   lat (usec)   : 10=0.01%, 50=92.06%, 100=7.10%, 250=0.73%, 500=0.01%
   lat (usec)   : 750=0.01%, 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01%
   cpu          : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    1.15    0.00    0.00   98.81

Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
md127    0.00     0.00   0.00    0.00   0.00     0.00      18836.40  73.58  0.00    0.00   0.05     4.00      0.00  0.00     0.87    100.00
nvme1n1  7505.27  29.32  0.00    0.00   0.01     4.00      7505.40   29.32  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.93
nvme2n1  7510.00  29.34  0.00    0.00   0.01     4.00      7510.07   29.34  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.90
nvme3n1  7561.40  29.54  0.00    0.00   0.01     4.00      7561.47   29.54  0.00    0.00   0.01     4.00      0.00  0.00     0.19    100.00
nvme4n1  7543.07  29.47  0.00    0.00   0.01     4.00      7543.07   29.47  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.91
nvme5n1  7552.73  29.50  0.00    0.00   0.01     4.00      7552.80   29.50  0.00    0.00   0.01     4.00      0.00  0.00     0.22    99.91



mdadm -X /dev/nvme1n1
         Filename : /dev/nvme1n1
            Magic : 6d746962
          Version : 4
             UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 2 GB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 932 bits (chunks), 932 dirty (100.0%)
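A quick way to sanity-check the "Bitmap : N bits (chunks)" values reported by mdadm -X: the bit count is just the Sync Size divided by the bitmap chunk size (both in KiB), rounded up. A minimal sketch, using the Sync Size reported above:

```shell
# Bit count = ceil(sync_size / bitmap_chunk), both in KiB.
sync_kib=1953382464                          # "Sync Size" from mdadm -X above
for chunk_kib in 65536 1048576 2097152; do   # 64M, 1G and 2G bitmap chunks
    echo $(( (sync_kib + chunk_kib - 1) / chunk_kib ))
done
# Prints 29807, 1863 and 932 -- matching the 64M, 1G and 2G runs in this thread.
```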









bitmap external 64M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024
   write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone resets
     slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04
     clat (usec): min=3, max=8410, avg=50.79, stdev=27.03
      lat (usec): min=42, max=22140, avg=56.90, stdev=35.13
     clat percentiles (usec):
      |  1.00th=[   41],  5.00th=[   42], 10.00th=[   44], 20.00th=[   46],
      | 30.00th=[   47], 40.00th=[   48], 50.00th=[   49], 60.00th=[   50],
      | 70.00th=[   51], 80.00th=[   52], 90.00th=[   56], 95.00th=[   68],
      | 99.00th=[   93], 99.50th=[  124], 99.90th=[  155], 99.95th=[  237],
      | 99.99th=[ 1037]
    bw (  KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, stdev=7769.33, samples=119
    iops        : min= 9530, max=20644, avg=17350.76, stdev=1942.33, samples=119
   lat (usec)   : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76%
   lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
   cpu          : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    1.15    0.00    0.00   98.81

Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
md127    0.00     0.00   0.00    0.00   0.00     0.00      18428.60  71.99  0.00    0.00   0.05     4.00      0.00  0.00     0.87    99.99
nvme1n1  7399.40  28.90  0.00    0.00   0.01     4.00      7399.47   28.90  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.73
nvme2n1  7361.20  28.75  0.00    0.00   0.01     4.00      7361.27   28.75  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.63
nvme3n1  7376.67  28.82  0.00    0.00   0.01     4.00      7376.73   28.82  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.63
nvme4n1  7367.27  28.78  0.00    0.00   0.01     4.00      7367.20   28.78  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.65
nvme5n1  7352.47  28.72  0.00    0.00   0.01     4.00      7352.67   28.72  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.73
nvme8n1  0.47     0.00   0.00    0.00   0.00     4.00      293.40    1.15   0.00    0.00   0.02     4.00      0.00  0.00     0.01    24.24



mdadm -X /bitmap/bitmap.bin
         Filename : /bitmap/bitmap.bin
            Magic : 6d746962
          Version : 4
             UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 64 MB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 29807 bits (chunks), 29665 dirty (99.5%)



bitmap external 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024
   write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone resets
     slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16
     clat (usec): min=22, max=3706, avg=46.37, stdev=20.38
      lat (usec): min=40, max=21951, avg=50.49, stdev=28.81
     clat percentiles (usec):
      |  1.00th=[   40],  5.00th=[   41], 10.00th=[   42], 20.00th=[   42],
      | 30.00th=[   43], 40.00th=[   44], 50.00th=[   45], 60.00th=[   47],
      | 70.00th=[   48], 80.00th=[   49], 90.00th=[   50], 95.00th=[   52],
      | 99.00th=[   86], 99.50th=[  120], 99.90th=[  157], 99.95th=[  233],
      | 99.99th=[  906]
    bw (  KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, stdev=5410.81, samples=119
    iops        : min=15404, max=21182, avg=19599.66, stdev=1352.70, samples=119
   lat (usec)   : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01%
   lat (usec)   : 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%
   cpu          : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    1.03    0.00    0.00   98.93

Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
md127    0.00     0.00   0.00    0.00   0.00     0.00      20758.20  81.09  0.00    0.00   0.04     4.00      0.00  0.00     0.89    100.00
nvme1n1  8291.67  32.39  0.00    0.00   0.01     4.00      8291.73   32.39  0.00    0.00   0.01     4.00      0.00  0.00     0.22    99.87
nvme2n1  8270.93  32.31  0.00    0.00   0.01     4.00      8271.07   32.31  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.79
nvme3n1  8310.67  32.46  0.00    0.00   0.01     4.00      8310.80   32.46  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.83
nvme4n1  8300.67  32.42  0.00    0.00   0.01     4.00      8300.67   32.42  0.00    0.00   0.01     4.00      0.00  0.00     0.23    99.76
nvme5n1  8342.13  32.59  0.00    0.00   0.02     4.00      8342.13   32.59  0.00    0.00   0.01     4.00      0.00  0.00     0.25    99.85
nvme8n1  0.33     0.00   0.00    0.00   8.40     4.00      0.00      0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.33


mdadm -X /bitmap/bitmap.bin
         Filename : /bitmap/bitmap.bin
            Magic : 6d746962
          Version : 4
             UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee
           Events : 3
   Events Cleared : 3
            State : OK
        Chunksize : 1 GB
           Daemon : 5s flush period
       Write Mode : Normal
        Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
           Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)



bitmap none
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size


fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024
   write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone resets
     slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46
     clat (usec): min=4, max=3694, avg=44.31, stdev=21.60
      lat (usec): min=39, max=13681, avg=47.81, stdev=24.98
     clat percentiles (usec):
      |  1.00th=[   39],  5.00th=[   40], 10.00th=[   41], 20.00th=[   41],
      | 30.00th=[   42], 40.00th=[   43], 50.00th=[   43], 60.00th=[   44],
      | 70.00th=[   45], 80.00th=[   46], 90.00th=[   48], 95.00th=[   50],
      | 99.00th=[   87], 99.50th=[  117], 99.90th=[  157], 99.95th=[  229],
      | 99.99th=[  963]
    bw (  KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, stdev=3696.94, samples=119
    iops        : min=18528, max=21678, avg=20621.59, stdev=924.23, samples=119
   lat (usec)   : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01%
   lat (usec)   : 750=0.01%, 1000=0.01%
   lat (msec)   : 2=0.01%, 4=0.01%
   cpu          : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec




avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    1.06    0.00    0.00   98.91

Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
md127    0.00     0.00   0.00    0.00   0.00     0.00      20040.87  78.28  0.00    0.00   0.04     4.00      0.00  0.00     0.89    99.99
nvme1n1  8016.80  31.32  0.00    0.00   0.01     4.00      8016.93   31.32  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.68
nvme2n1  7983.20  31.18  0.00    0.00   0.01     4.00      7983.20   31.18  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.74
nvme3n1  8030.07  31.37  0.00    0.00   0.01     4.00      8030.20   31.37  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.62
nvme4n1  8016.40  31.31  0.00    0.00   0.01     4.00      8016.40   31.31  0.00    0.00   0.01     4.00      0.00  0.00     0.23    99.73
nvme5n1  8034.87  31.39  0.00    0.00   0.02     4.00      8035.00   31.39  0.00    0.00   0.01     4.00      0.00  0.00     0.24    99.71
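As a side note on the "bitmap none" numbers above: a sub-stripe 4k random write on RAID5 is a read-modify-write (read old data and old parity, then write new data and new parity), i.e. four member-device I/Os per logical write. A rough check against the iostat output above (a sketch, using the reported md127 w/s and per-member r/s + w/s):

```shell
# RAID5 sub-stripe writes are read-modify-write: 4 member I/Os per logical write.
logical_iops=20040                                         # md127 w/s from iostat above
member_iops=$(( (8016 + 7983 + 8030 + 8016 + 8035) * 2 ))  # r/s + w/s over 5 members
echo $(( member_iops / logical_iops ))                     # Prints 4: consistent with RMW
```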




single disk 1K RW
================================================================
fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single
Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024
   write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone resets
     slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12
     clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57
      lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79
     clat percentiles (usec):
      |  1.00th=[   15],  5.00th=[   15], 10.00th=[   15], 20.00th=[   15],
      | 30.00th=[   15], 40.00th=[   15], 50.00th=[   15], 60.00th=[   15],
      | 70.00th=[   15], 80.00th=[   16], 90.00th=[   16], 95.00th=[   16],
      | 99.00th=[   18], 99.50th=[   22], 99.90th=[   32], 99.95th=[   33],
      | 99.99th=[  206]
    bw (  KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, stdev=561.60, samples=119
    iops        : min=51884, max=56778, avg=55394.44, stdev=561.62, samples=119
   lat (nsec)   : 750=0.01%, 1000=0.01%
   lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54%
   lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%
   lat (msec)   : 10=0.01%
   cpu          : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec

Disk stats (read/write):
   nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, util=99.71%


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.04    0.00    0.42    0.00    0.00   99.54

Device   r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
nvme8n1  0.00  0.00   0.00    0.00   0.00     0.00      55496.93  54.20  0.00    0.00   0.01     1.00      0.00  0.00     0.75    100.00





single disk 4K RW
================================================================
blockdev --setra 256 /dev/nvme8n1

fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single

Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024
   write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone resets
     slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59
     clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99
      lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36
     clat percentiles (usec):
      |  1.00th=[   12],  5.00th=[   12], 10.00th=[   12], 20.00th=[   12],
      | 30.00th=[   12], 40.00th=[   12], 50.00th=[   12], 60.00th=[   12],
      | 70.00th=[   12], 80.00th=[   12], 90.00th=[   12], 95.00th=[   12],
      | 99.00th=[   14], 99.50th=[   17], 99.90th=[   28], 99.95th=[   34],
      | 99.99th=[  204]
    bw (  KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, stdev=1902.45, samples=119
    iops        : min=66018, max=69392, avg=68907.43, stdev=475.55, samples=119
   lat (nsec)   : 750=0.01%, 1000=0.01%
   lat (usec)   : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38%
   lat (usec)   : 100=0.01%, 250=0.02%, 1000=0.01%
   lat (msec)   : 10=0.01%
   cpu          : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204
   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0
      latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), io=15.8GiB (16.9GB), run=60001-60001msec

Disk stats (read/write):
   nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, util=99.89%


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.08    0.00    0.57    0.00    0.00   99.35

Device   r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s   wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
nvme8n1  0.00  0.00   0.00    0.00   0.00     0.00      69041.33  269.69  0.00    0.00   0.01     4.00      0.00  0.00     0.68    100.00
Yu Kuai Nov. 11, 2024, 1:02 p.m. UTC | #12
Hi,

在 2024/11/11 19:59, Dragan Milivojević 写道:
> On 11/11/2024 03:04, Yu Kuai wrote:
> 
>> Yes, as I said, please show me how you create the array and your test
>> script. I must know what you are testing, e.g. single threaded or high
>> concurrency. For example, your result shows bitmap none close to bitmap
>> external, which was impossible in our previous results. I can only guess
>> that you're testing single threaded.
> 
> All of that is included in the previously linked pastebin.

TBH, I don't know what that is. :(
> I will include the contents of that pastebin at the end of this email if 
> that helps.
> Every test includes the mdadm create line, disk settings, md settings, 
> fio test line
> used and the results and the typical iostat output during the test. I 
> hope that is
> sufficient.
> 
>> BTW, it'll be great if you can provide some perf results of the internal
>> bitmap in your case, that will show us directly where is the bottleneck.
> 
> Not right now, this server is in production and I'm not sure if I will 
> be able
> to get it to an idle state or to find the time to do it due to other 
> projects.
> 
>>> BTW do you guys do performance tests? All of the raid levels are
>>
>> We do, but we never test external bitmap.
> 
> I wasn't referring to that, more to the fact that there is a huge 
> difference in
> performance between no bitmap and bitmap or that raid (even "simple" 
> levels like 0)
> do not scale with real world workloads.

Yes, this is a known problem. The gap here is that I don't think the
external bitmap helps much, while your results disagree.

> 
> The contents of that pastebin, hopefully my email client won't mess up 
> the formating:
> 
> 
> 5 disk RAID5, 64K chunk
> 
> Summary
> 
> Test                   BW         IOPS
> bitmap internal 64M    700KiB/s   174
> bitmap internal 128M   702KiB/s   175
> bitmap internal 512M   1142KiB/s  285
> bitmap internal 1024M  40.4MiB/s  10.3k
> bitmap internal 2G     66.5MiB/s  17.0k
> bitmap external 64M    67.8MiB/s  17.3k
> bitmap external 1024M  76.5MiB/s  19.6k
> bitmap none            80.6MiB/s  20.6k
> Single disk 1K         54.1MiB/s  55.4k
> Single disk 4K         269MiB/s   68.8k
> 
> 
> 
> 
> AlmaLinux release 9.4 (Seafoam Ocelot)
> 5.14.0-427.20.1.el9_4
> 
> 
> nvme list
> Node           Generic      SN                Model                                   Namespace  Usage                    Format        FW Rev
> -------------  -----------  ----------------  --------------------------------------  ---------  -----------------------  ------------  --------
> /dev/nvme0n1   /dev/ng0n1   1460A0F9TSTJ      Dell DC NVMe CD8 U.2 960GB              0x1        122.33 GB / 960.20 GB    512 B + 0 B   2.0.0
> /dev/nvme1n1   /dev/ng1n1   S6WRNJ0WA04045P   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme2n1   /dev/ng2n1   S6WRNJ0WA04048B   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme3n1   /dev/ng3n1   S6WRNJ0W810396H   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme4n1   /dev/ng4n1   S6WRNJ0W808149N   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme5n1   /dev/ng5n1   S6WRNJ0WA04043Z   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme6n1   /dev/ng6n1   PHBT909504AH016N  INTEL MEMPEK1J016GAL                    0x1        14.40 GB / 14.40 GB      512 B + 0 B   K4110420
> /dev/nvme7n1   /dev/ng7n1   S6WRNJ0WA04036R   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> /dev/nvme8n1   /dev/ng8n1   S6WRNJ0WA04050H   Samsung SSD 980 PRO with Heatsink 2TB   0x1        0.00 B / 2.00 TB         512 B + 0 B   5B2QGXA7
> 
> 
> 
> 
> bitmap internal 64M
> ================================================================
> mdadm --verbose --create --assume-clean --bitmap=internal 
> --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 

The array setup is fine. The following external-bitmap tests use
/bitmap/bitmap.bin; is the backing storage of that file the same kind of
device as the tested drives?

> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5

Then this is what I suspected: the above test is quite limited and can't
stand in for a real-world workload, being a single thread at iodepth=1
doing 4k random writes.

I still can't believe your test results, and I can't figure out why the
internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
raid5 array and ran the same fio script; the results are quite different
from yours:

ram0:			981MiB/s
non-bitmap:		132MiB/s
internal-bitmap:	95.5MiB/s
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=4718: Sun Jun 30 02:18:30 2024
>    write: IOPS=174, BW=700KiB/s (717kB/s)(41.0MiB/60005msec); 0 zone resets
>      slat (usec): min=4, max=18062, avg=11.28, stdev=176.21
>      clat (usec): min=46, max=13308, avg=5700.08, stdev=1194.59
>       lat (usec): min=53, max=22717, avg=5711.36, stdev=1206.03
>      clat percentiles (usec):
>       |  1.00th=[   51],  5.00th=[ 5800], 10.00th=[ 5800], 20.00th=[ 5866],
>       | 30.00th=[ 5866], 40.00th=[ 5866], 50.00th=[ 5866], 60.00th=[ 5932],
>       | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5932], 95.00th=[ 5997],
>       | 99.00th=[ 6194], 99.50th=[ 8586], 99.90th=[10290], 99.95th=[13042],
>       | 99.99th=[13042]
>     bw (  KiB/s): min=  608, max=  752, per=100.00%, avg=700.03, stdev=20.93, samples=119

There is definitely something wrong here; it doesn't make sense to me
that the internal bitmap is so slow. However, I won't have any idea what
it is until you can provide the perf results.

Thanks,
Kuai

>     iops        : min=  152, max=  188, avg=175.01, stdev= 5.23, samples=119
>    lat (usec)   : 50=0.68%, 100=3.23%
>    lat (msec)   : 10=95.99%, 20=0.10%
>    cpu          : usr=0.08%, sys=0.24%, ctx=10503, majf=0, minf=8
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       issued rwts: total=0,10499,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=700KiB/s (717kB/s), 700KiB/s-700KiB/s (717kB/s-717kB/s), io=41.0MiB (43.0MB), run=60005-60005msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.00    0.00    0.07    0.00    0.00   99.93
> 
> Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
> md127    0.00   0.00   0.00    0.00   0.00     0.00      175.47  0.69   0.00    0.00   5.68     4.00      0.00    0.00     1.00    100.00
> nvme1n1  69.40  0.27   0.00    0.00   0.01     4.00      237.80  0.93   0.00    0.00   3.55     4.00      168.47  0.81     0.98    95.59
> nvme2n1  69.20  0.27   0.00    0.00   0.01     4.00      237.60  0.93   0.00    0.00   3.55     4.00      168.47  0.81     0.98    95.61
> nvme3n1  72.20  0.28   0.00    0.00   0.01     4.00      240.60  0.94   0.00    0.00   3.51     4.00      168.47  0.83     0.98    95.29
> nvme4n1  68.07  0.27   0.00    0.00   0.02     4.00      236.53  0.92   0.00    0.00   3.57     4.00      168.47  0.81     0.98    95.65
> nvme5n1  72.07  0.28   0.00    0.00   0.02     4.00      240.53  0.94   0.00    0.00   3.52     4.00      168.47  0.83     0.99    95.31
> 
> 
> mdadm -X /dev/nvme1n1
>          Filename : /dev/nvme1n1
>             Magic : 6d746962
>           Version : 4
>              UUID : 77fa1a1b:2f0dd646:adc85c8e:985513a8
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 64 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 29807 bits (chunks), 1517 dirty (5.1%)
> 
> 
> bitmap internal 128M
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=internal 
> --bitmap-chunk=128M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=6283: Sun Jun 30 02:49:06 2024
>    write: IOPS=175, BW=702KiB/s (719kB/s)(41.1MiB/60002msec); 0 zone resets
>      slat (usec): min=8, max=18200, avg=16.06, stdev=177.21
>      clat (usec): min=61, max=20048, avg=5675.78, stdev=1968.88
>       lat (usec): min=74, max=22975, avg=5691.84, stdev=1976.14
>      clat percentiles (usec):
>       |  1.00th=[   68],  5.00th=[   73], 10.00th=[ 5866], 20.00th=[ 5932],
>       | 30.00th=[ 5932], 40.00th=[ 5932], 50.00th=[ 5932], 60.00th=[ 5997],
>       | 70.00th=[ 5997], 80.00th=[ 5997], 90.00th=[ 5997], 95.00th=[ 6063],
>       | 99.00th=[14615], 99.50th=[15008], 99.90th=[16188], 99.95th=[16319],
>       | 99.99th=[16319]
>     bw (  KiB/s): min=  384, max=  816, per=99.97%, avg=702.12, stdev=72.52, samples=119
>     iops        : min=   96, max=  204, avg=175.53, stdev=18.13, samples=119
>    lat (usec)   : 100=7.62%, 250=0.01%
>    lat (msec)   : 10=90.80%, 20=1.56%, 50=0.01%
>    cpu          : usr=0.11%, sys=0.34%, ctx=10539, majf=0, minf=8
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>       issued rwts: total=0,10534,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=702KiB/s (719kB/s), 702KiB/s-702KiB/s (719kB/s-719kB/s), io=41.1MiB (43.1MB), run=60002-60002msec
> 
> 
> 
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.00    0.00    0.08    0.00    0.00   99.92
> 
> Device   r/s    rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
> md127    0.00   0.00   0.00    0.00   0.00     0.00      173.73  0.68   0.00    0.00   5.73     4.00      0.00    0.00     1.00    99.99
> nvme1n1  65.87  0.26   0.00    0.00   0.01     4.00      226.07  0.65   0.00    0.00   3.60     2.94      160.20  0.81     0.94    92.46
> nvme2n1  71.33  0.28   0.00    0.00   0.02     4.00      231.53  0.67   0.00    0.00   3.50     2.96      160.27  0.84     0.95    91.79
> nvme3n1  68.60  0.27   0.00    0.00   0.02     4.00      228.80  0.66   0.00    0.00   3.68     2.95      160.27  0.93     0.99    94.37
> nvme4n1  68.87  0.27   0.00    0.00   0.02     4.00      229.07  0.66   0.00    0.00   3.52     2.95      160.20  0.81     0.94    91.59
> nvme5n1  72.80  0.28   0.00    0.00   0.02     4.00      233.00  0.68   0.00    0.00   3.53     2.97      160.27  0.87     0.96    92.29
> 
> 
> mdadm -X /dev/nvme1n1
>          Filename : /dev/nvme1n1
>             Magic : 6d746962
>           Version : 4
>              UUID : 93fdcd4b:ae61a1f8:4d809242:2cd4a4c7
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 128 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 14904 bits (chunks), 1617 dirty (10.8%)
> 
> 
> 
> 
> 
> bitmap internal 512M
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=internal 
> --bitmap-chunk=512M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=1232KiB/s][w=308 IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=6661: Sun Jun 30 02:58:11 2024
>    write: IOPS=285, BW=1142KiB/s (1169kB/s)(66.9MiB/60006msec); 0 zone 
> resets
>      slat (usec): min=4, max=18130, avg=10.80, stdev=138.54
>      clat (usec): min=42, max=13261, avg=3490.08, stdev=2945.95
>       lat (usec): min=50, max=22827, avg=3500.88, stdev=2949.63
>      clat percentiles (usec):
>       |  1.00th=[   49],  5.00th=[   51], 10.00th=[   52], 20.00th=[   55],
>       | 30.00th=[   58], 40.00th=[   72], 50.00th=[ 5866], 60.00th=[ 5932],
>       | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5997], 95.00th=[ 5997],
>       | 99.00th=[ 6128], 99.50th=[ 8586], 99.90th=[ 9896], 99.95th=[13042],
>       | 99.99th=[13042]
>     bw (  KiB/s): min=  600, max= 1648, per=99.68%, avg=1138.89, 
> stdev=188.44, samples=119
>     iops        : min=  150, max=  412, avg=284.72, stdev=47.11, 
> samples=119
>    lat (usec)   : 50=3.41%, 100=38.62%, 250=0.04%, 500=0.03%
>    lat (msec)   : 10=57.83%, 20=0.07%
>    cpu          : usr=0.09%, sys=0.40%, ctx=17130, majf=0, minf=9
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,17127,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=1142KiB/s (1169kB/s), 1142KiB/s-1142KiB/s 
> (1169kB/s-1169kB/s), io=66.9MiB (70.2MB), run=60006-60006msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.00    0.00    0.10    0.00    0.00   99.90
> 
> Device   r/s     rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s     wMB/s  
> wrqm/s  %wrqm  w_await  wareq-sz  f/s     f_await  aqu-sz  %util
> md127    0.00    0.00   0.00    0.00   0.00     0.00      307.13  1.20   
> 0.00    0.00   3.24     4.00      0.00    0.00     1.00    100.00
> nvme1n1  120.47  0.47   0.00    0.00   0.01     4.00      286.07  0.63   
> 0.00    0.00   3.03     2.26      165.60  0.99     1.03    96.58
> nvme2n1  123.87  0.48   0.00    0.00   0.01     4.00      289.47  0.65   
> 0.00    0.00   3.00     2.28      165.60  1.00     1.04    96.63
> nvme3n1  120.87  0.47   0.00    0.00   0.01     4.00      286.47  0.63   
> 0.00    0.00   3.02     2.27      165.60  1.00     1.03    96.39
> nvme4n1  125.00  0.49   0.00    0.00   0.02     4.00      290.60  0.65   
> 0.00    0.00   3.00     2.29      165.60  1.02     1.04    96.54
> nvme5n1  124.07  0.48   0.00    0.00   0.02     4.00      289.67  0.65   
> 0.00    0.00   3.01     2.28      165.60  1.03     1.04    96.59
> 
> 
> mdadm -X /dev/nvme1n1
>          Filename : /dev/nvme1n1
>             Magic : 6d746962
>           Version : 4
>              UUID : 17eadc76:a367542a:feb6e24e:d650576c
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 512 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 3726 bits (chunks), 1977 dirty (53.1%)
> 
> 
> bitmap internal 1024M
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=internal 
> --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=51.0MiB/s][w=13.1k IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=7120: Sun Jun 30 03:08:12 2024
>    write: IOPS=10.3k, BW=40.4MiB/s (42.4MB/s)(2425MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=6, max=18135, avg= 8.93, stdev=23.41
>      clat (usec): min=3, max=10459, avg=86.97, stdev=342.95
>       lat (usec): min=63, max=22927, avg=95.90, stdev=344.33
>      clat percentiles (usec):
>       |  1.00th=[   62],  5.00th=[   63], 10.00th=[   64], 20.00th=[   65],
>       | 30.00th=[   65], 40.00th=[   66], 50.00th=[   67], 60.00th=[   67],
>       | 70.00th=[   68], 80.00th=[   69], 90.00th=[   70], 95.00th=[   74],
>       | 99.00th=[  133], 99.50th=[  155], 99.90th=[ 5997], 99.95th=[ 5997],
>       | 99.99th=[ 6063]
>     bw (  KiB/s): min=  616, max=52968, per=99.80%, avg=41305.95, 
> stdev=20465.79, samples=119
>     iops        : min=  154, max=13242, avg=10326.47, stdev=5116.44, 
> samples=119
>    lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=98.64%, 250=1.00%
>    lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
>    lat (msec)   : 2=0.01%, 4=0.01%, 10=0.33%, 20=0.01%
>    cpu          : usr=1.89%, sys=12.74%, ctx=620837, majf=0, minf=170751
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,620822,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s 
> (42.4MB/s-42.4MB/s), io=2425MiB (2543MB), run=60001-60001msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    1.27    0.00    0.00   98.70
> 
> Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> md127    0.00     0.00   0.00    0.00   0.00     0.00      18216.93  
> 71.16  0.00    0.00   0.05     4.00      0.00  0.00     0.88    100.00
> nvme1n1  7256.20  28.34  0.00    0.00   0.01     4.00      7256.27   
> 28.34  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.71
> nvme2n1  7302.53  28.53  0.00    0.00   0.01     4.00      7302.53   
> 28.53  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.73
> nvme3n1  7278.47  28.43  0.00    0.00   0.01     4.00      7278.53   
> 28.43  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.57
> nvme4n1  7303.93  28.53  0.00    0.00   0.01     4.00      7303.93   
> 28.53  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.74
> nvme5n1  7292.67  28.49  0.00    0.00   0.02     4.00      7292.60   
> 28.49  0.00    0.00   0.02     4.00      0.00  0.00     0.22    99.69
> 
> 
> 
> mdadm -X /dev/nvme1n1
>          Filename : /dev/nvme1n1
>             Magic : 6d746962
>           Version : 4
>              UUID : a0c7ad14:50689e41:e065a166:4935a186
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 1 GB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)
> 
> 
> bitmap internal 2G
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=internal 
> --bitmap-chunk=2G /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=74.7MiB/s][w=19.1k IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=7696: Sun Jun 30 03:30:40 2024
>    write: IOPS=17.0k, BW=66.5MiB/s (69.8MB/s)(3993MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=18094, avg= 4.79, stdev=17.94
>      clat (usec): min=5, max=10352, avg=53.37, stdev=181.29
>       lat (usec): min=41, max=22883, avg=58.16, stdev=182.72
>      clat percentiles (usec):
>       |  1.00th=[   43],  5.00th=[   44], 10.00th=[   45], 20.00th=[   46],
>       | 30.00th=[   46], 40.00th=[   47], 50.00th=[   47], 60.00th=[   48],
>       | 70.00th=[   48], 80.00th=[   49], 90.00th=[   50], 95.00th=[   52],
>       | 99.00th=[   90], 99.50th=[  126], 99.90th=[  873], 99.95th=[ 5997],
>       | 99.99th=[ 6063]
>     bw (  KiB/s): min=  640, max=80168, per=99.91%, avg=68080.94, 
> stdev=21547.29, samples=119
>     iops        : min=  160, max=20042, avg=17020.24, stdev=5386.82, 
> samples=119
>    lat (usec)   : 10=0.01%, 50=92.06%, 100=7.10%, 250=0.73%, 500=0.01%
>    lat (usec)   : 750=0.01%, 1000=0.01%
>    lat (msec)   : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01%
>    cpu          : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s 
> (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    1.15    0.00    0.00   98.81
> 
> Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> md127    0.00     0.00   0.00    0.00   0.00     0.00      18836.40  
> 73.58  0.00    0.00   0.05     4.00      0.00  0.00     0.87    100.00
> nvme1n1  7505.27  29.32  0.00    0.00   0.01     4.00      7505.40   
> 29.32  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.93
> nvme2n1  7510.00  29.34  0.00    0.00   0.01     4.00      7510.07   
> 29.34  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.90
> nvme3n1  7561.40  29.54  0.00    0.00   0.01     4.00      7561.47   
> 29.54  0.00    0.00   0.01     4.00      0.00  0.00     0.19    100.00
> nvme4n1  7543.07  29.47  0.00    0.00   0.01     4.00      7543.07   
> 29.47  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.91
> nvme5n1  7552.73  29.50  0.00    0.00   0.01     4.00      7552.80   
> 29.50  0.00    0.00   0.01     4.00      0.00  0.00     0.22    99.91
> 
> 
> 
> mdadm -X /dev/nvme1n1
>          Filename : /dev/nvme1n1
>             Magic : 6d746962
>           Version : 4
>              UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 2 GB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 932 bits (chunks), 932 dirty (100.0%)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> bitmap external 64M
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin 
> --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024
>    write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04
>      clat (usec): min=3, max=8410, avg=50.79, stdev=27.03
>       lat (usec): min=42, max=22140, avg=56.90, stdev=35.13
>      clat percentiles (usec):
>       |  1.00th=[   41],  5.00th=[   42], 10.00th=[   44], 20.00th=[   46],
>       | 30.00th=[   47], 40.00th=[   48], 50.00th=[   49], 60.00th=[   50],
>       | 70.00th=[   51], 80.00th=[   52], 90.00th=[   56], 95.00th=[   68],
>       | 99.00th=[   93], 99.50th=[  124], 99.90th=[  155], 99.95th=[  237],
>       | 99.99th=[ 1037]
>     bw (  KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, 
> stdev=7769.33, samples=119
>     iops        : min= 9530, max=20644, avg=17350.76, stdev=1942.33, 
> samples=119
>    lat (usec)   : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76%
>    lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
>    lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
>    cpu          : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s 
> (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    1.15    0.00    0.00   98.81
> 
> Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> md127    0.00     0.00   0.00    0.00   0.00     0.00      18428.60  
> 71.99  0.00    0.00   0.05     4.00      0.00  0.00     0.87    99.99
> nvme1n1  7399.40  28.90  0.00    0.00   0.01     4.00      7399.47   
> 28.90  0.00    0.00   0.01     4.00      0.00  0.00     0.17    99.73
> nvme2n1  7361.20  28.75  0.00    0.00   0.01     4.00      7361.27   
> 28.75  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.63
> nvme3n1  7376.67  28.82  0.00    0.00   0.01     4.00      7376.73   
> 28.82  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.63
> nvme4n1  7367.27  28.78  0.00    0.00   0.01     4.00      7367.20   
> 28.78  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.65
> nvme5n1  7352.47  28.72  0.00    0.00   0.01     4.00      7352.67   
> 28.72  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.73
> nvme8n1  0.47     0.00   0.00    0.00   0.00     4.00      293.40    
> 1.15   0.00    0.00   0.02     4.00      0.00  0.00     0.01    24.24
> 
> 
> 
> mdadm -X /bitmap/bitmap.bin
>          Filename : /bitmap/bitmap.bin
>             Magic : 6d746962
>           Version : 4
>              UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 64 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 29807 bits (chunks), 29665 dirty (99.5%)
> 
> 
> 
> bitmap external 1024M
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin 
> --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K 
> --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024
>    write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16
>      clat (usec): min=22, max=3706, avg=46.37, stdev=20.38
>       lat (usec): min=40, max=21951, avg=50.49, stdev=28.81
>      clat percentiles (usec):
>       |  1.00th=[   40],  5.00th=[   41], 10.00th=[   42], 20.00th=[   42],
>       | 30.00th=[   43], 40.00th=[   44], 50.00th=[   45], 60.00th=[   47],
>       | 70.00th=[   48], 80.00th=[   49], 90.00th=[   50], 95.00th=[   52],
>       | 99.00th=[   86], 99.50th=[  120], 99.90th=[  157], 99.95th=[  233],
>       | 99.99th=[  906]
>     bw (  KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, 
> stdev=5410.81, samples=119
>     iops        : min=15404, max=21182, avg=19599.66, stdev=1352.70, 
> samples=119
>    lat (usec)   : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01%
>    lat (usec)   : 1000=0.01%
>    lat (msec)   : 2=0.01%, 4=0.01%
>    cpu          : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s 
> (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    1.03    0.00    0.00   98.93
> 
> Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> md127    0.00     0.00   0.00    0.00   0.00     0.00      20758.20  
> 81.09  0.00    0.00   0.04     4.00      0.00  0.00     0.89    100.00
> nvme1n1  8291.67  32.39  0.00    0.00   0.01     4.00      8291.73   
> 32.39  0.00    0.00   0.01     4.00      0.00  0.00     0.22    99.87
> nvme2n1  8270.93  32.31  0.00    0.00   0.01     4.00      8271.07   
> 32.31  0.00    0.00   0.01     4.00      0.00  0.00     0.19    99.79
> nvme3n1  8310.67  32.46  0.00    0.00   0.01     4.00      8310.80   
> 32.46  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.83
> nvme4n1  8300.67  32.42  0.00    0.00   0.01     4.00      8300.67   
> 32.42  0.00    0.00   0.01     4.00      0.00  0.00     0.23    99.76
> nvme5n1  8342.13  32.59  0.00    0.00   0.02     4.00      8342.13   
> 32.59  0.00    0.00   0.01     4.00      0.00  0.00     0.25    99.85
> nvme8n1  0.33     0.00   0.00    0.00   8.40     4.00      0.00      
> 0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.33
> 
> 
> mdadm -X /bitmap/bitmap.bin
>          Filename : /bitmap/bitmap.bin
>             Magic : 6d746962
>           Version : 4
>              UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee
>            Events : 3
>    Events Cleared : 3
>             State : OK
>         Chunksize : 1 GB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
>            Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)
> 
> 
> 
> bitmap none
> ================================================================
> for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done
> 
> mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 
> --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1
> 
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
> 
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size
> 
> 
> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Raid5
> 
> Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s]
> Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024
>    write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46
>      clat (usec): min=4, max=3694, avg=44.31, stdev=21.60
>       lat (usec): min=39, max=13681, avg=47.81, stdev=24.98
>      clat percentiles (usec):
>       |  1.00th=[   39],  5.00th=[   40], 10.00th=[   41], 20.00th=[   41],
>       | 30.00th=[   42], 40.00th=[   43], 50.00th=[   43], 60.00th=[   44],
>       | 70.00th=[   45], 80.00th=[   46], 90.00th=[   48], 95.00th=[   50],
>       | 99.00th=[   87], 99.50th=[  117], 99.90th=[  157], 99.95th=[  229],
>       | 99.99th=[  963]
>     bw (  KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, 
> stdev=3696.94, samples=119
>     iops        : min=18528, max=21678, avg=20621.59, stdev=924.23, 
> samples=119
>    lat (usec)   : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01%
>    lat (usec)   : 750=0.01%, 1000=0.01%
>    lat (msec)   : 2=0.01%, 4=0.01%
>    cpu          : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s 
> (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec
> 
> 
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    1.06    0.00    0.00   98.91
> 
> Device   r/s      rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> md127    0.00     0.00   0.00    0.00   0.00     0.00      20040.87  
> 78.28  0.00    0.00   0.04     4.00      0.00  0.00     0.89    99.99
> nvme1n1  8016.80  31.32  0.00    0.00   0.01     4.00      8016.93   
> 31.32  0.00    0.00   0.01     4.00      0.00  0.00     0.21    99.68
> nvme2n1  7983.20  31.18  0.00    0.00   0.01     4.00      7983.20   
> 31.18  0.00    0.00   0.01     4.00      0.00  0.00     0.18    99.74
> nvme3n1  8030.07  31.37  0.00    0.00   0.01     4.00      8030.20   
> 31.37  0.00    0.00   0.01     4.00      0.00  0.00     0.20    99.62
> nvme4n1  8016.40  31.31  0.00    0.00   0.01     4.00      8016.40   
> 31.31  0.00    0.00   0.01     4.00      0.00  0.00     0.23    99.73
> nvme5n1  8034.87  31.39  0.00    0.00   0.02     4.00      8035.00   
> 31.39  0.00    0.00   0.01     4.00      0.00  0.00     0.24    99.71
> 
> 
> 
> 
> single disk 1K RW
> ================================================================
> fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Single
> Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 
> 1024B-1024B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s]
> Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024
>    write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12
>      clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57
>       lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79
>      clat percentiles (usec):
>       |  1.00th=[   15],  5.00th=[   15], 10.00th=[   15], 20.00th=[   15],
>       | 30.00th=[   15], 40.00th=[   15], 50.00th=[   15], 60.00th=[   15],
>       | 70.00th=[   15], 80.00th=[   16], 90.00th=[   16], 95.00th=[   16],
>       | 99.00th=[   18], 99.50th=[   22], 99.90th=[   32], 99.95th=[   33],
>       | 99.99th=[  206]
>     bw (  KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, 
> stdev=561.60, samples=119
>     iops        : min=51884, max=56778, avg=55394.44, stdev=561.62, 
> samples=119
>    lat (nsec)   : 750=0.01%, 1000=0.01%
>    lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54%
>    lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%
>    lat (msec)   : 10=0.01%
>    cpu          : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s 
> (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec
> 
> Disk stats (read/write):
>    nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, 
> util=99.71%
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.04    0.00    0.42    0.00    0.00   99.54
> 
> Device   r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       wMB/s  
> wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> nvme8n1  0.00  0.00   0.00    0.00   0.00     0.00      55496.93  54.20  
> 0.00    0.00   0.01     1.00      0.00  0.00     0.75    100.00
> 
> 
> 
> 
> 
> single disk 4K RW
> ================================================================
> blockdev --setra 256 /dev/nvme8n1
> 
> fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k 
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting 
> --time_based --name=Single
> 
> Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.35
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s]
> Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024
>    write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone 
> resets
>      slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59
>      clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99
>       lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36
>      clat percentiles (usec):
>       |  1.00th=[   12],  5.00th=[   12], 10.00th=[   12], 20.00th=[   12],
>       | 30.00th=[   12], 40.00th=[   12], 50.00th=[   12], 60.00th=[   12],
>       | 70.00th=[   12], 80.00th=[   12], 90.00th=[   12], 95.00th=[   12],
>       | 99.00th=[   14], 99.50th=[   17], 99.90th=[   28], 99.95th=[   34],
>       | 99.99th=[  204]
>     bw (  KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, 
> stdev=1902.45, samples=119
>     iops        : min=66018, max=69392, avg=68907.43, stdev=475.55, 
> samples=119
>    lat (nsec)   : 750=0.01%, 1000=0.01%
>    lat (usec)   : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38%
>    lat (usec)   : 100=0.01%, 250=0.02%, 1000=0.01%
>    lat (msec)   : 10=0.01%
>    cpu          : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204
>    IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>  >=64=0.0%
>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>  >=64=0.0%
>       issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0
>       latency   : target=0, window=0, percentile=100.00%, depth=1
> 
> Run status group 0 (all jobs):
>    WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), 
> io=15.8GiB (16.9GB), run=60001-60001msec
> 
> Disk stats (read/write):
>    nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, 
> util=99.89%
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.08    0.00    0.57    0.00    0.00   99.35
> 
> Device   r/s   rMB/s  rrqm/s  %rrqm  r_await  rareq-sz  w/s       
> wMB/s   wrqm/s  %wrqm  w_await  wareq-sz  f/s   f_await  aqu-sz  %util
> nvme8n1  0.00  0.00   0.00    0.00   0.00     0.00      69041.33  
> 269.69  0.00    0.00   0.01     4.00      0.00  0.00     0.68    100.00
> 
> .
>
Dragan Milivojević Nov. 11, 2024, 2:07 p.m. UTC | #13
On 11/11/2024 14:02, Yu Kuai wrote:

> TBO, I don't know what is this. :(

It's just a website where you can post text content, notes basically. I use it
with mailing lists where messages get rejected if I attach a file.
I prefer not to include long debug logs, test logs etc. in the body, as they will just
get quoted an endless number of times and pollute the thread. Old habit from the days
when blockquoting was a thing and kilobytes mattered.


> Yes, this is a known problem, the gap here is that I don't think
> an external bitmap is much help, while your results disagree.
> 
>> bitmap internal 64M
>> ================================================================
>> mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1
>>
>> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
>> blockdev --setra 1024 /dev/md/raid5
>>
>> echo 8 > /sys/block/md127/md/group_thread_cnt
>> echo 8192 > /sys/block/md127/md/stripe_cache_size
>>

  
> The array set up is fine. And the following external bitmap is using
> /bitmap/bitmap.bin, does the back-end storage of this file the same as
> test device?


No, I used one of the extra devices.




>> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5
> 
> Then this is what I suspected, the above test is quite limited and can't
> replace the real world workload, 1 thread 1 iodepth with 4k randwrite.



That is true. I went down this rabbit hole because I was getting worse results
with a RAID5 array than with a single disk under a real world workload, PostgreSQL in my case.
I chose these test parameters as a worst case scenario.

I did test with other parameters: a whole battery of tests with iodepths of 1 and 8 and
BS sizes from 4K, 8K, 16K all the way up to 2048K. They all show similar behaviour.
For example:

5 disk RAID5, 64K chunk, default internal bitmap, iodepth 8

randread
BS     BW     IOPS    LAT      LAT_DEV  SS  SS_perc  USR_CPU  SYS_CPU
4K     574    146993  53.58    21.63    0   7.50%    13.25    59.30
8K     1127   144268  54.75    19.48    0   4.39%    11.25    61.50
16K    2084   133387  59.32    16.53    0   2.87%    10.52    63.03
32K    3942   126151  62.67    21.59    1   1.30%    13.03    60.14
64K    7225   115606  68.64    19.31    1   1.03%    9.58     65.30
128K   7947   63580   124.73   22.66    1   1.91%    8.94     63.48
256K   9216   36867   216.49   26.47    1   0.51%    2.65     69.43
512K   8065   16130   494.82   42.43    1   1.25%    2.41     72.56
1024K  8130   8130    983.01   64.22    1   0.97%    0.92     73.38
2048K  10685  5342    1496.28  132.24   0   2.50%    0.75     68.89


randwrite
BS     BW    IOPS  LAT       LAT_DEV  SS  SS_perc  USR_CPU  SYS_CPU
4K     1     375   21318.71  5059.72  0   41.06%   0.10     0.38
8K     2     354   22548.71  3084.57  0   4.90%    0.11     0.35
16K    5     346   23107.64  2517.95  0   9.77%    0.11     0.49
32K    13    420   19001.29  5500.62  0   34.75%   0.22     1.30
64K    33    530   15064.25  3916.28  0   8.07%    0.29     2.92
128K   79    637   12549.72  3249.85  0   3.99%    0.72     4.60
256K   184   739   10812.12  2576.32  0   34.02%   3.81     4.32
512K   307   615   12995.86  2891.70  0   2.99%    2.31     4.31
1024K  611   611   13071.85  3287.53  0   6.96%    3.60     8.42
2048K  1051  525   15209.81  3562.27  0   35.79%   8.67     20.12


No bitmap, array with the same settings (the previous array was shut down, the drives were wiped with nvme format)


randread
BS     BW     IOPS    LAT_µs   LAT_DEV  SS  SS_perc  USR_CPU  SYS_CPU
4K     571    146399  53.80    25.07    0   5.17%    13.54    58.45
8K     1147   146866  53.87    17.48    0   3.10%    11.20    59.26
16K    1970   126136  62.70    20.11    0   2.64%    11.06    58.88
32K    3519   112637  70.36    23.60    1   1.98%    11.05    54.55
64K    6502   104037  76.27    21.71    1   1.52%    9.60     60.40
128K   7886   63093   126.05   21.88    1   1.19%    6.84     65.40
256K   9446   37787   211.05   27.00    1   0.77%    3.60     69.37
512K   8397   16794   475.58   42.16    1   1.45%    1.85     71.99
1024K  8510   8510    939.13   55.02    1   1.01%    1.00     72.60
2048K  11035  5517    1448.77  84.14    1   1.99%    0.74     73.49


randwrite
BS     BW    IOPS   LAT_µs   LAT_DEV  SS  SS_perc  USR_CPU  SYS_CPU
4K     195   50151  158.96   48.56    1   1.13%    5.74     34.68
8K     264   33897  235.39   77.11    1   1.32%    4.60     34.46
16K    343   22003  362.88   111.80   1   1.70%    5.34     37.17
32K    645   20642  386.83   145.86   0   33.84%   6.48     45.15
64K    917   14680  543.97   170.23   0   3.01%    6.05     53.27
128K   1416  11332  704.94   202.18   0   4.66%    9.69     57.63
256K   1394  5576   1433.60  375.88   1   1.52%    8.53     24.93
512K   1726  3452   2316.19  500.19   1   1.18%    12.38    30.54
1024K  2598  2598   3077.47  629.37   0   2.53%    18.74    47.02
2048K  2457  1228   6508.20  1825.67  0   3.32%    28.70    61.01



Reads are fine but writes are many times slower ...
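For reference, the block-size sweep behind the tables above can be scripted roughly like this. It's a dry run that only prints the fio command lines; the device path, iodepth, runtime and job name are assumptions taken from the earlier commands in this thread, so adjust them before running anything for real:

```shell
#!/bin/sh
# Dry-run sketch of the block-size sweep: print one fio command line per
# BS value instead of executing it.  DEV and the fio options are
# assumptions based on the commands quoted earlier in the thread.
sweep_cmds() {
    DEV=/dev/md/raid5
    for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k 2048k; do
        echo "fio --filename=$DEV --direct=1 --rw=randwrite --bs=$bs" \
             "--ioengine=libaio --iodepth=8 --runtime=60 --numjobs=1" \
             "--group_reporting --time_based --name=sweep-$bs"
    done
}
sweep_cmds
```

Piping the output through `sh` (after review) runs the whole sweep; the randread variant only needs `--rw=randread`.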


> 
> I still can't believe your test result, and I can't figure out why
> internal bitmap is so slow. Hence I use ramdisk(10GB) to create a raid5,
> and use the same fio script to test, the result is quite different from
> yours:
> 
> ram0:            981MiB/s
> non-bitmap:        132MiB/s
> internal-bitmap:    95.5MiB/s
>>

I don't know, I can provide full fio test logs including fio "tracing" for these iodepth 8
tests if that would make any difference.



> There is absolutely something wrong here, it doesn't make sense to me
> that internal bitmap is so slow. However, I have no idea until you can
> provide the perf result.

I may be able to find time to do that over the weekend, but don’t hold me to it.
The test setup will not be the same, the server is in production ...
I did leave some "spare" partitions on all drives to investigate this issue further
but have not found the time.

Please send me an example of how you would like me to run the perf tool, I haven't used
it much.

Thanks
Dragan
Mariusz Tkaczyk Nov. 12, 2024, 7:54 a.m. UTC | #14
On Thu, 7 Nov 2024 17:28:43 -0800
Song Liu <song@kernel.org> wrote:

> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > 在 2024/11/08 7:41, Song Liu 写道:  
> > > On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:  
> > >>
> > >> From: Yu Kuai <yukuai3@huawei.com>
> > >>
> > >> The bitmap file has been marked as deprecated for more than a year now,
> > >> let's remove it, and we don't need to care about this case in the new
> > >> bitmap.
> > >>
> > >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>  
> > >
> > > What happens when an old array with bitmap file boots into a kernel
> > > without bitmap file support?  
> >
> > If mdadm is used with bitmap file support, then kenel will just ignore
> > the bitmap, the same as none bitmap. Perhaps it's better to leave a
> > error message?  
> 
> Yes, we should print some error message before assembling the array.
> 
> > And if mdadm is updated, reassemble will fail.  

It would be great if mdadm could just ignore it too. It comes from the config file, so
you can simply ignore a bitmap entry if it is anything other than "internal" or
"clustered". You can print an error, however you must do it somewhere else (outside
config.c), otherwise the user would be prompted about it on every config
read - we probably don't need to make it that noisy, but maybe we should (users
may not notice the change if we are not screaming it loudly). I have no opinion here.

The first rule is always data access - we should not break that if possible. In this
case I think it is possible to keep such arrays assembled.

> 
> I think we should ship this with 6.14 (not 6.13), so that we have
> more time testing different combinations of old/new mdadm
> and kernel. WDYT?

Later is better because it decreases the possibility that someone will hit the
case of a new kernel with an old mdadm, where some ioctl/sysfs write failures
will probably be observed.

I would say that we should wait around one year after removing it from mdadm.
That would be my preference.

I will merge Kuai's changes soon, before the release. I think it is valuable to
have it blocked in the new mdadm release.

Mariusz
Yu Kuai Nov. 13, 2024, 1:18 a.m. UTC | #15
Hi,

在 2024/11/11 22:07, Dragan Milivojević 写道:
> Reads are fine but writes are many times slower ...
> 
> 
>>
>> I still can't believe your test result, and I can't figure out why
>> internal bitmap is so slow. Hence I use ramdisk(10GB) to create a raid5,
>> and use the same fio script to test, the result is quite different from
>> yours:
>>
>> ram0:            981MiB/s
>> non-bitmap:        132MiB/s
>> internal-bitmap:    95.5MiB/s
>>>

So, I waited for Paul to have a chance to give it a test on real disks;
still, the results are similar to the above.
> 
> I don't know, I can provide full fio test logs including fio "tracing" 
> for these iodepth 8
> tests if that would make any difference.
> 

No, I don't need fio logs.
> 
> 
>> There is absolutely something wrong here, it doesn't make sense to me
>> that internal bitmap is so slow. However, I have no idea until you can
>> provide the perf result.
> 
> I may be able to find time to do that over the weekend, but don’t hold 
> me to it.
> The test setup will not be the same, server is in production ...
> I did leave some "spare" partitions on all drives to investigate this 
> issue further
> but did not find the time.
> 
> Please send me an example of how would you like me to run the perf tool, 
> I haven't used
> it much.

You can see examples here:

https://github.com/brendangregg/FlameGraph

To be short, while the test is running:

perf record -a -g -- sleep 10
perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl
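The two commands above can be packaged into a small dry-run helper. It prints the pipeline instead of executing it (perf record needs root and a running workload); the FlameGraph checkout path `$FG` is an assumption, adjust it to wherever the repo was cloned:

```shell
#!/bin/sh
# Print the perf + FlameGraph pipeline.  $FG is an assumed path to a
# clone of github.com/brendangregg/FlameGraph; nothing is executed here.
FG=${FG:-$HOME/FlameGraph}
perf_cmds() {
    echo "perf record -a -g -- sleep 10"
    echo "perf script -i perf.data | $FG/stackcollapse-perf.pl | $FG/flamegraph.pl > flame.svg"
}
perf_cmds
```

The resulting flame.svg can then be opened in a browser and attached (or linked) in a reply.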

BTW, you said that you're using a production environment; this will
probably make it hard to analyze performance.

Thanks,
Kuai
Dragan Milivojević Nov. 13, 2024, 1:25 a.m. UTC | #16
On 13/11/2024 02:18, Yu Kuai wrote:
>>> ram0:            981MiB/s
>>> non-bitmap:        132MiB/s
>>> internal-bitmap:    95.5MiB/s
>>>>

> 
> So, I waited for Paul to have a chance to give it a test for real disks,
> still, results are similar to above.

That is interesting. How are you running those tests?
I should try them on my hardware as well.

> 
> You can see examples here:
> 
> https://github.com/brendangregg/FlameGraph
> 
> To be short, while test is running:
> 
> perf record -a -g -- sleep 10
> perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl
> 
> BTW, you said that you're using production environment, this will
> probably make it hard to analyze performance.

I may be able to move things around for the weekend, we will see.

Thanks
Dragan
diff mbox series

Patch

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 29da10e6f703..6895883fc622 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -167,8 +167,6 @@  struct bitmap {
 	int need_sync;
 
 	struct bitmap_storage {
-		/* backing disk file */
-		struct file *file;
 		/* cached copy of the bitmap file superblock */
 		struct page *sb_page;
 		unsigned long sb_index;
@@ -495,135 +493,6 @@  static void write_sb_page(struct bitmap *bitmap, unsigned long pg_index,
 
 static void md_bitmap_file_kick(struct bitmap *bitmap);
 
-#ifdef CONFIG_MD_BITMAP_FILE
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-	struct buffer_head *bh = page_buffers(page);
-
-	while (bh && bh->b_blocknr) {
-		atomic_inc(&bitmap->pending_writes);
-		set_buffer_locked(bh);
-		set_buffer_mapped(bh);
-		submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
-		bh = bh->b_this_page;
-	}
-
-	if (wait)
-		wait_event(bitmap->write_wait,
-			   atomic_read(&bitmap->pending_writes) == 0);
-}
-
-static void end_bitmap_write(struct buffer_head *bh, int uptodate)
-{
-	struct bitmap *bitmap = bh->b_private;
-
-	if (!uptodate)
-		set_bit(BITMAP_WRITE_ERROR, &bitmap->flags);
-	if (atomic_dec_and_test(&bitmap->pending_writes))
-		wake_up(&bitmap->write_wait);
-}
-
-static void free_buffers(struct page *page)
-{
-	struct buffer_head *bh;
-
-	if (!PagePrivate(page))
-		return;
-
-	bh = page_buffers(page);
-	while (bh) {
-		struct buffer_head *next = bh->b_this_page;
-		free_buffer_head(bh);
-		bh = next;
-	}
-	detach_page_private(page);
-	put_page(page);
-}
-
-/* read a page from a file.
- * We both read the page, and attach buffers to the page to record the
- * address of each block (using bmap).  These addresses will be used
- * to write the block later, completely bypassing the filesystem.
- * This usage is similar to how swap files are handled, and allows us
- * to write to a file with no concerns of memory allocation failing.
- */
-static int read_file_page(struct file *file, unsigned long index,
-		struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-	int ret = 0;
-	struct inode *inode = file_inode(file);
-	struct buffer_head *bh;
-	sector_t block, blk_cur;
-	unsigned long blocksize = i_blocksize(inode);
-
-	pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
-		 (unsigned long long)index << PAGE_SHIFT);
-
-	bh = alloc_page_buffers(page, blocksize);
-	if (!bh) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	attach_page_private(page, bh);
-	blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
-	while (bh) {
-		block = blk_cur;
-
-		if (count == 0)
-			bh->b_blocknr = 0;
-		else {
-			ret = bmap(inode, &block);
-			if (ret || !block) {
-				ret = -EINVAL;
-				bh->b_blocknr = 0;
-				goto out;
-			}
-
-			bh->b_blocknr = block;
-			bh->b_bdev = inode->i_sb->s_bdev;
-			if (count < blocksize)
-				count = 0;
-			else
-				count -= blocksize;
-
-			bh->b_end_io = end_bitmap_write;
-			bh->b_private = bitmap;
-			atomic_inc(&bitmap->pending_writes);
-			set_buffer_locked(bh);
-			set_buffer_mapped(bh);
-			submit_bh(REQ_OP_READ, bh);
-		}
-		blk_cur++;
-		bh = bh->b_this_page;
-	}
-
-	wait_event(bitmap->write_wait,
-		   atomic_read(&bitmap->pending_writes)==0);
-	if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
-		ret = -EIO;
-out:
-	if (ret)
-		pr_err("md: bitmap read error: (%dB @ %llu): %d\n",
-		       (int)PAGE_SIZE,
-		       (unsigned long long)index << PAGE_SHIFT,
-		       ret);
-	return ret;
-}
-#else /* CONFIG_MD_BITMAP_FILE */
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-}
-static int read_file_page(struct file *file, unsigned long index,
-		struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-	return -EIO;
-}
-static void free_buffers(struct page *page)
-{
-	put_page(page);
-}
-#endif /* CONFIG_MD_BITMAP_FILE */
-
 /*
  * bitmap file superblock operations
  */
@@ -642,10 +511,7 @@  static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
 		pg_index += store->sb_index;
 	}
 
-	if (store->file)
-		write_file_page(bitmap, page, wait);
-	else
-		write_sb_page(bitmap, pg_index, page, wait);
+	write_sb_page(bitmap, pg_index, page, wait);
 }
 
 /*
@@ -655,18 +521,15 @@  static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
  */
 static void md_bitmap_wait_writes(struct bitmap *bitmap)
 {
-	if (bitmap->storage.file)
-		wait_event(bitmap->write_wait,
-			   atomic_read(&bitmap->pending_writes)==0);
-	else
-		/* Note that we ignore the return value.  The writes
-		 * might have failed, but that would just mean that
-		 * some bits which should be cleared haven't been,
-		 * which is safe.  The relevant bitmap blocks will
-		 * probably get written again, but there is no great
-		 * loss if they aren't.
-		 */
-		md_super_wait(bitmap->mddev);
+	/*
+	 * Note that we ignore the return value.  The writes
+	 * might have failed, but that would just mean that
+	 * some bits which should be cleared haven't been,
+	 * which is safe.  The relevant bitmap blocks will
+	 * probably get written again, but there is no great
+	 * loss if they aren't.
+	 */
+	md_super_wait(bitmap->mddev);
 }
 
 
@@ -704,11 +567,8 @@  static void bitmap_update_sb(void *data)
 					   bitmap_info.space);
 	kunmap_atomic(sb);
 
-	if (bitmap->storage.file)
-		write_file_page(bitmap, bitmap->storage.sb_page, 1);
-	else
-		write_sb_page(bitmap, bitmap->storage.sb_index,
-			      bitmap->storage.sb_page, 1);
+	write_sb_page(bitmap, bitmap->storage.sb_index, bitmap->storage.sb_page,
+		      1);
 }
 
 static void bitmap_print_sb(struct bitmap *bitmap)
@@ -821,7 +681,7 @@  static int md_bitmap_read_sb(struct bitmap *bitmap)
 	struct page *sb_page;
 	loff_t offset = 0;
 
-	if (!bitmap->storage.file && !bitmap->mddev->bitmap_info.offset) {
+	if (!bitmap->mddev->bitmap_info.offset) {
 		chunksize = 128 * 1024 * 1024;
 		daemon_sleep = 5 * HZ;
 		write_behind = 0;
@@ -851,16 +711,8 @@  static int md_bitmap_read_sb(struct bitmap *bitmap)
 			bitmap->cluster_slot, offset);
 	}
 
-	if (bitmap->storage.file) {
-		loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
-		int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
-
-		err = read_file_page(bitmap->storage.file, 0,
-				bitmap, bytes, sb_page);
-	} else {
-		err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
-				   sizeof(bitmap_super_t));
-	}
+	err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
+			   sizeof(bitmap_super_t));
 	if (err)
 		return err;
 
@@ -1062,25 +914,18 @@  static int md_bitmap_storage_alloc(struct bitmap_storage *store,
 
 static void md_bitmap_file_unmap(struct bitmap_storage *store)
 {
-	struct file *file = store->file;
 	struct page *sb_page = store->sb_page;
 	struct page **map = store->filemap;
 	int pages = store->file_pages;
 
 	while (pages--)
 		if (map[pages] != sb_page) /* 0 is sb_page, release it below */
-			free_buffers(map[pages]);
+			put_page(map[pages]);
 	kfree(map);
 	kfree(store->filemap_attr);
 
 	if (sb_page)
-		free_buffers(sb_page);
-
-	if (file) {
-		struct inode *inode = file_inode(file);
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		fput(file);
-	}
+		put_page(sb_page);
 }
 
 /*
@@ -1092,14 +937,8 @@  static void md_bitmap_file_kick(struct bitmap *bitmap)
 {
 	if (!test_and_set_bit(BITMAP_STALE, &bitmap->flags)) {
 		bitmap_update_sb(bitmap);
-
-		if (bitmap->storage.file) {
-			pr_warn("%s: kicking failed bitmap file %pD4 from array!\n",
-				bmname(bitmap), bitmap->storage.file);
-
-		} else
-			pr_warn("%s: disabling internal bitmap due to errors\n",
-				bmname(bitmap));
+		pr_warn("%s: disabling internal bitmap due to errors\n",
+			bmname(bitmap));
 	}
 }
 
@@ -1319,13 +1158,12 @@  static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 	struct mddev *mddev = bitmap->mddev;
 	unsigned long chunks = bitmap->counts.chunks;
 	struct bitmap_storage *store = &bitmap->storage;
-	struct file *file = store->file;
 	unsigned long node_offset = 0;
 	unsigned long bit_cnt = 0;
 	unsigned long i;
 	int ret;
 
-	if (!file && !mddev->bitmap_info.offset) {
+	if (!mddev->bitmap_info.offset) {
 		/* No permanent bitmap - fill with '1s'. */
 		store->filemap = NULL;
 		store->file_pages = 0;
@@ -1340,15 +1178,6 @@  static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 		return 0;
 	}
 
-	if (file && i_size_read(file->f_mapping->host) < store->bytes) {
-		pr_warn("%s: bitmap file too short %lu < %lu\n",
-			bmname(bitmap),
-			(unsigned long) i_size_read(file->f_mapping->host),
-			store->bytes);
-		ret = -ENOSPC;
-		goto err;
-	}
-
 	if (mddev_is_clustered(mddev))
 		node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
 
@@ -1362,11 +1191,7 @@  static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 		else
 			count = PAGE_SIZE;
 
-		if (file)
-			ret = read_file_page(file, i, bitmap, count, page);
-		else
-			ret = read_sb_page(mddev, 0, page, i + node_offset,
-					   count);
+		ret = read_sb_page(mddev, 0, page, i + node_offset, count);
 		if (ret)
 			goto err;
 	}
@@ -1444,10 +1269,6 @@  static void bitmap_write_all(struct mddev *mddev)
 	if (!bitmap || !bitmap->storage.filemap)
 		return;
 
-	/* Only one copy, so nothing needed */
-	if (bitmap->storage.file)
-		return;
-
 	for (i = 0; i < bitmap->storage.file_pages; i++)
 		set_page_attr(bitmap, i, BITMAP_PAGE_NEEDWRITE);
 	bitmap->allclean = 0;
@@ -2105,14 +1926,11 @@  static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
 {
 	struct bitmap *bitmap;
 	sector_t blocks = mddev->resync_max_sectors;
-	struct file *file = mddev->bitmap_info.file;
 	int err;
 	struct kernfs_node *bm = NULL;
 
 	BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
 
-	BUG_ON(file && mddev->bitmap_info.offset);
-
 	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
 		pr_notice("md/raid:%s: array with journal cannot have bitmap\n",
 			  mdname(mddev));
@@ -2140,15 +1958,6 @@  static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
 	} else
 		bitmap->sysfs_can_clear = NULL;
 
-	bitmap->storage.file = file;
-	if (file) {
-		get_file(file);
-		/* As future accesses to this file will use bmap,
-		 * and bypass the page cache, we must sync the file
-		 * first.
-		 */
-		vfs_fsync(file, 1);
-	}
 	/* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
 	if (!mddev->bitmap_info.external) {
 		/*
@@ -2352,7 +2161,6 @@  static int bitmap_get_stats(void *data, struct md_bitmap_stats *stats)
 
 	storage = &bitmap->storage;
 	stats->file_pages = storage->file_pages;
-	stats->file = storage->file;
 
 	stats->behind_writes = atomic_read(&bitmap->behind_writes);
 	stats->behind_wait = wq_has_sleeper(&bitmap->behind_wait);
@@ -2383,11 +2191,6 @@  static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 	long pages;
 	struct bitmap_page *new_bp;
 
-	if (bitmap->storage.file && !init) {
-		pr_info("md: cannot resize file-based bitmap\n");
-		return -EINVAL;
-	}
-
 	if (chunksize == 0) {
 		/* If there is enough space, leave the chunk size unchanged,
 		 * else increase by factor of two until there is enough space.
@@ -2421,7 +2224,7 @@  static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 
 	chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
 	memset(&store, 0, sizeof(store));
-	if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
+	if (bitmap->mddev->bitmap_info.offset)
 		ret = md_bitmap_storage_alloc(&store, chunks,
 					      !bitmap->mddev->bitmap_info.external,
 					      mddev_is_clustered(bitmap->mddev)
@@ -2443,9 +2246,6 @@  static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 	if (!init)
 		bitmap->mddev->pers->quiesce(bitmap->mddev, 1);
 
-	store.file = bitmap->storage.file;
-	bitmap->storage.file = NULL;
-
 	if (store.sb_page && bitmap->storage.sb_page)
 		memcpy(page_address(store.sb_page),
 		       page_address(bitmap->storage.sb_page),
@@ -2582,9 +2382,7 @@  static ssize_t
 location_show(struct mddev *mddev, char *page)
 {
 	ssize_t len;
-	if (mddev->bitmap_info.file)
-		len = sprintf(page, "file");
-	else if (mddev->bitmap_info.offset)
+	if (mddev->bitmap_info.offset)
 		len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
 	else
 		len = sprintf(page, "none");
@@ -2608,8 +2406,7 @@  location_store(struct mddev *mddev, const char *buf, size_t len)
 		}
 	}
 
-	if (mddev->bitmap || mddev->bitmap_info.file ||
-	    mddev->bitmap_info.offset) {
+	if (mddev->bitmap || mddev->bitmap_info.offset) {
 		/* bitmap already configured.  Only option is to clear it */
 		if (strncmp(buf, "none", 4) != 0) {
 			rv = -EBUSY;
@@ -2618,22 +2415,11 @@  location_store(struct mddev *mddev, const char *buf, size_t len)
 
 		bitmap_destroy(mddev);
 		mddev->bitmap_info.offset = 0;
-		if (mddev->bitmap_info.file) {
-			struct file *f = mddev->bitmap_info.file;
-			mddev->bitmap_info.file = NULL;
-			fput(f);
-		}
 	} else {
 		/* No bitmap, OK to set a location */
 		long long offset;
 
-		if (strncmp(buf, "none", 4) == 0)
-			/* nothing to be done */;
-		else if (strncmp(buf, "file:", 5) == 0) {
-			/* Not supported yet */
-			rv = -EINVAL;
-			goto out;
-		} else {
+		if (strncmp(buf, "none", 4) != 0) {
 			if (buf[0] == '+')
 				rv = kstrtoll(buf+1, 10, &offset);
 			else
@@ -2864,10 +2650,9 @@  static ssize_t metadata_show(struct mddev *mddev, char *page)
 
 static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len)
 {
-	if (mddev->bitmap ||
-	    mddev->bitmap_info.file ||
-	    mddev->bitmap_info.offset)
+	if (mddev->bitmap || mddev->bitmap_info.offset)
 		return -EBUSY;
+
 	if (strncmp(buf, "external", 8) == 0)
 		mddev->bitmap_info.external = 1;
 	else if ((strncmp(buf, "internal", 8) == 0) ||
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 662e6fc141a7..4b386954f5f5 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -67,7 +67,6 @@  struct md_bitmap_stats {
 	unsigned long	file_pages;
 	unsigned long	sync_size;
 	unsigned long	pages;
-	struct file	*file;
 };
 
 struct bitmap_operations {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 35c2e1e761aa..03f2a9fafea2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1155,7 +1155,7 @@  struct super_type  {
  */
 int md_check_no_bitmap(struct mddev *mddev)
 {
-	if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
+	if (!mddev->bitmap_info.offset)
 		return 0;
 	pr_warn("%s: bitmaps are not supported for %s\n",
 		mdname(mddev), mddev->pers->name);
@@ -1349,8 +1349,7 @@  static int super_90_validate(struct mddev *mddev, struct md_rdev *freshest, stru
 
 		mddev->max_disks = MD_SB_DISKS;
 
-		if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
-		    mddev->bitmap_info.file == NULL) {
+		if (sb->state & (1<<MD_SB_BITMAP_PRESENT)) {
 			mddev->bitmap_info.offset =
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
@@ -1476,7 +1475,7 @@  static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
 	sb->layout = mddev->layout;
 	sb->chunk_size = mddev->chunk_sectors << 9;
 
-	if (mddev->bitmap && mddev->bitmap_info.file == NULL)
+	if (mddev->bitmap)
 		sb->state |= (1<<MD_SB_BITMAP_PRESENT);
 
 	sb->disks[0].state = (1<<MD_DISK_REMOVED);
@@ -1824,8 +1823,7 @@  static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
 
 		mddev->max_disks =  (4096-256)/2;
 
-		if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
-		    mddev->bitmap_info.file == NULL) {
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
 			mddev->bitmap_info.offset =
 				(__s32)le32_to_cpu(sb->bitmap_offset);
 			/* Metadata doesn't record how much space is available.
@@ -2030,7 +2028,7 @@  static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	sb->data_offset = cpu_to_le64(rdev->data_offset);
 	sb->data_size = cpu_to_le64(rdev->sectors);
 
-	if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
+	if (mddev->bitmap) {
 		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
 		sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
 	}
@@ -2227,6 +2225,10 @@  static int
 super_1_allow_new_offset(struct md_rdev *rdev,
 			 unsigned long long new_offset)
 {
+	struct mddev *mddev = rdev->mddev;
+	struct md_bitmap_stats stats;
+	int err;
+
 	/* All necessary checks on new >= old have been done */
 	if (new_offset >= rdev->data_offset)
 		return 1;
@@ -2245,21 +2247,12 @@  super_1_allow_new_offset(struct md_rdev *rdev,
 	if (rdev->sb_start + (32+4)*2 > new_offset)
 		return 0;
 
-	if (!rdev->mddev->bitmap_info.file) {
-		struct mddev *mddev = rdev->mddev;
-		struct md_bitmap_stats stats;
-		int err;
-
-		err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
-		if (!err && rdev->sb_start + mddev->bitmap_info.offset +
-		    stats.file_pages * (PAGE_SIZE >> 9) > new_offset)
-			return 0;
-	}
-
-	if (rdev->badblocks.sector + rdev->badblocks.size > new_offset)
-		return 0;
+	err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
+	if (err)
+		return 1;
 
-	return 1;
+	return rdev->sb_start + mddev->bitmap_info.offset +
+		stats.file_pages * (PAGE_SIZE >> 9) <= new_offset;
 }
 
 static struct super_type super_types[] = {
@@ -6150,8 +6143,7 @@  int md_run(struct mddev *mddev)
 			(unsigned long long)pers->size(mddev, 0, 0) / 2);
 		err = -EINVAL;
 	}
-	if (err == 0 && pers->sync_request &&
-	    (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
+	if (err == 0 && pers->sync_request && mddev->bitmap_info.offset) {
 		err = mddev->bitmap_ops->create(mddev, -1);
 		if (err)
 			pr_warn("%s: failed to create bitmap (%d)\n",
@@ -6563,17 +6555,8 @@  static int do_md_stop(struct mddev *mddev, int mode)
 	if (mode == 0) {
 		pr_info("md: %s stopped.\n", mdname(mddev));
 
-		if (mddev->bitmap_info.file) {
-			struct file *f = mddev->bitmap_info.file;
-			spin_lock(&mddev->lock);
-			mddev->bitmap_info.file = NULL;
-			spin_unlock(&mddev->lock);
-			fput(f);
-		}
 		mddev->bitmap_info.offset = 0;
-
 		export_array(mddev);
-
 		md_clean(mddev);
 		if (mddev->hold_active == UNTIL_STOP)
 			mddev->hold_active = 0;
@@ -6767,38 +6750,6 @@  static int get_array_info(struct mddev *mddev, void __user *arg)
 	return 0;
 }
 
-static int get_bitmap_file(struct mddev *mddev, void __user * arg)
-{
-	mdu_bitmap_file_t *file = NULL; /* too big for stack allocation */
-	char *ptr;
-	int err;
-
-	file = kzalloc(sizeof(*file), GFP_NOIO);
-	if (!file)
-		return -ENOMEM;
-
-	err = 0;
-	spin_lock(&mddev->lock);
-	/* bitmap enabled */
-	if (mddev->bitmap_info.file) {
-		ptr = file_path(mddev->bitmap_info.file, file->pathname,
-				sizeof(file->pathname));
-		if (IS_ERR(ptr))
-			err = PTR_ERR(ptr);
-		else
-			memmove(file->pathname, ptr,
-				sizeof(file->pathname)-(ptr-file->pathname));
-	}
-	spin_unlock(&mddev->lock);
-
-	if (err == 0 &&
-	    copy_to_user(arg, file, sizeof(*file)))
-		err = -EFAULT;
-
-	kfree(file);
-	return err;
-}
-
 static int get_disk_info(struct mddev *mddev, void __user * arg)
 {
 	mdu_disk_info_t info;
@@ -7153,92 +7104,6 @@  static int hot_add_disk(struct mddev *mddev, dev_t dev)
 	return err;
 }
 
-static int set_bitmap_file(struct mddev *mddev, int fd)
-{
-	int err = 0;
-
-	if (mddev->pers) {
-		if (!mddev->pers->quiesce || !mddev->thread)
-			return -EBUSY;
-		if (mddev->recovery || mddev->sync_thread)
-			return -EBUSY;
-		/* we should be able to change the bitmap.. */
-	}
-
-	if (fd >= 0) {
-		struct inode *inode;
-		struct file *f;
-
-		if (mddev->bitmap || mddev->bitmap_info.file)
-			return -EEXIST; /* cannot add when bitmap is present */
-
-		if (!IS_ENABLED(CONFIG_MD_BITMAP_FILE)) {
-			pr_warn("%s: bitmap files not supported by this kernel\n",
-				mdname(mddev));
-			return -EINVAL;
-		}
-		pr_warn("%s: using deprecated bitmap file support\n",
-			mdname(mddev));
-
-		f = fget(fd);
-
-		if (f == NULL) {
-			pr_warn("%s: error: failed to get bitmap file\n",
-				mdname(mddev));
-			return -EBADF;
-		}
-
-		inode = f->f_mapping->host;
-		if (!S_ISREG(inode->i_mode)) {
-			pr_warn("%s: error: bitmap file must be a regular file\n",
-				mdname(mddev));
-			err = -EBADF;
-		} else if (!(f->f_mode & FMODE_WRITE)) {
-			pr_warn("%s: error: bitmap file must open for write\n",
-				mdname(mddev));
-			err = -EBADF;
-		} else if (atomic_read(&inode->i_writecount) != 1) {
-			pr_warn("%s: error: bitmap file is already in use\n",
-				mdname(mddev));
-			err = -EBUSY;
-		}
-		if (err) {
-			fput(f);
-			return err;
-		}
-		mddev->bitmap_info.file = f;
-		mddev->bitmap_info.offset = 0; /* file overrides offset */
-	} else if (mddev->bitmap == NULL)
-		return -ENOENT; /* cannot remove what isn't there */
-	err = 0;
-	if (mddev->pers) {
-		if (fd >= 0) {
-			err = mddev->bitmap_ops->create(mddev, -1);
-			if (!err)
-				err = mddev->bitmap_ops->load(mddev);
-
-			if (err) {
-				mddev->bitmap_ops->destroy(mddev);
-				fd = -1;
-			}
-		} else if (fd < 0) {
-			mddev->bitmap_ops->destroy(mddev);
-		}
-	}
-
-	if (fd < 0) {
-		struct file *f = mddev->bitmap_info.file;
-		if (f) {
-			spin_lock(&mddev->lock);
-			mddev->bitmap_info.file = NULL;
-			spin_unlock(&mddev->lock);
-			fput(f);
-		}
-	}
-
-	return err;
-}
-
 /*
  * md_set_array_info is used two different ways
  * The original usage is when creating a new array.
@@ -7520,11 +7385,6 @@  static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 			if (rv)
 				goto err;
 
-			if (stats.file) {
-				rv = -EINVAL;
-				goto err;
-			}
-
 			if (mddev->bitmap_info.nodes) {
 				/* hold PW on all the bitmap lock */
 				if (md_cluster_ops->lock_all_bitmaps(mddev) <= 0) {
@@ -7589,18 +7449,19 @@  static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 static inline int md_ioctl_valid(unsigned int cmd)
 {
 	switch (cmd) {
+	case GET_BITMAP_FILE:
+	case SET_BITMAP_FILE:
+		return -EOPNOTSUPP;
 	case GET_ARRAY_INFO:
 	case GET_DISK_INFO:
 	case RAID_VERSION:
 		return 0;
 	case ADD_NEW_DISK:
-	case GET_BITMAP_FILE:
 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
 	case RESTART_ARRAY_RW:
 	case RUN_ARRAY:
 	case SET_ARRAY_INFO:
-	case SET_BITMAP_FILE:
 	case SET_DISK_FAULTY:
 	case STOP_ARRAY:
 	case STOP_ARRAY_RO:
@@ -7619,7 +7480,6 @@  static bool md_ioctl_need_suspend(unsigned int cmd)
 	case ADD_NEW_DISK:
 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
-	case SET_BITMAP_FILE:
 	case SET_ARRAY_INFO:
 		return true;
 	default:
@@ -7699,9 +7559,6 @@  static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 
 	case SET_DISK_FAULTY:
 		return set_disk_faulty(mddev, new_decode_dev(arg));
-
-	case GET_BITMAP_FILE:
-		return get_bitmap_file(mddev, argp);
 	}
 
 	if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
@@ -7734,10 +7591,8 @@  static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	 */
-	/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
-	 * RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
+	/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY
+	 * and RUN_ARRAY are allowed */
-	if ((!mddev->raid_disks && !mddev->external)
-	    && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
-	    && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
-	    && cmd != GET_BITMAP_FILE) {
+	if (!mddev->raid_disks && !mddev->external && cmd != ADD_NEW_DISK &&
+	    cmd != STOP_ARRAY && cmd != RUN_ARRAY) {
 		err = -ENODEV;
 		goto unlock;
 	}
@@ -7833,10 +7688,6 @@  static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		err = do_md_run(mddev);
 		goto unlock;
 
-	case SET_BITMAP_FILE:
-		err = set_bitmap_file(mddev, (int)arg);
-		goto unlock;
-
 	default:
 		err = -EINVAL;
 		goto unlock;
@@ -7855,6 +7706,7 @@  static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 		clear_bit(MD_CLOSING, &mddev->flags);
 	return err;
 }
+
 #ifdef CONFIG_COMPAT
 static int md_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
 		    unsigned int cmd, unsigned long arg)
@@ -8328,11 +8180,6 @@  static void md_bitmap_status(struct seq_file *seq, struct mddev *mddev)
 		   chunk_kb ? chunk_kb : mddev->bitmap_info.chunksize,
 		   chunk_kb ? "KB" : "B");
 
-	if (stats.file) {
-		seq_puts(seq, ", file: ");
-		seq_file_path(seq, stats.file, " \t\n");
-	}
-
 	seq_putc(seq, '\n');
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4ba93af36126..bae257bc630c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -360,6 +360,34 @@  enum {
 	MD_RESYNC_ACTIVE = 3,
 };
 
+struct bitmap_info {
+	/*
+	 * offset from superblock of start of bitmap. May be negative, but
+	 * not '0'. For external metadata, offset from start of device.
+	 */
+	loff_t			offset;
+	/* space available at this offset */
+	unsigned long		space;
+	/*
+	 * this is the offset to use when hot-adding a bitmap.  It should
+	 * eventually be settable by sysfs.
+	 */
+	loff_t			default_offset;
+	/* space available at default offset */
+	unsigned long		default_space;
+	struct mutex		mutex;
+	unsigned long		chunksize;
+	/* how many jiffies between updates? */
+	unsigned long		daemon_sleep;
+	/* write-behind mode */
+	unsigned long		max_write_behind;
+	int			external;
+	/* Maximum number of nodes in the cluster */
+	int			nodes;
+	/* Name of the cluster */
+	char                    cluster_name[64];
+};
+
 struct mddev {
 	void				*private;
 	struct md_personality		*pers;
@@ -519,7 +547,6 @@  struct mddev {
 	 *   in_sync - and related safemode and MD_CHANGE changes
 	 *   pers (also protected by reconfig_mutex and pending IO).
 	 *   clearing ->bitmap
-	 *   clearing ->bitmap_info.file
 	 *   changing ->resync_{min,max}
 	 *   setting MD_RECOVERY_RUNNING (which interacts with resync_{min,max})
 	 */
@@ -537,29 +564,7 @@  struct mddev {
 
 	void				*bitmap; /* the bitmap for the device */
 	struct bitmap_operations	*bitmap_ops;
-	struct {
-		struct file		*file; /* the bitmap file */
-		loff_t			offset; /* offset from superblock of
-						 * start of bitmap. May be
-						 * negative, but not '0'
-						 * For external metadata, offset
-						 * from start of device.
-						 */
-		unsigned long		space; /* space available at this offset */
-		loff_t			default_offset; /* this is the offset to use when
-							 * hot-adding a bitmap.  It should
-							 * eventually be settable by sysfs.
-							 */
-		unsigned long		default_space; /* space available at
-							* default offset */
-		struct mutex		mutex;
-		unsigned long		chunksize;
-		unsigned long		daemon_sleep; /* how many jiffies between updates? */
-		unsigned long		max_write_behind; /* write-behind mode */
-		int			external;
-		int			nodes; /* Maximum number of nodes in the cluster */
-		char                    cluster_name[64]; /* Name of the cluster */
-	} bitmap_info;
+	struct bitmap_info		bitmap_info;
 
 	atomic_t			max_corr_read_errors; /* max read retries */
 	struct list_head		all_mddevs;
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 37c4da5311ca..6a1c8d6e1849 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1332,7 +1332,7 @@  int ppl_init_log(struct r5conf *conf)
 		return -EINVAL;
 	}
 
-	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
+	if (mddev->bitmap_info.offset) {
 		pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
 			mdname(mddev));
 		return -EINVAL;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f5ac81dd21b2..296501838a60 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7811,7 +7811,7 @@  static int raid5_run(struct mddev *mddev)
 	}
 
 	if ((test_bit(MD_HAS_JOURNAL, &mddev->flags) || journal_dev) &&
-	    (mddev->bitmap_info.offset || mddev->bitmap_info.file)) {
+	    mddev->bitmap_info.offset) {
 		pr_notice("md/raid:%s: array cannot have both journal and bitmap\n",
 			  mdname(mddev));
 		return -EINVAL;