Message ID | 20241107125911.311347-1-yukuai1@huaweicloud.com (mailing list archive)
---|---
State | Changes Requested
Series | [md-6.13] md: remove bitmap file support
Context | Check | Description |
---|---|---|
mdraidci/vmtest-md-6_13-PR | success | PR summary |
mdraidci/vmtest-md-6_13-VM_Test-0 | success | Logs for per-patch-testing |
On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The bitmap file has been marked as deprecated for more than a year now,
> let's remove it, and we don't need to care about this case in the new
> bitmap.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

What happens when an old array with bitmap file boots into a kernel
without bitmap file support?

Thanks,
Song
Hi,

On 2024/11/08 7:41, Song Liu wrote:
> On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The bitmap file has been marked as deprecated for more than a year now,
>> let's remove it, and we don't need to care about this case in the new
>> bitmap.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>
> What happens when an old array with bitmap file boots into a kernel
> without bitmap file support?

If mdadm is used with bitmap file support, then the kernel will just
ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
an error message?

And if mdadm is updated, reassembly will fail.

Thanks,
Kuai
On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/11/08 7:41, Song Liu wrote:
> > [...]
> > What happens when an old array with bitmap file boots into a kernel
> > without bitmap file support?
>
> If mdadm is used with bitmap file support, then the kernel will just
> ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
> an error message?

Yes, we should print some error message before assembling the array.

> And if mdadm is updated, reassembly will fail.

I think we should ship this with 6.14 (not 6.13), so that we have more
time to test different combinations of old/new mdadm and kernel. WDYT?

Thanks,
Song
Hi,

On 2024/11/08 9:28, Song Liu wrote:
> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> [...]
>> If mdadm is used with bitmap file support, then the kernel will just
>> ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
>> an error message?
>
> Yes, we should print some error message before assembling the array.

OK

>> And if mdadm is updated, reassembly will fail.
>
> I think we should ship this with 6.14 (not 6.13), so that we have more
> time to test different combinations of old/new mdadm and kernel. WDYT?

Agreed!

Thanks,
Kuai
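(For arrays still using a bitmap file, the upgrade path implied above is to
move off the file before booting a kernel without support. A minimal sketch
using standard mdadm options; /dev/md0 is a placeholder for the array, and
the array briefly runs with no bitmap at all between the two steps:)

# check whether the array currently uses a bitmap file
mdadm --detail /dev/md0 | grep -i bitmap

# drop the bitmap file, then switch to an internal bitmap
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal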
On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:

> I think we should ship this with 6.14 (not 6.13), so that we have more
> time to test different combinations of old/new mdadm and kernel. WDYT?

I'm not sure if the bitmap performance fixes are already included, but
if not, please include those too. The internal bitmap kills performance,
and the external bitmap was a workaround for that issue.
Hi,

On 2024/11/08 13:15, Dragan Milivojević wrote:
> On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:
>
>> I think we should ship this with 6.14 (not 6.13), so that we have more
>> time to test different combinations of old/new mdadm and kernel. WDYT?
>
> I'm not sure if the bitmap performance fixes are already included, but
> if not, please include those too. The internal bitmap kills performance,
> and the external bitmap was a workaround for that issue.

I don't think an external bitmap can work around the performance
degradation problem, because the global lock for the bitmap is the one
to blame for this; it's the same for an external or internal bitmap.

Do you know whether anyone is using an external bitmap in the real
world? And are there numbers for performance? If so, we'll have to
consider keeping it until the new lockless bitmap is ready.

Thanks,
Kuai
On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> I don't think an external bitmap can work around the performance
> degradation problem, because the global lock for the bitmap is the one
> to blame for this; it's the same for an external or internal bitmap.

Not according to my tests:

5 disk RAID5, 64K chunk

Test                    BW          IOPS
bitmap internal 64M     700KiB/s    174
bitmap internal 128M    702KiB/s    175
bitmap internal 512M    1142KiB/s   285
bitmap internal 1024M   40.4MiB/s   10.3k
bitmap internal 2G      66.5MiB/s   17.0k
bitmap external 64M     67.8MiB/s   17.3k
bitmap external 1024M   76.5MiB/s   19.6k
bitmap none             80.6MiB/s   20.6k
Single disk 1K          54.1MiB/s   55.4k
Single disk 4K          269MiB/s    68.8k

Full test logs with system details at: pastebin.com/raw/TK4vWjQu

> Do you know whether anyone is using an external bitmap in the real
> world? And are there numbers for performance? If so, we'll have to
> consider keeping it until the new lockless bitmap is ready.

Well, I am, and it's a royal pain, but there isn't much of an
alternative.
Hi,

On 2024/11/09 6:19, Dragan Milivojević wrote:
> On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
>> I don't think an external bitmap can work around the performance
>> degradation problem, because the global lock for the bitmap is the one
>> to blame for this; it's the same for an external or internal bitmap.
>
> Not according to my tests:
>
> 5 disk RAID5, 64K chunk
>
> Test                    BW          IOPS
> bitmap internal 64M     700KiB/s    174
> bitmap internal 128M    702KiB/s    175
> bitmap internal 512M    1142KiB/s   285
> bitmap internal 1024M   40.4MiB/s   10.3k
> bitmap internal 2G      66.5MiB/s   17.0k
> bitmap external 64M     67.8MiB/s   17.3k
> bitmap external 1024M   76.5MiB/s   19.6k

This is not what I expected. Can you give the test procedure in detail,
including the test machine, how the array was created, and the test
scripts?

> bitmap none             80.6MiB/s   20.6k
> Single disk 1K          54.1MiB/s   55.4k
> Single disk 4K          269MiB/s    68.8k
>
> Full test logs with system details at: pastebin.com/raw/TK4vWjQu
>
>> Do you know whether anyone is using an external bitmap in the real
>> world? And are there numbers for performance? If so, we'll have to
>> consider keeping it until the new lockless bitmap is ready.
>
> Well, I am, and it's a royal pain, but there isn't much of an
> alternative.

The bitmap file will be removed; the way it's implemented is
problematic. If you have plans to upgrade the kernel to v6.13+, I can
keep it for now, until the new lockless bitmap is ready.

Thanks,
Kuai
On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> This is not what I expected. Can you give the test procedure in detail,
> including the test machine, how the array was created, and the test
> scripts?

The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest is
in the linked pastebin. Let me know if you need more info.

BTW, do you guys do performance tests? All of the raid levels are
practically broken performance-wise. None of them scale. Looking
forward to seeing those patches from Shushu Yi included; does anyone
know when those will be shipped?

> The bitmap file will be removed; the way it's implemented is
> problematic. If you have plans to upgrade the kernel to v6.13+, I can
> keep it for now, until the new lockless bitmap is ready.

I usually use distro kernels, so no such plan for now. I just thought
it would be useful to ship both at the same time, to soften the blow
for those using external bitmaps.
Hi,

On 2024/11/09 10:15, Dragan Milivojević wrote:
> On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> This is not what I expected. Can you give the test procedure in detail,
>> including the test machine, how the array was created, and the test
>> scripts?
>
> The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest is
> in the linked pastebin. Let me know if you need more info.

Yes, as I said, please show me how you create the array and your test
script. I must know what you are testing, e.g. single-threaded or
high-concurrency. For example, your result shows bitmap none close to
bitmap external, which is inconsistent with our previous results. I can
only guess that you're testing single-threaded.

BTW, it would be great if you could provide some perf results for the
internal bitmap in your case; that would show us directly where the
bottleneck is.

> BTW, do you guys do performance tests? All of the raid levels are

We do, but we never test the external bitmap.

+CC Paul

Hi, do you have time to add the external bitmap to our tests?

Thanks,
Kuai
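(A minimal sketch of the kind of run and profile being asked for here: a
higher-concurrency variant of the same 4k randwrite test, plus a perf
profile captured while it runs. The iodepth/numjobs values and the
Raid5-hc job name are illustrative, not from the thread; the device path
matches the tests below:)

# high-concurrency variant of the same randwrite workload
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=8 --runtime=60 \
    --time_based --group_reporting --name=Raid5-hc

# system-wide profile while the test runs, to see where time is spent
perf record -a -g -- sleep 30
perf report --sort symbol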
On 11/11/2024 03:04, Yu Kuai wrote:

> Yes, as I said, please show me how you create the array and your test
> script. I must know what you are testing, e.g. single-threaded or
> high-concurrency. For example, your result shows bitmap none close to
> bitmap external, which is inconsistent with our previous results. I can
> only guess that you're testing single-threaded.

All of that is included in the previously linked pastebin. I will
include the contents of that pastebin at the end of this email if that
helps. Every test includes the mdadm create line, disk settings, md
settings, the fio test line used, the results, and the typical iostat
output during the test. I hope that is sufficient.

> BTW, it would be great if you could provide some perf results for the
> internal bitmap in your case; that would show us directly where the
> bottleneck is.

Not right now. This server is in production and I'm not sure if I will
be able to get it to an idle state or to find the time to do it due to
other projects.

>> BTW, do you guys do performance tests? All of the raid levels are
>
> We do, but we never test the external bitmap.

I wasn't referring to that, more to the fact that there is a huge
difference in performance between no bitmap and a bitmap, and that raid
(even "simple" levels like 0) does not scale with real-world workloads.

The contents of that pastebin, hopefully my email client won't mess up
the formatting:


5 disk RAID5, 64K chunk

Summary

Test                    BW          IOPS
bitmap internal 64M     700KiB/s    174
bitmap internal 128M    702KiB/s    175
bitmap internal 512M    1142KiB/s   285
bitmap internal 1024M   40.4MiB/s   10.3k
bitmap internal 2G      66.5MiB/s   17.0k
bitmap external 64M     67.8MiB/s   17.3k
bitmap external 1024M   76.5MiB/s   19.6k
bitmap none             80.6MiB/s   20.6k
Single disk 1K          54.1MiB/s   55.4k
Single disk 4K          269MiB/s    68.8k


AlmaLinux release 9.4 (Seafoam Ocelot)
5.14.0-427.20.1.el9_4

nvme list
Node          Generic     SN                Model                                  Namespace  Usage                  Format       FW Rev
/dev/nvme0n1  /dev/ng0n1  1460A0F9TSTJ      Dell DC NVMe CD8 U.2 960GB             0x1        122.33 GB / 960.20 GB  512 B + 0 B  2.0.0
/dev/nvme1n1  /dev/ng1n1  S6WRNJ0WA04045P   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme2n1  /dev/ng2n1  S6WRNJ0WA04048B   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme3n1  /dev/ng3n1  S6WRNJ0W810396H   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme4n1  /dev/ng4n1  S6WRNJ0W808149N   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme5n1  /dev/ng5n1  S6WRNJ0WA04043Z   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme6n1  /dev/ng6n1  PHBT909504AH016N  INTEL MEMPEK1J016GAL                   0x1        14.40 GB / 14.40 GB    512 B + 0 B  K4110420
/dev/nvme7n1  /dev/ng7n1  S6WRNJ0WA04036R   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme8n1  /dev/ng8n1  S6WRNJ0WA04050H   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7


bitmap internal 64M
================================================================
mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=4718: Sun Jun 30 02:18:30 2024
  write: IOPS=174, BW=700KiB/s (717kB/s)(41.0MiB/60005msec); 0 zone resets
    slat (usec): min=4, max=18062, avg=11.28, stdev=176.21
    clat (usec): min=46, max=13308, avg=5700.08, stdev=1194.59
    lat (usec): min=53, max=22717, avg=5711.36, stdev=1206.03
    clat percentiles (usec):
     | 1.00th=[ 51], 5.00th=[ 5800], 10.00th=[ 5800], 20.00th=[ 5866],
     | 30.00th=[ 5866], 40.00th=[ 5866], 50.00th=[ 5866], 60.00th=[ 5932],
     | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5932], 95.00th=[ 5997],
     | 99.00th=[ 6194], 99.50th=[ 8586], 99.90th=[10290], 99.95th=[13042],
     | 99.99th=[13042]
   bw ( KiB/s): min= 608, max= 752, per=100.00%, avg=700.03, stdev=20.93, samples=119
   iops        : min= 152, max= 188, avg=175.01, stdev= 5.23, samples=119
  lat (usec)   : 50=0.68%, 100=3.23%
  lat (msec)   : 10=95.99%, 20=0.10%
  cpu          : usr=0.08%, sys=0.24%, ctx=10503, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10499,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=700KiB/s (717kB/s), 700KiB/s-700KiB/s (717kB/s-717kB/s), io=41.0MiB (43.0MB), run=60005-60005msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.07    0.00   0.00 99.93

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     175.47 0.69  0.00   0.00  5.68    4.00     0.00   0.00    1.00   100.00
nvme1n1  69.40  0.27   0.00   0.00  0.01    4.00     237.80 0.93  0.00   0.00  3.55    4.00     168.47 0.81    0.98   95.59
nvme2n1  69.20  0.27   0.00   0.00  0.01    4.00     237.60 0.93  0.00   0.00  3.55    4.00     168.47 0.81    0.98   95.61
nvme3n1  72.20  0.28   0.00   0.00  0.01    4.00     240.60 0.94  0.00   0.00  3.51    4.00     168.47 0.83    0.98   95.29
nvme4n1  68.07  0.27   0.00   0.00  0.02    4.00     236.53 0.92  0.00   0.00  3.57    4.00     168.47 0.81    0.98   95.65
nvme5n1  72.07  0.28   0.00   0.00  0.02    4.00     240.53 0.94  0.00   0.00  3.52    4.00     168.47 0.83    0.99   95.31

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 77fa1a1b:2f0dd646:adc85c8e:985513a8
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 1517 dirty (5.1%)


bitmap internal 128M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=128M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6283: Sun Jun 30 02:49:06 2024
  write: IOPS=175, BW=702KiB/s (719kB/s)(41.1MiB/60002msec); 0 zone resets
    slat (usec): min=8, max=18200, avg=16.06, stdev=177.21
    clat (usec): min=61, max=20048, avg=5675.78, stdev=1968.88
    lat (usec): min=74, max=22975, avg=5691.84, stdev=1976.14
    clat percentiles (usec):
     | 1.00th=[ 68], 5.00th=[ 73], 10.00th=[ 5866], 20.00th=[ 5932],
     | 30.00th=[ 5932], 40.00th=[ 5932], 50.00th=[ 5932], 60.00th=[ 5997],
     | 70.00th=[ 5997], 80.00th=[ 5997], 90.00th=[ 5997], 95.00th=[ 6063],
     | 99.00th=[14615], 99.50th=[15008], 99.90th=[16188], 99.95th=[16319],
     | 99.99th=[16319]
   bw ( KiB/s): min= 384, max= 816, per=99.97%, avg=702.12, stdev=72.52, samples=119
   iops        : min= 96, max= 204, avg=175.53, stdev=18.13, samples=119
  lat (usec)   : 100=7.62%, 250=0.01%
  lat (msec)   : 10=90.80%, 20=1.56%, 50=0.01%
  cpu          : usr=0.11%, sys=0.34%, ctx=10539, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10534,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=702KiB/s (719kB/s), 702KiB/s-702KiB/s (719kB/s-719kB/s), io=41.1MiB (43.1MB), run=60002-60002msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.08    0.00   0.00 99.92

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     173.73 0.68  0.00   0.00  5.73    4.00     0.00   0.00    1.00   99.99
nvme1n1  65.87  0.26   0.00   0.00  0.01    4.00     226.07 0.65  0.00   0.00  3.60    2.94     160.20 0.81    0.94   92.46
nvme2n1  71.33  0.28   0.00   0.00  0.02    4.00     231.53 0.67  0.00   0.00  3.50    2.96     160.27 0.84    0.95   91.79
nvme3n1  68.60  0.27   0.00   0.00  0.02    4.00     228.80 0.66  0.00   0.00  3.68    2.95     160.27 0.93    0.99   94.37
nvme4n1  68.87  0.27   0.00   0.00  0.02    4.00     229.07 0.66  0.00   0.00  3.52    2.95     160.20 0.81    0.94   91.59
nvme5n1  72.80  0.28   0.00   0.00  0.02    4.00     233.00 0.68  0.00   0.00  3.53    2.97     160.27 0.87    0.96   92.29

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 93fdcd4b:ae61a1f8:4d809242:2cd4a4c7
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 128 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 14904 bits (chunks), 1617 dirty (10.8%)


bitmap internal 512M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=512M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1232KiB/s][w=308 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6661: Sun Jun 30 02:58:11 2024
  write: IOPS=285, BW=1142KiB/s (1169kB/s)(66.9MiB/60006msec); 0 zone resets
    slat (usec): min=4, max=18130, avg=10.80, stdev=138.54
    clat (usec): min=42, max=13261, avg=3490.08, stdev=2945.95
    lat (usec): min=50, max=22827, avg=3500.88, stdev=2949.63
    clat percentiles (usec):
     | 1.00th=[ 49], 5.00th=[ 51], 10.00th=[ 52], 20.00th=[ 55],
     | 30.00th=[ 58], 40.00th=[ 72], 50.00th=[ 5866], 60.00th=[ 5932],
     | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5997], 95.00th=[ 5997],
     | 99.00th=[ 6128], 99.50th=[ 8586], 99.90th=[ 9896], 99.95th=[13042],
     | 99.99th=[13042]
   bw ( KiB/s): min= 600, max= 1648, per=99.68%, avg=1138.89, stdev=188.44, samples=119
   iops        : min= 150, max= 412, avg=284.72, stdev=47.11, samples=119
  lat (usec)   : 50=3.41%, 100=38.62%, 250=0.04%, 500=0.03%
  lat (msec)   : 10=57.83%, 20=0.07%
  cpu          : usr=0.09%, sys=0.40%, ctx=17130, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,17127,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1142KiB/s (1169kB/s), 1142KiB/s-1142KiB/s (1169kB/s-1169kB/s), io=66.9MiB (70.2MB), run=60006-60006msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.10    0.00   0.00 99.90

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     307.13 1.20  0.00   0.00  3.24    4.00     0.00   0.00    1.00   100.00
nvme1n1  120.47 0.47   0.00   0.00  0.01    4.00     286.07 0.63  0.00   0.00  3.03    2.26     165.60 0.99    1.03   96.58
nvme2n1  123.87 0.48   0.00   0.00  0.01    4.00     289.47 0.65  0.00   0.00  3.00    2.28     165.60 1.00    1.04   96.63
nvme3n1  120.87 0.47   0.00   0.00  0.01    4.00     286.47 0.63  0.00   0.00  3.02    2.27     165.60 1.00    1.03   96.39
nvme4n1  125.00 0.49   0.00   0.00  0.02    4.00     290.60 0.65  0.00   0.00  3.00    2.29     165.60 1.02    1.04   96.54
nvme5n1  124.07 0.48   0.00   0.00  0.02    4.00     289.67 0.65  0.00   0.00  3.01    2.28     165.60 1.03    1.04   96.59

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 17eadc76:a367542a:feb6e24e:d650576c
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 512 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 3726 bits (chunks), 1977 dirty (53.1%)


bitmap internal 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=51.0MiB/s][w=13.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7120: Sun Jun 30 03:08:12 2024
  write: IOPS=10.3k, BW=40.4MiB/s (42.4MB/s)(2425MiB/60001msec); 0 zone resets
    slat (usec): min=6, max=18135, avg= 8.93, stdev=23.41
    clat (usec): min=3, max=10459, avg=86.97, stdev=342.95
    lat (usec): min=63, max=22927, avg=95.90, stdev=344.33
    clat percentiles (usec):
     | 1.00th=[ 62], 5.00th=[ 63], 10.00th=[ 64], 20.00th=[ 65],
     | 30.00th=[ 65], 40.00th=[ 66], 50.00th=[ 67], 60.00th=[ 67],
     | 70.00th=[ 68], 80.00th=[ 69], 90.00th=[ 70], 95.00th=[ 74],
     | 99.00th=[ 133], 99.50th=[ 155], 99.90th=[ 5997], 99.95th=[ 5997],
     | 99.99th=[ 6063]
   bw ( KiB/s): min= 616, max=52968, per=99.80%, avg=41305.95, stdev=20465.79, samples=119
   iops        : min= 154, max=13242, avg=10326.47, stdev=5116.44, samples=119
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=98.64%, 250=1.00%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.33%, 20=0.01%
  cpu          : usr=1.89%, sys=12.74%, ctx=620837, majf=0, minf=170751
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,620822,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=2425MiB (2543MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.27    0.00   0.00 98.70

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18216.93 71.16 0.00   0.00  0.05    4.00     0.00 0.00    0.88   100.00
nvme1n1  7256.20 28.34  0.00   0.00  0.01    4.00     7256.27  28.34 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.71
nvme2n1  7302.53 28.53  0.00   0.00  0.01    4.00     7302.53  28.53 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.73
nvme3n1  7278.47 28.43  0.00   0.00  0.01    4.00     7278.53  28.43 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.57
nvme4n1  7303.93 28.53  0.00   0.00  0.01    4.00     7303.93  28.53 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.74
nvme5n1  7292.67 28.49  0.00   0.00  0.02    4.00     7292.60  28.49 0.00   0.00  0.02    4.00     0.00 0.00    0.22   99.69

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : a0c7ad14:50689e41:e065a166:4935a186
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 1 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)


bitmap internal 2G
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=2G /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=74.7MiB/s][w=19.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7696: Sun Jun 30 03:30:40 2024
  write: IOPS=17.0k, BW=66.5MiB/s (69.8MB/s)(3993MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=18094, avg= 4.79, stdev=17.94
    clat (usec): min=5, max=10352, avg=53.37, stdev=181.29
    lat (usec): min=41, max=22883, avg=58.16, stdev=182.72
    clat percentiles (usec):
     | 1.00th=[ 43], 5.00th=[ 44], 10.00th=[ 45], 20.00th=[ 46],
     | 30.00th=[ 46], 40.00th=[ 47], 50.00th=[ 47], 60.00th=[ 48],
     | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52],
     | 99.00th=[ 90], 99.50th=[ 126], 99.90th=[ 873], 99.95th=[ 5997],
     | 99.99th=[ 6063]
   bw ( KiB/s): min= 640, max=80168, per=99.91%, avg=68080.94, stdev=21547.29, samples=119
   iops        : min= 160, max=20042, avg=17020.24, stdev=5386.82, samples=119
  lat (usec)   : 10=0.01%, 50=92.06%, 100=7.10%, 250=0.73%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01%
  cpu          : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.15    0.00   0.00 98.81

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18836.40 73.58 0.00   0.00  0.05    4.00     0.00 0.00    0.87   100.00
nvme1n1  7505.27 29.32  0.00   0.00  0.01    4.00     7505.40  29.32 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.93
nvme2n1  7510.00 29.34  0.00   0.00  0.01    4.00     7510.07  29.34 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.90
nvme3n1  7561.40 29.54  0.00   0.00  0.01    4.00     7561.47  29.54 0.00   0.00  0.01    4.00     0.00 0.00    0.19   100.00
nvme4n1  7543.07 29.47  0.00   0.00  0.01    4.00     7543.07  29.47 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.91
nvme5n1  7552.73 29.50  0.00   0.00  0.01    4.00     7552.80  29.50 0.00   0.00  0.01    4.00     0.00 0.00    0.22   99.91

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 2 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 932 bits (chunks), 932 dirty (100.0%)


bitmap external 64M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024
  write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04
    clat (usec): min=3, max=8410, avg=50.79, stdev=27.03
    lat (usec): min=42, max=22140, avg=56.90, stdev=35.13
    clat percentiles (usec):
     | 1.00th=[ 41], 5.00th=[ 42], 10.00th=[ 44], 20.00th=[ 46],
     | 30.00th=[ 47], 40.00th=[ 48], 50.00th=[ 49], 60.00th=[ 50],
     | 70.00th=[ 51], 80.00th=[ 52], 90.00th=[ 56], 95.00th=[ 68],
     | 99.00th=[ 93], 99.50th=[ 124], 99.90th=[ 155], 99.95th=[ 237],
     | 99.99th=[ 1037]
   bw ( KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, stdev=7769.33, samples=119
   iops        : min= 9530, max=20644, avg=17350.76, stdev=1942.33, samples=119
  lat (usec)   : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.15    0.00   0.00 98.81

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18428.60 71.99 0.00   0.00  0.05    4.00     0.00 0.00    0.87   99.99
nvme1n1  7399.40 28.90  0.00   0.00  0.01    4.00     7399.47  28.90 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.73
nvme2n1  7361.20 28.75  0.00   0.00  0.01    4.00     7361.27  28.75 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.63
nvme3n1  7376.67 28.82  0.00   0.00  0.01    4.00     7376.73  28.82 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.63
nvme4n1  7367.27 28.78  0.00   0.00  0.01    4.00     7367.20  28.78 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.65
nvme5n1  7352.47 28.72  0.00   0.00  0.01    4.00     7352.67  28.72 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.73
nvme8n1  0.47    0.00   0.00   0.00  0.00    4.00     293.40   1.15  0.00   0.00  0.02    4.00     0.00 0.00    0.01   24.24

mdadm -X /bitmap/bitmap.bin
        Filename : /bitmap/bitmap.bin
           Magic : 6d746962
         Version : 4
            UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 29665 dirty (99.5%)


bitmap external 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024
  write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16
    clat (usec): min=22, max=3706, avg=46.37, stdev=20.38
    lat (usec): min=40, max=21951, avg=50.49, stdev=28.81
    clat percentiles (usec):
     | 1.00th=[ 40], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42],
     | 30.00th=[ 43], 40.00th=[ 44], 50.00th=[ 45], 60.00th=[ 47],
     | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52],
     | 99.00th=[ 86], 99.50th=[ 120], 99.90th=[ 157], 99.95th=[ 233],
     | 99.99th=[ 906]
   bw ( KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, stdev=5410.81, samples=119
   iops        : min=15404, max=21182, avg=19599.66, stdev=1352.70, samples=119
  lat (usec)   : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.03    0.00   0.00 98.93

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     20758.20 81.09 0.00   0.00  0.04    4.00     0.00 0.00    0.89   100.00
nvme1n1  8291.67 32.39  0.00   0.00  0.01    4.00     8291.73  32.39 0.00   0.00  0.01    4.00     0.00 0.00    0.22   99.87
nvme2n1  8270.93 32.31  0.00   0.00  0.01    4.00     8271.07  32.31 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.79
nvme3n1  8310.67 32.46  0.00   0.00  0.01    4.00     8310.80  32.46 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.83
nvme4n1  8300.67 32.42  0.00   0.00  0.01    4.00     8300.67  32.42 0.00   0.00  0.01    4.00     0.00 0.00    0.23   99.76
nvme5n1  8342.13 32.59  0.00   0.00  0.02    4.00     8342.13  32.59 0.00   0.00  0.01    4.00     0.00 0.00    0.25   99.85
nvme8n1  0.33    0.00   0.00   0.00  8.40    4.00     0.00     0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.33

mdadm -X /bitmap/bitmap.bin
        Filename : /bitmap/bitmap.bin
           Magic : 6d746962
         Version : 4
            UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 1 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)


bitmap none
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024
  write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46
    clat (usec): min=4, max=3694, avg=44.31, stdev=21.60
    lat (usec): min=39, max=13681, avg=47.81, stdev=24.98
    clat percentiles (usec):
     | 1.00th=[ 39], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 41],
     | 30.00th=[ 42], 40.00th=[ 43], 50.00th=[ 43], 60.00th=[ 44],
     | 70.00th=[ 45], 80.00th=[ 46], 90.00th=[ 48], 95.00th=[ 50],
     | 99.00th=[ 87], 99.50th=[ 117], 99.90th=[ 157], 99.95th=[ 229],
     | 99.99th=[ 963]
   bw ( KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, stdev=3696.94, samples=119
   iops        : min=18528, max=21678, avg=20621.59, stdev=924.23, samples=119
  lat (usec)   : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.06    0.00   0.00 98.91

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     20040.87 78.28 0.00   0.00  0.04    4.00     0.00 0.00    0.89   99.99
nvme1n1  8016.80 31.32  0.00   0.00  0.01    4.00     8016.93  31.32 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.68
nvme2n1  7983.20 31.18  0.00   0.00  0.01    4.00     7983.20  31.18 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.74
nvme3n1  8030.07 31.37  0.00   0.00  0.01    4.00     8030.20  31.37 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.62
nvme4n1  8016.40 31.31  0.00   0.00  0.01    4.00     8016.40  31.31 0.00   0.00  0.01    4.00     0.00 0.00    0.23   99.73
nvme5n1  8034.87 31.39  0.00   0.00  0.02    4.00     8035.00  31.39 0.00   0.00  0.01    4.00     0.00 0.00    0.24   99.71


single disk 1K RW
================================================================
fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single

Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024
  write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12
    clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57
    lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79
    clat percentiles (usec):
     | 1.00th=[ 15], 5.00th=[ 15], 10.00th=[ 15], 20.00th=[ 15],
     | 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 15],
     | 70.00th=[ 15], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16],
     | 99.00th=[ 18], 99.50th=[ 22], 99.90th=[ 32], 99.95th=[ 33],
     | 99.99th=[ 206]
   bw ( KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, stdev=561.60, samples=119
   iops        : min=51884, max=56778, avg=55394.44, stdev=561.62, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54%
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec

Disk stats (read/write):
  nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, util=99.71%

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    0.42    0.00   0.00 99.54

Device   r/s  rMB/s rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
nvme8n1  0.00 0.00  0.00   0.00  0.00    0.00     55496.93 54.20 0.00   0.00  0.01    1.00     0.00 0.00    0.75   100.00


single disk 4K RW
================================================================
blockdev --setra 256 /dev/nvme8n1

fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single

Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024
  write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone resets
    slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59
    clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99
    lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36
    clat percentiles (usec):
     | 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 12],
     | 30.00th=[ 12], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 12],
     | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 12], 95.00th=[ 12],
     | 99.00th=[ 14], 99.50th=[ 17], 99.90th=[ 28], 99.95th=[ 34],
     | 99.99th=[ 204]
   bw ( KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, stdev=1902.45, samples=119
   iops        : min=66018, max=69392, avg=68907.43, stdev=475.55, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38%
  lat (usec)   : 100=0.01%, 250=0.02%, 1000=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), io=15.8GiB (16.9GB), run=60001-60001msec

Disk stats (read/write):
  nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, util=99.89%

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.08  0.00    0.57    0.00   0.00 99.35

Device   r/s  rMB/s rrqm/s %rrqm r_await rareq-sz w/s      wMB/s  wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
nvme8n1  0.00 0.00  0.00   0.00  0.00    0.00     69041.33 269.69 0.00   0.00  0.01    4.00     0.00 0.00    0.68   100.00
Hi,

On 2024/11/11 19:59, Dragan Milivojević wrote:
> On 11/11/2024 03:04, Yu Kuai wrote:
>
>> Yes, as I said, please show me how you create the array and your test
>> script. I must know what you are testing, e.g. single-threaded or
>> high-concurrency. For example, your result shows bitmap none close to
>> bitmap external, which is inconsistent with our previous results. I
>> can only guess that you're testing single-threaded.
>
> All of that is included in the previously linked pastebin.

TBH, I don't know what that is. :(

> I will include the contents of that pastebin at the end of this email
> if that helps. Every test includes the mdadm create line, disk
> settings, md settings, the fio test line used, the results, and the
> typical iostat output during the test. I hope that is sufficient.
>
>> BTW, it would be great if you could provide some perf results for the
>> internal bitmap in your case; that would show us directly where the
>> bottleneck is.
>
> Not right now. This server is in production and I'm not sure if I will
> be able to get it to an idle state or to find the time to do it due to
> other projects.
>
>>> BTW, do you guys do performance tests? All of the raid levels are
>>
>> We do, but we never test the external bitmap.
>
> I wasn't referring to that, more to the fact that there is a huge
> difference in performance between no bitmap and a bitmap, and that
> raid (even "simple" levels like 0) does not scale with real-world
> workloads.

Yes, this is a known problem. The gap here is that I don't think an
external bitmap is much help, while your results disagree.

> [...]
> mdadm --verbose --create --assume-clean --bitmap=internal
> --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K
> --raid-devices=5 /dev/nvme{1..5}n1
>
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
>
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size

The array setup is fine. And the following external bitmap test is
using /bitmap/bitmap.bin; is the back-end storage of this file the same
as the test device?

> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1
> --group_reporting --time_based --name=Raid5

Then this is what I suspected: the above test is quite limited and
can't replace a real-world workload, 1 thread and 1 iodepth with 4k
randwrite.

I still can't believe your test result, and I can't figure out why the
internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
raid5 and used the same fio script to test (a sketch of that setup
follows below); the result is quite different from yours:

ram0:            981MiB/s
non-bitmap:      132MiB/s
internal-bitmap: 95.5MiB/s

> [...]
> bw ( KiB/s): min= 608, max= 752, per=100.00%, avg=700.03,
> stdev=20.93, samples=119

There is absolutely something wrong here; it doesn't make sense to me
that the internal bitmap is so slow. However, I have no idea until you
can provide the perf result.
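(A minimal sketch of the ramdisk setup mentioned above, assuming the brd
module; rd_size is in KiB, and the per-disk sizing and job name are
illustrative since the exact layout wasn't given in the thread:)

# create 5 ramdisks of 10GiB each (rd_size is in KiB)
modprobe brd rd_nr=5 rd_size=10485760

# same raid5 layout as the tests above, but on ramdisks
mdadm --create --assume-clean --bitmap=internal --bitmap-chunk=64M \
      /dev/md/raid5 --level=5 --chunk=64K --raid-devices=5 /dev/ram{0..4}

# same single-threaded 4k randwrite workload
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --numjobs=1 --runtime=60 \
    --time_based --group_reporting --name=Raid5-ram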
Thanks,
Kuai

> [...]
500=0.01% > lat (usec) : 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01% > cpu : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s > (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.15 0.00 0.00 98.81 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 18836.40 > 73.58 0.00 0.00 0.05 4.00 0.00 0.00 0.87 100.00 > nvme1n1 7505.27 29.32 0.00 0.00 0.01 4.00 7505.40 > 29.32 0.00 0.00 0.01 4.00 0.00 0.00 0.19 99.93 > nvme2n1 7510.00 29.34 0.00 0.00 0.01 4.00 7510.07 > 29.34 0.00 0.00 0.01 4.00 0.00 0.00 0.17 99.90 > nvme3n1 7561.40 29.54 0.00 0.00 0.01 4.00 7561.47 > 29.54 0.00 0.00 0.01 4.00 0.00 0.00 0.19 100.00 > nvme4n1 7543.07 29.47 0.00 0.00 0.01 4.00 7543.07 > 29.47 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.91 > nvme5n1 7552.73 29.50 0.00 0.00 0.01 4.00 7552.80 > 29.50 0.00 0.00 0.01 4.00 0.00 0.00 0.22 99.91 > > > > mdadm -X /dev/nvme1n1 > Filename : /dev/nvme1n1 > Magic : 6d746962 > Version : 4 > UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200 > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 2 GB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 932 bits (chunks), 932 dirty (100.0%) > > > > > > > > > > bitmap external 64M > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin > --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K > --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024 > write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04 > clat (usec): min=3, max=8410, avg=50.79, stdev=27.03 > lat (usec): min=42, max=22140, avg=56.90, stdev=35.13 > clat percentiles (usec): > | 1.00th=[ 41], 5.00th=[ 42], 10.00th=[ 44], 20.00th=[ 46], > | 30.00th=[ 47], 40.00th=[ 48], 50.00th=[ 49], 60.00th=[ 50], > | 70.00th=[ 51], 80.00th=[ 52], 90.00th=[ 56], 95.00th=[ 68], > | 99.00th=[ 93], 99.50th=[ 124], 99.90th=[ 155], 99.95th=[ 237], > | 99.99th=[ 1037] > bw ( KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, > stdev=7769.33, samples=119 > iops : min= 9530, max=20644, avg=17350.76, 
stdev=1942.33, > samples=119 > lat (usec) : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76% > lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% > cpu : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s > (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.15 0.00 0.00 98.81 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 18428.60 > 71.99 0.00 0.00 0.05 4.00 0.00 0.00 0.87 99.99 > nvme1n1 7399.40 28.90 0.00 0.00 0.01 4.00 7399.47 > 28.90 0.00 0.00 0.01 4.00 0.00 0.00 0.17 99.73 > nvme2n1 7361.20 28.75 0.00 0.00 0.01 4.00 7361.27 > 28.75 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.63 > nvme3n1 7376.67 28.82 0.00 0.00 0.01 4.00 7376.73 > 28.82 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.63 > nvme4n1 7367.27 28.78 0.00 0.00 0.01 4.00 7367.20 > 28.78 0.00 0.00 0.01 4.00 0.00 0.00 0.18 99.65 > nvme5n1 7352.47 28.72 0.00 0.00 0.01 4.00 7352.67 > 28.72 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.73 > nvme8n1 0.47 0.00 0.00 0.00 0.00 4.00 293.40 > 1.15 0.00 0.00 0.02 4.00 0.00 0.00 0.01 24.24 > > > > mdadm -X /bitmap/bitmap.bin > Filename : /bitmap/bitmap.bin > Magic : 6d746962 > Version : 4 > UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73 > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 64 MB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 29807 bits (chunks), 29665 dirty (99.5%) > > > > bitmap external 1024M > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin > --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K > --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024 > write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16 > clat (usec): min=22, max=3706, avg=46.37, stdev=20.38 > lat (usec): min=40, max=21951, avg=50.49, stdev=28.81 > clat percentiles (usec): > | 1.00th=[ 40], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42], > | 30.00th=[ 43], 40.00th=[ 44], 50.00th=[ 45], 60.00th=[ 47], > | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52], > | 99.00th=[ 86], 99.50th=[ 120], 99.90th=[ 
157], 99.95th=[ 233], > | 99.99th=[ 906] > bw ( KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, > stdev=5410.81, samples=119 > iops : min=15404, max=21182, avg=19599.66, stdev=1352.70, > samples=119 > lat (usec) : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01% > lat (usec) : 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01% > cpu : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s > (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.03 0.00 0.00 98.93 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 20758.20 > 81.09 0.00 0.00 0.04 4.00 0.00 0.00 0.89 100.00 > nvme1n1 8291.67 32.39 0.00 0.00 0.01 4.00 8291.73 > 32.39 0.00 0.00 0.01 4.00 0.00 0.00 0.22 99.87 > nvme2n1 8270.93 32.31 0.00 0.00 0.01 4.00 8271.07 > 32.31 0.00 0.00 0.01 4.00 0.00 0.00 0.19 99.79 > nvme3n1 8310.67 32.46 0.00 0.00 0.01 4.00 8310.80 > 32.46 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.83 > nvme4n1 8300.67 32.42 0.00 0.00 0.01 4.00 8300.67 > 32.42 0.00 0.00 0.01 4.00 0.00 0.00 0.23 99.76 > nvme5n1 8342.13 32.59 0.00 0.00 0.02 4.00 8342.13 > 32.59 0.00 0.00 0.01 4.00 0.00 0.00 0.25 99.85 > nvme8n1 0.33 0.00 0.00 0.00 8.40 4.00 0.00 > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 > > > mdadm -X /bitmap/bitmap.bin > Filename : /bitmap/bitmap.bin > Magic : 6d746962 > Version : 4 > UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 1 GB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 1863 bits (chunks), 1863 dirty (100.0%) > > > > bitmap none > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 > --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024 > write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46 > clat (usec): min=4, max=3694, avg=44.31, stdev=21.60 > lat (usec): min=39, max=13681, avg=47.81, stdev=24.98 > clat percentiles (usec): > | 1.00th=[ 39], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 41], > | 30.00th=[ 42], 40.00th=[ 43], 50.00th=[ 43], 60.00th=[ 44], > | 70.00th=[ 
45], 80.00th=[ 46], 90.00th=[ 48], 95.00th=[ 50], > | 99.00th=[ 87], 99.50th=[ 117], 99.90th=[ 157], 99.95th=[ 229], > | 99.99th=[ 963] > bw ( KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, > stdev=3696.94, samples=119 > iops : min=18528, max=21678, avg=20621.59, stdev=924.23, > samples=119 > lat (usec) : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01% > lat (usec) : 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01% > cpu : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s > (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec > > > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.06 0.00 0.00 98.91 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 20040.87 > 78.28 0.00 0.00 0.04 4.00 0.00 0.00 0.89 99.99 > nvme1n1 8016.80 31.32 0.00 0.00 0.01 4.00 8016.93 > 31.32 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.68 > nvme2n1 7983.20 31.18 0.00 0.00 0.01 4.00 7983.20 > 31.18 0.00 0.00 0.01 4.00 0.00 0.00 0.18 99.74 > nvme3n1 8030.07 31.37 0.00 0.00 0.01 4.00 8030.20 > 31.37 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.62 > nvme4n1 8016.40 31.31 0.00 0.00 0.01 4.00 8016.40 > 31.31 0.00 0.00 0.01 4.00 0.00 0.00 0.23 99.73 > nvme5n1 8034.87 31.39 0.00 0.00 0.02 4.00 8035.00 > 31.39 0.00 0.00 0.01 4.00 0.00 0.00 0.24 99.71 > > > > > single disk 1K RW > ================================================================ > fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Single > Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) > 1024B-1024B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s] > Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024 > write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12 > clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57 > lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79 > clat percentiles (usec): > | 1.00th=[ 15], 5.00th=[ 15], 10.00th=[ 15], 20.00th=[ 15], > | 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 15], > | 70.00th=[ 15], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16], > | 99.00th=[ 18], 99.50th=[ 22], 99.90th=[ 32], 99.95th=[ 33], > | 99.99th=[ 206] > bw ( KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, > stdev=561.60, samples=119 > iops : min=51884, max=56778, avg=55394.44, stdev=561.62, > samples=119 > lat (nsec) : 750=0.01%, 1000=0.01% > lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54% > lat (usec) : 100=0.01%, 250=0.02%, 500=0.01% > lat (msec) : 10=0.01% > cpu : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, > >=64=0.0% > issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s > (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec > > Disk stats (read/write): > nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, > util=99.71% > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 0.42 0.00 0.00 99.54 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s > wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > nvme8n1 0.00 0.00 0.00 0.00 0.00 0.00 55496.93 54.20 > 0.00 0.00 0.01 1.00 0.00 0.00 0.75 100.00 > > > > > > single disk 4K RW > ================================================================ > blockdev --setra 256 /dev/nvme8n1 > > fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Single > > Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s] > Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024 > write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone > resets > slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59 > clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99 > lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36 > clat percentiles (usec): > | 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 12], > | 30.00th=[ 12], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 12], > | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 12], 95.00th=[ 12], > | 99.00th=[ 14], 99.50th=[ 17], 99.90th=[ 28], 99.95th=[ 34], > | 99.99th=[ 204] > bw ( KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, > stdev=1902.45, samples=119 > iops : min=66018, max=69392, avg=68907.43, stdev=475.55, > samples=119 > lat (nsec) : 750=0.01%, 1000=0.01% > lat (usec) : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38% > lat (usec) : 100=0.01%, 250=0.02%, 1000=0.01% > lat (msec) : 10=0.01% > cpu : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), > io=15.8GiB (16.9GB), run=60001-60001msec > > Disk stats (read/write): > nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, > util=99.89% > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.08 0.00 0.57 0.00 0.00 99.35 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > nvme8n1 0.00 0.00 0.00 0.00 0.00 0.00 69041.33 > 269.69 0.00 0.00 0.01 4.00 0.00 0.00 0.68 100.00 > > . >
On 11/11/2024 14:02, Yu Kuai wrote:
> TBH, I don't know what this is. :(

It's just a website where you can post text content - notes, basically.
I use it with mailing lists that reject messages with attachments. I
prefer not to include long debug logs, test logs etc. in the body, as
they just get quoted an endless number of times and pollute the thread.
Old habit from the days when blockquoting etiquette was a thing and
kilobytes mattered.

> Yes, this is a known problem. The gap here is that I don't think an
> external bitmap is much helpful, while your results disagree.
>
>> bitmap internal 64M
>> ================================================================
>> mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1
>>
>> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
>> blockdev --setra 1024 /dev/md/raid5
>>
>> echo 8 > /sys/block/md127/md/group_thread_cnt
>> echo 8192 > /sys/block/md127/md/stripe_cache_size
>
> The array setup is fine. And the following external bitmap is using
> /bitmap/bitmap.bin; is the back-end storage of this file the same as
> the test devices?

No, I used one of the extra devices.

>> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5
>
> Then this is what I suspected: the above test is quite limited and can't
> replace a real world workload - 1 thread, 1 iodepth, 4k randwrite.

That is true. I went down this rabbit hole because I was getting worse
results with a RAID5 array than with a single disk under a real world
workload, PostgreSQL in my case. I chose these test parameters as a
worst case scenario. I did test with other parameters as well - a whole
battery of tests with iodepth 1 and 8 and block sizes from 4K, 8K and
16K all the way up to 2048K. It shows similar behaviour.

For example: 5 disk RAID5, 64K chunk, default internal bitmap, iodepth 8

randread
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K      574    146993     53.58     21.63    0    7.50%    13.25    59.30
8K      1127   144268     54.75     19.48    0    4.39%    11.25    61.50
16K     2084   133387     59.32     16.53    0    2.87%    10.52    63.03
32K     3942   126151     62.67     21.59    1    1.30%    13.03    60.14
64K     7225   115606     68.64     19.31    1    1.03%     9.58    65.30
128K    7947    63580    124.73     22.66    1    1.91%     8.94    63.48
256K    9216    36867    216.49     26.47    1    0.51%     2.65    69.43
512K    8065    16130    494.82     42.43    1    1.25%     2.41    72.56
1024K   8130     8130    983.01     64.22    1    0.97%     0.92    73.38
2048K  10685     5342   1496.28    132.24    0    2.50%     0.75    68.89

randwrite
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K         1     375   21318.71   5059.72    0   41.06%     0.10     0.38
8K         2     354   22548.71   3084.57    0    4.90%     0.11     0.35
16K        5     346   23107.64   2517.95    0    9.77%     0.11     0.49
32K       13     420   19001.29   5500.62    0   34.75%     0.22     1.30
64K       33     530   15064.25   3916.28    0    8.07%     0.29     2.92
128K      79     637   12549.72   3249.85    0    3.99%     0.72     4.60
256K     184     739   10812.12   2576.32    0   34.02%     3.81     4.32
512K     307     615   12995.86   2891.70    0    2.99%     2.31     4.31
1024K    611     611   13071.85   3287.53    0    6.96%     3.60     8.42
2048K   1051     525   15209.81   3562.27    0   35.79%     8.67    20.12

Bitmap none, array with the same settings (previous array was shut down,
drives were "cleansed" with nvme format):

randread
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K      571    146399     53.80     25.07    0    5.17%    13.54    58.45
8K      1147   146866     53.87     17.48    0    3.10%    11.20    59.26
16K     1970   126136     62.70     20.11    0    2.64%    11.06    58.88
32K     3519   112637     70.36     23.60    1    1.98%    11.05    54.55
64K     6502   104037     76.27     21.71    1    1.52%     9.60    60.40
128K    7886    63093    126.05     21.88    1    1.19%     6.84    65.40
256K    9446    37787    211.05     27.00    1    0.77%     3.60    69.37
512K    8397    16794    475.58     42.16    1    1.45%     1.85    71.99
1024K   8510     8510    939.13     55.02    1    1.01%     1.00    72.60
2048K  11035     5517   1448.77     84.14    1    1.99%     0.74    73.49

randwrite
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K       195    50151    158.96     48.56    1    1.13%     5.74    34.68
8K       264    33897    235.39     77.11    1    1.32%     4.60    34.46
16K      343    22003    362.88    111.80    1    1.70%     5.34    37.17
32K      645    20642    386.83    145.86    0   33.84%     6.48    45.15
64K      917    14680    543.97    170.23    0    3.01%     6.05    53.27
128K    1416    11332    704.94    202.18    0    4.66%     9.69    57.63
256K    1394     5576   1433.60    375.88    1    1.52%     8.53    24.93
512K    1726     3452   2316.19    500.19    1    1.18%    12.38    30.54
1024K   2598     2598   3077.47    629.37    0    2.53%    18.74    47.02
2048K   2457     1228   6508.20   1825.67    0    3.32%    28.70    61.01

Reads are fine but writes are many times slower ...

> I still can't believe your test result, and I can't figure out why the
> internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
> raid5 and ran the same fio script to test; the result is quite
> different from yours:
>
> ram0: 981MiB/s
> non-bitmap: 132MiB/s
> internal-bitmap: 95.5MiB/s

I don't know. I can provide full fio test logs, including fio "tracing",
for these iodepth 8 tests if that would make any difference.

> There is absolutely something wrong here, it doesn't make sense to me
> that the internal bitmap is so slow. However, I have no idea until you
> can provide the perf result.

I may be able to find time to do that over the weekend, but don't hold
me to it. The test setup will not be the same, as the server is in
production ... I did leave some "spare" partitions on all drives to
investigate this issue further but did not find the time.

Please send me an example of how you would like me to run the perf tool;
I haven't used it much.

Thanks
Dragan
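[As a rough illustration, the block-size/iodepth sweep Dragan describes could be scripted like this. This is a sketch only - his actual script and its result parsing are not shown in the thread; the target device and per-point runtime are taken from the fio command lines above:]

for iodepth in 1 8; do
    for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k 2048k; do
        for rw in randread randwrite; do
            # one 60s data point per (rw, bs, iodepth) combination
            fio --filename=/dev/md/raid5 --direct=1 --rw=$rw --bs=$bs \
                --ioengine=libaio --iodepth=$iodepth --runtime=60 \
                --numjobs=1 --group_reporting --time_based \
                --name=Raid5-$rw-$bs-qd$iodepth
        done
    done
done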
On Thu, 7 Nov 2024 17:28:43 -0800 Song Liu <song@kernel.org> wrote:
> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2024/11/08 7:41, Song Liu wrote:
> > > On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>
> > >> From: Yu Kuai <yukuai3@huawei.com>
> > >>
> > >> The bitmap file has been marked as deprecated for more than a year now,
> > >> let's remove it, and we don't need to care about this case in the new
> > >> bitmap.
> > >>
> > >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > >
> > > What happens when an old array with bitmap file boots into a kernel
> > > without bitmap file support?
> >
> > If mdadm is used with bitmap file support, then the kernel will just
> > ignore the bitmap, the same as having no bitmap. Perhaps it's better
> > to leave an error message?
>
> Yes, we should print some error message before assembling the array.
>
> > And if mdadm is updated, reassemble will fail.

It would be great if mdadm could just ignore it too. It comes from the
config file, so you can simply ignore the bitmap entry if it is anything
other than "internal" or "clustered". You can print an error, but you
must do it somewhere else (outside config.c); otherwise the user would
be prompted on every config read. We probably don't need to make it
that noisy, but maybe we should - the user may not notice the change if
we are not screaming it loudly. I have no opinion here.

The first rule is always data access - we should not break that if
possible. In this case I think it is possible to keep such arrays
assembled.

> I think we should ship this with 6.14 (not 6.13), so that we have
> more time testing different combinations of old/new mdadm
> and kernel. WDYT?

Later is better, because it decreases the possibility that someone hits
the case of a new kernel with an old mdadm, where some ioctl/sysfs write
failures will probably be observed. I would say that we should wait
around one year after removing it from mdadm; that is my preference.

I will merge Kuai's changes soon, before the release. I think it is
valuable to have it blocked in the new mdadm release.

Mariusz
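[For context, the config entry Mariusz refers to is the bitmap= option on an mdadm.conf ARRAY line. A hypothetical example follows - the UUID is reused from the test output above purely as a placeholder, and the skip-instead-of-fail behaviour is only proposed in this thread, not implemented:]

# Old array created with a deprecated file-based bitmap. Under the
# proposal above, an updated mdadm would ignore any bitmap= value other
# than "internal" or "clustered" instead of refusing to assemble.
ARRAY /dev/md/raid5 metadata=1.2 name=raid5 UUID=93fdcd4b:ae61a1f8:4d809242:2cd4a4c7 bitmap=/bitmap/bitmap.bin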
Hi,

On 2024/11/11 22:07, Dragan Milivojević wrote:
> Reads are fine but writes are many times slower ...
>
>> I still can't believe your test result, and I can't figure out why the
>> internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
>> raid5 and ran the same fio script to test; the result is quite
>> different from yours:
>>
>> ram0: 981MiB/s
>> non-bitmap: 132MiB/s
>> internal-bitmap: 95.5MiB/s

So I waited for Paul to have a chance to test this on real disks;
the results are still similar to the above.

> I don't know. I can provide full fio test logs, including fio
> "tracing", for these iodepth 8 tests if that would make any difference.

No, I don't need fio logs.

>> There is absolutely something wrong here, it doesn't make sense to me
>> that the internal bitmap is so slow. However, I have no idea until you
>> can provide the perf result.
>
> I may be able to find time to do that over the weekend, but don't hold
> me to it. The test setup will not be the same, as the server is in
> production ... I did leave some "spare" partitions on all drives to
> investigate this issue further but did not find the time.
>
> Please send me an example of how you would like me to run the perf
> tool; I haven't used it much.

You can see examples here:

https://github.com/brendangregg/FlameGraph

In short, while the test is running:

perf record -a -g -- sleep 10
perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl

BTW, you said that you're using a production environment; this will
probably make it hard to analyze performance.

Thanks,
Kuai
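[Spelled out end to end, the recipe above amounts to something like the following. The paths are assumptions - it presumes the FlameGraph repo is cloned into the working directory and that the fio job is already running when perf samples:]

git clone https://github.com/brendangregg/FlameGraph
perf record -a -g -- sleep 10        # system-wide sample with call graphs, 10s
perf script -i perf.data \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > md-bitmap.svg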
On 13/11/2024 02:18, Yu Kuai wrote:
>>> ram0: 981MiB/s
>>> non-bitmap: 132MiB/s
>>> internal-bitmap: 95.5MiB/s
>
> So I waited for Paul to have a chance to test this on real disks;
> the results are still similar to the above.

That is interesting. How are you running those tests? I should try them
on my hardware as well.

> You can see examples here:
>
> https://github.com/brendangregg/FlameGraph
>
> In short, while the test is running:
>
> perf record -a -g -- sleep 10
> perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl
>
> BTW, you said that you're using a production environment; this will
> probably make it hard to analyze performance.

I may be able to move things around for the weekend; we will see.

Thanks
Dragan
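[For anyone wanting to repeat the ramdisk comparison Yu Kuai mentions, one possible setup is sketched below. This is an assumption - the thread does not show his exact commands. brd's rd_size is in KiB, so five 2 GiB ramdisks stand in for the 10GB figure; the mdadm and fio parameters mirror the NVMe tests above:]

modprobe brd rd_nr=5 rd_size=2097152     # five 2GiB ram-backed block devices
mdadm --create --assume-clean --bitmap=internal /dev/md/raid5 \
      --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/ram{0..4}
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 \
    --group_reporting --time_based --name=Raid5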
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 29da10e6f703..6895883fc622 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -167,8 +167,6 @@ struct bitmap {
     int need_sync;
 
     struct bitmap_storage {
-        /* backing disk file */
-        struct file *file;
         /* cached copy of the bitmap file superblock */
         struct page *sb_page;
         unsigned long sb_index;
@@ -495,135 +493,6 @@ static void write_sb_page(struct bitmap *bitmap, unsigned long pg_index,
 
 static void md_bitmap_file_kick(struct bitmap *bitmap);
 
-#ifdef CONFIG_MD_BITMAP_FILE
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-    struct buffer_head *bh = page_buffers(page);
-
-    while (bh && bh->b_blocknr) {
-        atomic_inc(&bitmap->pending_writes);
-        set_buffer_locked(bh);
-        set_buffer_mapped(bh);
-        submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
-        bh = bh->b_this_page;
-    }
-
-    if (wait)
-        wait_event(bitmap->write_wait,
-               atomic_read(&bitmap->pending_writes) == 0);
-}
-
-static void end_bitmap_write(struct buffer_head *bh, int uptodate)
-{
-    struct bitmap *bitmap = bh->b_private;
-
-    if (!uptodate)
-        set_bit(BITMAP_WRITE_ERROR, &bitmap->flags);
-    if (atomic_dec_and_test(&bitmap->pending_writes))
-        wake_up(&bitmap->write_wait);
-}
-
-static void free_buffers(struct page *page)
-{
-    struct buffer_head *bh;
-
-    if (!PagePrivate(page))
-        return;
-
-    bh = page_buffers(page);
-    while (bh) {
-        struct buffer_head *next = bh->b_this_page;
-        free_buffer_head(bh);
-        bh = next;
-    }
-    detach_page_private(page);
-    put_page(page);
-}
-
-/* read a page from a file.
- * We both read the page, and attach buffers to the page to record the
- * address of each block (using bmap). These addresses will be used
- * to write the block later, completely bypassing the filesystem.
- * This usage is similar to how swap files are handled, and allows us
- * to write to a file with no concerns of memory allocation failing.
- */
-static int read_file_page(struct file *file, unsigned long index,
-        struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-    int ret = 0;
-    struct inode *inode = file_inode(file);
-    struct buffer_head *bh;
-    sector_t block, blk_cur;
-    unsigned long blocksize = i_blocksize(inode);
-
-    pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
-         (unsigned long long)index << PAGE_SHIFT);
-
-    bh = alloc_page_buffers(page, blocksize);
-    if (!bh) {
-        ret = -ENOMEM;
-        goto out;
-    }
-    attach_page_private(page, bh);
-    blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
-    while (bh) {
-        block = blk_cur;
-
-        if (count == 0)
-            bh->b_blocknr = 0;
-        else {
-            ret = bmap(inode, &block);
-            if (ret || !block) {
-                ret = -EINVAL;
-                bh->b_blocknr = 0;
-                goto out;
-            }
-
-            bh->b_blocknr = block;
-            bh->b_bdev = inode->i_sb->s_bdev;
-            if (count < blocksize)
-                count = 0;
-            else
-                count -= blocksize;
-
-            bh->b_end_io = end_bitmap_write;
-            bh->b_private = bitmap;
-            atomic_inc(&bitmap->pending_writes);
-            set_buffer_locked(bh);
-            set_buffer_mapped(bh);
-            submit_bh(REQ_OP_READ, bh);
-        }
-        blk_cur++;
-        bh = bh->b_this_page;
-    }
-
-    wait_event(bitmap->write_wait,
-           atomic_read(&bitmap->pending_writes)==0);
-    if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
-        ret = -EIO;
-out:
-    if (ret)
-        pr_err("md: bitmap read error: (%dB @ %llu): %d\n",
-               (int)PAGE_SIZE,
-               (unsigned long long)index << PAGE_SHIFT,
-               ret);
-    return ret;
-}
-#else /* CONFIG_MD_BITMAP_FILE */
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-}
-static int read_file_page(struct file *file, unsigned long index,
-        struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-    return -EIO;
-}
-static void free_buffers(struct page *page)
-{
-    put_page(page);
-}
-#endif /* CONFIG_MD_BITMAP_FILE */
-
 /*
  * bitmap file superblock operations
  */
@@ -642,10 +511,7 @@ static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
         pg_index += store->sb_index;
     }
 
-    if (store->file)
-        write_file_page(bitmap, page, wait);
-    else
-        write_sb_page(bitmap, pg_index, page, wait);
+    write_sb_page(bitmap, pg_index, page, wait);
 }
 
 /*
@@ -655,18 +521,15 @@ static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
  */
 static void md_bitmap_wait_writes(struct bitmap *bitmap)
 {
-    if (bitmap->storage.file)
-        wait_event(bitmap->write_wait,
-               atomic_read(&bitmap->pending_writes)==0);
-    else
-        /* Note that we ignore the return value. The writes
-         * might have failed, but that would just mean that
-         * some bits which should be cleared haven't been,
-         * which is safe. The relevant bitmap blocks will
-         * probably get written again, but there is no great
-         * loss if they aren't.
-         */
-        md_super_wait(bitmap->mddev);
+    /*
+     * Note that we ignore the return value. The writes
+     * might have failed, but that would just mean that
+     * some bits which should be cleared haven't been,
+     * which is safe. The relevant bitmap blocks will
+     * probably get written again, but there is no great
+     * loss if they aren't.
+     */
+    md_super_wait(bitmap->mddev);
 }
 
 
@@ -704,11 +567,8 @@ static void bitmap_update_sb(void *data)
                    bitmap_info.space);
     kunmap_atomic(sb);
 
-    if (bitmap->storage.file)
-        write_file_page(bitmap, bitmap->storage.sb_page, 1);
-    else
-        write_sb_page(bitmap, bitmap->storage.sb_index,
-                  bitmap->storage.sb_page, 1);
+    write_sb_page(bitmap, bitmap->storage.sb_index, bitmap->storage.sb_page,
+              1);
 }
 
 static void bitmap_print_sb(struct bitmap *bitmap)
@@ -821,7 +681,7 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
     struct page *sb_page;
     loff_t offset = 0;
 
-    if (!bitmap->storage.file && !bitmap->mddev->bitmap_info.offset) {
+    if (!bitmap->mddev->bitmap_info.offset) {
         chunksize = 128 * 1024 * 1024;
         daemon_sleep = 5 * HZ;
         write_behind = 0;
@@ -851,16 +711,8 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
             bitmap->cluster_slot, offset);
     }
 
-    if (bitmap->storage.file) {
-        loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
-        int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
-
-        err = read_file_page(bitmap->storage.file, 0,
-                bitmap, bytes, sb_page);
-    } else {
-        err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
-                   sizeof(bitmap_super_t));
-    }
+    err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
+               sizeof(bitmap_super_t));
 
     if (err)
         return err;
@@ -1062,25 +914,18 @@ static int md_bitmap_storage_alloc(struct bitmap_storage *store,
 
 static void md_bitmap_file_unmap(struct bitmap_storage *store)
 {
-    struct file *file = store->file;
     struct page *sb_page = store->sb_page;
     struct page **map = store->filemap;
     int pages = store->file_pages;
 
     while (pages--)
         if (map[pages] != sb_page) /* 0 is sb_page, release it below */
-            free_buffers(map[pages]);
+            put_page(map[pages]);
     kfree(map);
     kfree(store->filemap_attr);
 
     if (sb_page)
-        free_buffers(sb_page);
-
-    if (file) {
-        struct inode *inode = file_inode(file);
-
-        invalidate_mapping_pages(inode->i_mapping, 0, -1);
-        fput(file);
-    }
+        put_page(sb_page);
 }
 
 /*
@@ -1092,14 +937,8 @@ static void md_bitmap_file_kick(struct bitmap *bitmap)
 {
     if (!test_and_set_bit(BITMAP_STALE, &bitmap->flags)) {
         bitmap_update_sb(bitmap);
-
-        if (bitmap->storage.file) {
-            pr_warn("%s: kicking failed bitmap file %pD4 from array!\n",
-                bmname(bitmap), bitmap->storage.file);
-
-        } else
-            pr_warn("%s: disabling internal bitmap due to errors\n",
-                bmname(bitmap));
+        pr_warn("%s: disabling internal bitmap due to errors\n",
+            bmname(bitmap));
     }
 }
 
@@ -1319,13 +1158,12 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
     struct mddev *mddev = bitmap->mddev;
     unsigned long chunks = bitmap->counts.chunks;
     struct bitmap_storage *store = &bitmap->storage;
-    struct file *file = store->file;
     unsigned long node_offset = 0;
     unsigned long bit_cnt = 0;
     unsigned long i;
     int ret;
 
-    if (!file && !mddev->bitmap_info.offset) {
+    if (!mddev->bitmap_info.offset) {
         /* No permanent bitmap - fill with '1s'. */
         store->filemap = NULL;
         store->file_pages = 0;
@@ -1340,15 +1178,6 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
         return 0;
     }
 
-    if (file && i_size_read(file->f_mapping->host) < store->bytes) {
-        pr_warn("%s: bitmap file too short %lu < %lu\n",
-            bmname(bitmap),
-            (unsigned long) i_size_read(file->f_mapping->host),
-            store->bytes);
-        ret = -ENOSPC;
-        goto err;
-    }
-
     if (mddev_is_clustered(mddev))
         node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
@@ -1362,11 +1191,7 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
         else
             count = PAGE_SIZE;
 
-        if (file)
-            ret = read_file_page(file, i, bitmap, count, page);
-        else
-            ret = read_sb_page(mddev, 0, page, i + node_offset,
-                       count);
+        ret = read_sb_page(mddev, 0, page, i + node_offset, count);
 
         if (ret)
             goto err;
     }
@@ -1444,10 +1269,6 @@ static void bitmap_write_all(struct mddev *mddev)
     if (!bitmap || !bitmap->storage.filemap)
         return;
 
-    /* Only one copy, so nothing needed */
-    if (bitmap->storage.file)
-        return;
-
     for (i = 0; i < bitmap->storage.file_pages; i++)
         set_page_attr(bitmap, i, BITMAP_PAGE_NEEDWRITE);
     bitmap->allclean = 0;
@@ -2105,14 +1926,11 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
 {
     struct bitmap *bitmap;
     sector_t blocks = mddev->resync_max_sectors;
-    struct file *file = mddev->bitmap_info.file;
     int err;
     struct kernfs_node *bm = NULL;
 
     BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
 
-    BUG_ON(file && mddev->bitmap_info.offset);
-
     if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
         pr_notice("md/raid:%s: array with journal cannot have bitmap\n",
               mdname(mddev));
@@ -2140,15 +1958,6 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
     } else
         bitmap->sysfs_can_clear = NULL;
 
-    bitmap->storage.file = file;
-    if (file) {
-        get_file(file);
-        /* As future accesses to this file will use bmap,
-         * and bypass the page cache, we must sync the file
-         * first.
-         */
-        vfs_fsync(file, 1);
-    }
     /* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
     if (!mddev->bitmap_info.external) {
         /*
@@ -2352,7 +2161,6 @@ static int bitmap_get_stats(void *data, struct md_bitmap_stats *stats)
     storage = &bitmap->storage;
 
     stats->file_pages = storage->file_pages;
-    stats->file = storage->file;
 
     stats->behind_writes = atomic_read(&bitmap->behind_writes);
     stats->behind_wait = wq_has_sleeper(&bitmap->behind_wait);
@@ -2383,11 +2191,6 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     long pages;
     struct bitmap_page *new_bp;
 
-    if (bitmap->storage.file && !init) {
-        pr_info("md: cannot resize file-based bitmap\n");
-        return -EINVAL;
-    }
-
     if (chunksize == 0) {
         /* If there is enough space, leave the chunk size unchanged,
          * else increase by factor of two until there is enough space.
@@ -2421,7 +2224,7 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
 
     memset(&store, 0, sizeof(store));
-    if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
+    if (bitmap->mddev->bitmap_info.offset)
         ret = md_bitmap_storage_alloc(&store, chunks,
                       !bitmap->mddev->bitmap_info.external,
                       mddev_is_clustered(bitmap->mddev)
@@ -2443,9 +2246,6 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     if (!init)
         bitmap->mddev->pers->quiesce(bitmap->mddev, 1);
 
-    store.file = bitmap->storage.file;
-    bitmap->storage.file = NULL;
-
     if (store.sb_page && bitmap->storage.sb_page)
         memcpy(page_address(store.sb_page),
                page_address(bitmap->storage.sb_page),
@@ -2582,9 +2382,7 @@ static ssize_t location_show(struct mddev *mddev, char *page)
 {
     ssize_t len;
 
-    if (mddev->bitmap_info.file)
-        len = sprintf(page, "file");
-    else if (mddev->bitmap_info.offset)
+    if (mddev->bitmap_info.offset)
         len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
     else
         len = sprintf(page, "none");
@@ -2608,8 +2406,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
         }
     }
 
-    if (mddev->bitmap || mddev->bitmap_info.file ||
-        mddev->bitmap_info.offset) {
+    if (mddev->bitmap || mddev->bitmap_info.offset) {
         /* bitmap already configured. Only option is to clear it */
         if (strncmp(buf, "none", 4) != 0) {
             rv = -EBUSY;
@@ -2618,22 +2415,11 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
 
         bitmap_destroy(mddev);
         mddev->bitmap_info.offset = 0;
-        if (mddev->bitmap_info.file) {
-            struct file *f = mddev->bitmap_info.file;
-
-            mddev->bitmap_info.file = NULL;
-            fput(f);
-        }
     } else {
         /* No bitmap, OK to set a location */
         long long offset;
 
-        if (strncmp(buf, "none", 4) == 0)
-            /* nothing to be done */;
-        else if (strncmp(buf, "file:", 5) == 0) {
-            /* Not supported yet */
-            rv = -EINVAL;
-            goto out;
-        } else {
+        if (strncmp(buf, "none", 4) != 0) {
             if (buf[0] == '+')
                 rv = kstrtoll(buf+1, 10, &offset);
             else
@@ -2864,10 +2650,9 @@ static ssize_t metadata_show(struct mddev *mddev, char *page)
 static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len)
 {
-    if (mddev->bitmap ||
-        mddev->bitmap_info.file ||
-        mddev->bitmap_info.offset)
+    if (mddev->bitmap || mddev->bitmap_info.offset)
         return -EBUSY;
+
     if (strncmp(buf, "external", 8) == 0)
         mddev->bitmap_info.external = 1;
     else if ((strncmp(buf, "internal", 8) == 0) ||
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 662e6fc141a7..4b386954f5f5 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -67,7 +67,6 @@ struct md_bitmap_stats {
     unsigned long file_pages;
     unsigned long sync_size;
     unsigned long pages;
-    struct file *file;
 };
 
 struct bitmap_operations {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 35c2e1e761aa..03f2a9fafea2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1155,7 +1155,7 @@ struct super_type {
  */
 int md_check_no_bitmap(struct mddev *mddev)
 {
-    if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
+    if (!mddev->bitmap_info.offset)
         return 0;
     pr_warn("%s: bitmaps are not supported for %s\n",
         mdname(mddev), mddev->pers->name);
@@ -1349,8 +1349,7 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *freshest, stru
 
         mddev->max_disks = MD_SB_DISKS;
 
-        if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
-            mddev->bitmap_info.file == NULL) {
+        if (sb->state & (1<<MD_SB_BITMAP_PRESENT)) {
             mddev->bitmap_info.offset =
                 mddev->bitmap_info.default_offset;
             mddev->bitmap_info.space =
@@ -1476,7 +1475,7 @@ static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
     sb->layout = mddev->layout;
     sb->chunk_size = mddev->chunk_sectors << 9;
 
-    if (mddev->bitmap && mddev->bitmap_info.file == NULL)
+    if (mddev->bitmap)
         sb->state |= (1<<MD_SB_BITMAP_PRESENT);
 
     sb->disks[0].state = (1<<MD_DISK_REMOVED);
@@ -1824,8 +1823,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
 
         mddev->max_disks = (4096-256)/2;
 
-        if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
-            mddev->bitmap_info.file == NULL) {
+        if (le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
             mddev->bitmap_info.offset =
                 (__s32)le32_to_cpu(sb->bitmap_offset);
             /* Metadata doesn't record how much space is available.
@@ -2030,7 +2028,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
     sb->data_offset = cpu_to_le64(rdev->data_offset);
     sb->data_size = cpu_to_le64(rdev->sectors);
 
-    if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
+    if (mddev->bitmap) {
         sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
         sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
     }
@@ -2227,6 +2225,10 @@ static int
 super_1_allow_new_offset(struct md_rdev *rdev,
              unsigned long long new_offset)
 {
+    struct mddev *mddev = rdev->mddev;
+    struct md_bitmap_stats stats;
+    int err;
+
     /* All necessary checks on new >= old have been done */
     if (new_offset >= rdev->data_offset)
         return 1;
@@ -2245,21 +2247,12 @@ super_1_allow_new_offset(struct md_rdev *rdev,
     if (rdev->sb_start + (32+4)*2 > new_offset)
         return 0;
 
-    if (!rdev->mddev->bitmap_info.file) {
-        struct mddev *mddev = rdev->mddev;
-        struct md_bitmap_stats stats;
-        int err;
-
-        err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
-        if (!err && rdev->sb_start + mddev->bitmap_info.offset +
-            stats.file_pages * (PAGE_SIZE >> 9) > new_offset)
-            return 0;
-    }
-
-    if (rdev->badblocks.sector + rdev->badblocks.size > new_offset)
-        return 0;
+    err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
+    if (err)
+        return 1;
 
-    return 1;
+    return rdev->sb_start + mddev->bitmap_info.offset +
+           stats.file_pages * (PAGE_SIZE >> 9) <= new_offset;
 }
 
 static struct super_type super_types[] = {
@@ -6150,8 +6143,7 @@ int md_run(struct mddev *mddev)
                 (unsigned long long)pers->size(mddev, 0, 0) / 2);
             err = -EINVAL;
         }
-        if (err == 0 && pers->sync_request &&
-            (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
+        if (err == 0 && pers->sync_request && mddev->bitmap_info.offset) {
             err = mddev->bitmap_ops->create(mddev, -1);
             if (err)
                 pr_warn("%s: failed to create bitmap (%d)\n",
@@ -6563,17 +6555,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
 
     if (mode == 0) {
         pr_info("md: %s stopped.\n", mdname(mddev));
 
-        if (mddev->bitmap_info.file) {
-            struct file *f = mddev->bitmap_info.file;
-            spin_lock(&mddev->lock);
-            mddev->bitmap_info.file = NULL;
-            spin_unlock(&mddev->lock);
-            fput(f);
-        }
         mddev->bitmap_info.offset = 0;
-
         export_array(mddev);
-
         md_clean(mddev);
         if (mddev->hold_active == UNTIL_STOP)
             mddev->hold_active = 0;
@@ -6767,38 +6750,6 @@ static int get_array_info(struct mddev *mddev, void __user *arg)
     return 0;
 }
 
-static int get_bitmap_file(struct mddev *mddev, void __user * arg)
-{
-    mdu_bitmap_file_t *file = NULL; /* too big for stack allocation */
-    char *ptr;
-    int err;
-
-    file = kzalloc(sizeof(*file), GFP_NOIO);
-    if (!file)
-        return -ENOMEM;
-
-    err = 0;
-    spin_lock(&mddev->lock);
-    /* bitmap enabled */
-    if (mddev->bitmap_info.file) {
-        ptr = file_path(mddev->bitmap_info.file, file->pathname,
-                sizeof(file->pathname));
-        if (IS_ERR(ptr))
-            err = PTR_ERR(ptr);
-        else
-            memmove(file->pathname, ptr,
-                sizeof(file->pathname)-(ptr-file->pathname));
-    }
-    spin_unlock(&mddev->lock);
-
-    if (err == 0 &&
-        copy_to_user(arg, file, sizeof(*file)))
-        err = -EFAULT;
-
-    kfree(file);
-    return err;
-}
-
 static int get_disk_info(struct mddev *mddev, void __user * arg)
 {
     mdu_disk_info_t info;
@@ -7153,92 +7104,6 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
     return err;
 }
 
-static int set_bitmap_file(struct mddev *mddev, int fd)
-{
-    int err = 0;
-
-    if (mddev->pers) {
-        if (!mddev->pers->quiesce || !mddev->thread)
-            return -EBUSY;
-        if (mddev->recovery || mddev->sync_thread)
-            return -EBUSY;
-        /* we should be able to change the bitmap.. */
-    }
-
-    if (fd >= 0) {
-        struct inode *inode;
-        struct file *f;
-
-        if (mddev->bitmap || mddev->bitmap_info.file)
-            return -EEXIST; /* cannot add when bitmap is present */
-
-        if (!IS_ENABLED(CONFIG_MD_BITMAP_FILE)) {
-            pr_warn("%s: bitmap files not supported by this kernel\n",
-                mdname(mddev));
-            return -EINVAL;
-        }
-        pr_warn("%s: using deprecated bitmap file support\n",
-            mdname(mddev));
-
-        f = fget(fd);
-
-        if (f == NULL) {
-            pr_warn("%s: error: failed to get bitmap file\n",
-                mdname(mddev));
-            return -EBADF;
-        }
-
-        inode = f->f_mapping->host;
-        if (!S_ISREG(inode->i_mode)) {
-            pr_warn("%s: error: bitmap file must be a regular file\n",
-                mdname(mddev));
-            err = -EBADF;
-        } else if (!(f->f_mode & FMODE_WRITE)) {
-            pr_warn("%s: error: bitmap file must open for write\n",
-                mdname(mddev));
-            err = -EBADF;
-        } else if (atomic_read(&inode->i_writecount) != 1) {
-            pr_warn("%s: error: bitmap file is already in use\n",
-                mdname(mddev));
-            err = -EBUSY;
-        }
-        if (err) {
-            fput(f);
-            return err;
-        }
-        mddev->bitmap_info.file = f;
-        mddev->bitmap_info.offset = 0; /* file overrides offset */
-    } else if (mddev->bitmap == NULL)
-        return -ENOENT; /* cannot remove what isn't there */
-    err = 0;
-    if (mddev->pers) {
-        if (fd >= 0) {
-            err = mddev->bitmap_ops->create(mddev, -1);
-            if (!err)
-                err = mddev->bitmap_ops->load(mddev);
-
-            if (err) {
-                mddev->bitmap_ops->destroy(mddev);
-                fd = -1;
-            }
-        } else if (fd < 0) {
-            mddev->bitmap_ops->destroy(mddev);
-        }
-    }
-
-    if (fd < 0) {
-        struct file *f = mddev->bitmap_info.file;
-        if (f) {
-            spin_lock(&mddev->lock);
-            mddev->bitmap_info.file = NULL;
-            spin_unlock(&mddev->lock);
-            fput(f);
-        }
-    }
-
-    return err;
-}
-
 /*
  * md_set_array_info is used two different ways
  * The original usage is when creating a new array.
@@ -7520,11 +7385,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
             if (rv)
                 goto err;
 
-            if (stats.file) {
-                rv = -EINVAL;
-                goto err;
-            }
-
             if (mddev->bitmap_info.nodes) {
                 /* hold PW on all the bitmap lock */
                 if (md_cluster_ops->lock_all_bitmaps(mddev) <= 0) {
@@ -7589,18 +7449,19 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 static inline int md_ioctl_valid(unsigned int cmd)
 {
     switch (cmd) {
+    case GET_BITMAP_FILE:
+    case SET_BITMAP_FILE:
+        return -EOPNOTSUPP;
     case GET_ARRAY_INFO:
     case GET_DISK_INFO:
     case RAID_VERSION:
         return 0;
     case ADD_NEW_DISK:
-    case GET_BITMAP_FILE:
    case HOT_ADD_DISK:
     case HOT_REMOVE_DISK:
     case RESTART_ARRAY_RW:
     case RUN_ARRAY:
     case SET_ARRAY_INFO:
-    case SET_BITMAP_FILE:
     case SET_DISK_FAULTY:
     case STOP_ARRAY:
     case STOP_ARRAY_RO:
@@ -7619,7 +7480,6 @@ static bool md_ioctl_need_suspend(unsigned int cmd)
     case ADD_NEW_DISK:
     case HOT_ADD_DISK:
     case HOT_REMOVE_DISK:
-    case SET_BITMAP_FILE:
     case SET_ARRAY_INFO:
         return true;
     default:
@@ -7699,9 +7559,6 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 
     case SET_DISK_FAULTY:
         return set_disk_faulty(mddev, new_decode_dev(arg));
-
-    case GET_BITMAP_FILE:
-        return get_bitmap_file(mddev, argp);
     }
 
     if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
@@ -7734,10 +7591,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
      */
     /* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
      * RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
-    if ((!mddev->raid_disks && !mddev->external)
-        && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
-        && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
-        && cmd != GET_BITMAP_FILE) {
+    if (!mddev->raid_disks && !mddev->external && cmd != ADD_NEW_DISK &&
+        cmd != STOP_ARRAY && cmd != RUN_ARRAY) {
         err = -ENODEV;
         goto unlock;
     }
@@ -7833,10 +7688,6 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
         err = do_md_run(mddev);
         goto unlock;
 
-    case SET_BITMAP_FILE:
-        err = set_bitmap_file(mddev, (int)arg);
-        goto unlock;
-
     default:
         err = -EINVAL;
         goto unlock;
@@ -7855,6 +7706,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
     clear_bit(MD_CLOSING, &mddev->flags);
     return err;
 }
+
 #ifdef CONFIG_COMPAT
 static int md_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
             unsigned int cmd, unsigned long arg)
@@ -8328,11 +8180,6 @@ static void md_bitmap_status(struct seq_file *seq, struct mddev *mddev)
            chunk_kb ? chunk_kb : mddev->bitmap_info.chunksize,
            chunk_kb ? "KB" : "B");
 
-    if (stats.file) {
-        seq_puts(seq, ", file: ");
-        seq_file_path(seq, stats.file, " \t\n");
-    }
-
     seq_putc(seq, '\n');
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4ba93af36126..bae257bc630c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -360,6 +360,34 @@ enum {
     MD_RESYNC_ACTIVE = 3,
 };
 
+struct bitmap_info {
+    /*
+     * offset from superblock of start of bitmap. May be negative, but not
+     * '0' For external metadata, offset from start of device.
+     */
+    loff_t offset;
+    /* space available at this offset */
+    unsigned long space;
+    /*
+     * this is the offset to use when hot-adding a bitmap. It should
+     * eventually be settable by sysfs.
+     */
+    loff_t default_offset;
+    /* space available at default offset */
+    unsigned long default_space;
+    struct mutex mutex;
+    unsigned long chunksize;
+    /* how many jiffies between updates? */
+    unsigned long daemon_sleep;
+    /* write-behind mode */
+    unsigned long max_write_behind;
+    int external;
+    /* Maximum number of nodes in the cluster */
+    int nodes;
+    /* Name of the cluster */
+    char cluster_name[64];
+};
+
 struct mddev {
     void *private;
     struct md_personality *pers;
@@ -519,7 +547,6 @@ struct mddev {
      * in_sync - and related safemode and MD_CHANGE changes
      * pers (also protected by reconfig_mutex and pending IO).
      * clearing ->bitmap
-     * clearing ->bitmap_info.file
      * changing ->resync_{min,max}
      * setting MD_RECOVERY_RUNNING (which interacts with resync_{min,max})
      */
@@ -537,29 +564,7 @@ struct mddev {
 
     void *bitmap; /* the bitmap for the device */
     struct bitmap_operations *bitmap_ops;
-    struct {
-        struct file *file; /* the bitmap file */
-        loff_t offset; /* offset from superblock of
-                        * start of bitmap. May be
-                        * negative, but not '0'
-                        * For external metadata, offset
-                        * from start of device.
-                        */
-        unsigned long space; /* space available at this offset */
-        loff_t default_offset; /* this is the offset to use when
-                                * hot-adding a bitmap. It should
-                                * eventually be settable by sysfs.
-                                */
-        unsigned long default_space; /* space available at
-                                      * default offset */
-        struct mutex mutex;
-        unsigned long chunksize;
-        unsigned long daemon_sleep; /* how many jiffies between updates? */
-        unsigned long max_write_behind; /* write-behind mode */
-        int external;
-        int nodes; /* Maximum number of nodes in the cluster */
-        char cluster_name[64]; /* Name of the cluster */
-    } bitmap_info;
+    struct bitmap_info bitmap_info;
 
     atomic_t max_corr_read_errors; /* max read retries */
     struct list_head all_mddevs;
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 37c4da5311ca..6a1c8d6e1849 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1332,7 +1332,7 @@ int ppl_init_log(struct r5conf *conf)
         return -EINVAL;
     }
 
-    if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
+    if (mddev->bitmap_info.offset) {
         pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
             mdname(mddev));
         return -EINVAL;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f5ac81dd21b2..296501838a60 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7811,7 +7811,7 @@ static int raid5_run(struct mddev *mddev)
     }
 
     if ((test_bit(MD_HAS_JOURNAL, &mddev->flags) || journal_dev) &&
-        (mddev->bitmap_info.offset || mddev->bitmap_info.file)) {
+        (mddev->bitmap_info.offset)) {
         pr_notice("md/raid:%s: array cannot have both journal and bitmap\n",
               mdname(mddev));
         return -EINVAL;