Message ID | 20241107125911.311347-1-yukuai1@huaweicloud.com (mailing list archive)
---|---
State | Changes Requested
Series | [md-6.13] md: remove bitmap file support
Context | Check | Description |
---|---|---|
mdraidci/vmtest-md-6_13-PR | success | PR summary |
mdraidci/vmtest-md-6_13-VM_Test-0 | success | Logs for per-patch-testing |
On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> The bitmap file has been marked as deprecated for more than a year now,
> let's remove it, and we don't need to care about this case in the new
> bitmap.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

What happens when an old array with bitmap file boots into a kernel
without bitmap file support?

Thanks,
Song
Hi,

On 2024/11/08 7:41, Song Liu wrote:
> On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The bitmap file has been marked as deprecated for more than a year now,
>> let's remove it, and we don't need to care about this case in the new
>> bitmap.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>
> What happens when an old array with bitmap file boots into a kernel
> without bitmap file support?

If mdadm is used with bitmap file support, then the kernel will just
ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
an error message?

And if mdadm is updated, reassembly will fail.

Thanks,
Kuai
On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> On 2024/11/08 7:41, Song Liu wrote:
> > [...]
> > What happens when an old array with bitmap file boots into a kernel
> > without bitmap file support?
>
> If mdadm is used with bitmap file support, then the kernel will just
> ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
> an error message?

Yes, we should print some error message before assembling the array.

> And if mdadm is updated, reassembly will fail.

I think we should ship this with 6.14 (not 6.13), so that we have more
time to test different combinations of old/new mdadm and kernel. WDYT?

Thanks,
Song
Hi,

On 2024/11/08 9:28, Song Liu wrote:
> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> [...]
>> If mdadm is used with bitmap file support, then the kernel will just
>> ignore the bitmap, the same as no bitmap. Perhaps it's better to leave
>> an error message?
>
> Yes, we should print some error message before assembling the array.

OK

>> And if mdadm is updated, reassembly will fail.
>
> I think we should ship this with 6.14 (not 6.13), so that we have more
> time to test different combinations of old/new mdadm and kernel. WDYT?

Agreed!

Thanks,
Kuai
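(For arrays still using a bitmap file, the upgrade path implied above is to
move off the file before booting a kernel without support. A minimal sketch
using standard mdadm options; /dev/md0 is a placeholder for the array, and
the array briefly runs with no bitmap at all between the two steps:)

# check whether the array currently uses a bitmap file
mdadm --detail /dev/md0 | grep -i bitmap

# drop the bitmap file, then switch to an internal bitmap
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal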
On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:

> I think we should ship this with 6.14 (not 6.13), so that we have more
> time to test different combinations of old/new mdadm and kernel. WDYT?

I'm not sure if the bitmap performance fixes are already included, but
if not, please include those too. The internal bitmap kills performance,
and the external bitmap was a workaround for that issue.
Hi,

On 2024/11/08 13:15, Dragan Milivojević wrote:
> On Fri, 8 Nov 2024 at 02:29, Song Liu <song@kernel.org> wrote:
>
>> I think we should ship this with 6.14 (not 6.13), so that we have more
>> time to test different combinations of old/new mdadm and kernel. WDYT?
>
> I'm not sure if the bitmap performance fixes are already included, but
> if not, please include those too. The internal bitmap kills performance,
> and the external bitmap was a workaround for that issue.

I don't think an external bitmap can work around the performance
degradation problem, because the global lock for the bitmap is the one
to blame for this; it's the same for an external or internal bitmap.

Do you know whether anyone is using an external bitmap in the real
world? And are there numbers for performance? If so, we'll have to
consider keeping it until the new lockless bitmap is ready.

Thanks,
Kuai
On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> I don't think an external bitmap can work around the performance
> degradation problem, because the global lock for the bitmap is the one
> to blame for this; it's the same for an external or internal bitmap.

Not according to my tests:

5 disk RAID5, 64K chunk

Test                    BW          IOPS
bitmap internal 64M     700KiB/s    174
bitmap internal 128M    702KiB/s    175
bitmap internal 512M    1142KiB/s   285
bitmap internal 1024M   40.4MiB/s   10.3k
bitmap internal 2G      66.5MiB/s   17.0k
bitmap external 64M     67.8MiB/s   17.3k
bitmap external 1024M   76.5MiB/s   19.6k
bitmap none             80.6MiB/s   20.6k
Single disk 1K          54.1MiB/s   55.4k
Single disk 4K          269MiB/s    68.8k

Full test logs with system details at: pastebin.com/raw/TK4vWjQu

> Do you know whether anyone is using an external bitmap in the real
> world? And are there numbers for performance? If so, we'll have to
> consider keeping it until the new lockless bitmap is ready.

Well, I am, and it's a royal pain, but there isn't much of an
alternative.
Hi,

On 2024/11/09 6:19, Dragan Milivojević wrote:
> On Fri, 8 Nov 2024 at 07:07, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
>> I don't think an external bitmap can work around the performance
>> degradation problem, because the global lock for the bitmap is the one
>> to blame for this; it's the same for an external or internal bitmap.
>
> Not according to my tests:
>
> 5 disk RAID5, 64K chunk
>
> Test                    BW          IOPS
> bitmap internal 64M     700KiB/s    174
> bitmap internal 128M    702KiB/s    175
> bitmap internal 512M    1142KiB/s   285
> bitmap internal 1024M   40.4MiB/s   10.3k
> bitmap internal 2G      66.5MiB/s   17.0k
> bitmap external 64M     67.8MiB/s   17.3k
> bitmap external 1024M   76.5MiB/s   19.6k

This is not what I expected. Can you give the test procedure in detail,
including the test machine, how the array was created, and the test
scripts?

> bitmap none             80.6MiB/s   20.6k
> Single disk 1K          54.1MiB/s   55.4k
> Single disk 4K          269MiB/s    68.8k
>
> Full test logs with system details at: pastebin.com/raw/TK4vWjQu
>
>> Do you know whether anyone is using an external bitmap in the real
>> world? And are there numbers for performance? If so, we'll have to
>> consider keeping it until the new lockless bitmap is ready.
>
> Well, I am, and it's a royal pain, but there isn't much of an
> alternative.

The bitmap file will be removed; the way it's implemented is
problematic. If you have plans to upgrade the kernel to v6.13+, I can
keep it for now, until the new lockless bitmap is ready.

Thanks,
Kuai
On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> This is not what I expected. Can you give the test procedure in detail,
> including the test machine, how the array was created, and the test
> scripts?

The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest is
in the linked pastebin. Let me know if you need more info.

BTW, do you guys do performance tests? All of the raid levels are
practically broken performance-wise. None of them scale. Looking
forward to seeing those patches from Shushu Yi included; does anyone
know when those will be shipped?

> The bitmap file will be removed; the way it's implemented is
> problematic. If you have plans to upgrade the kernel to v6.13+, I can
> keep it for now, until the new lockless bitmap is ready.

I usually use distro kernels, so no such plan for now. I just thought
it would be useful to ship both at the same time, to soften the blow
for those using external bitmaps.
Hi,

On 2024/11/09 10:15, Dragan Milivojević wrote:
> On Sat, 9 Nov 2024 at 02:44, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> This is not what I expected. Can you give the test procedure in detail,
>> including the test machine, how the array was created, and the test
>> scripts?
>
> The server is a Dell PowerEdge R7525 with 2x AMD EPYC 7313; the rest is
> in the linked pastebin. Let me know if you need more info.

Yes, as I said, please show me how you create the array and your test
script. I must know what you are testing, e.g. single-threaded or
high-concurrency. For example, your result shows bitmap none close to
bitmap external, which is inconsistent with our previous results. I can
only guess that you're testing single-threaded.

BTW, it would be great if you could provide some perf results for the
internal bitmap in your case; that would show us directly where the
bottleneck is.

> BTW, do you guys do performance tests? All of the raid levels are

We do, but we never test the external bitmap.

+CC Paul

Hi, do you have time to add the external bitmap to our tests?

Thanks,
Kuai
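(A minimal sketch of the kind of run and profile being asked for here: a
higher-concurrency variant of the same 4k randwrite test, plus a perf
profile captured while it runs. The iodepth/numjobs values and the
Raid5-hc job name are illustrative, not from the thread; the device path
matches the tests below:)

# high-concurrency variant of the same randwrite workload
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=8 --runtime=60 \
    --time_based --group_reporting --name=Raid5-hc

# system-wide profile while the test runs, to see where time is spent
perf record -a -g -- sleep 30
perf report --sort symbol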
On 11/11/2024 03:04, Yu Kuai wrote:

> Yes, as I said, please show me how you create the array and your test
> script. I must know what you are testing, e.g. single-threaded or
> high-concurrency. For example, your result shows bitmap none close to
> bitmap external, which is inconsistent with our previous results. I can
> only guess that you're testing single-threaded.

All of that is included in the previously linked pastebin. I will
include the contents of that pastebin at the end of this email if that
helps. Every test includes the mdadm create line, disk settings, md
settings, the fio test line used, the results, and the typical iostat
output during the test. I hope that is sufficient.

> BTW, it would be great if you could provide some perf results for the
> internal bitmap in your case; that would show us directly where the
> bottleneck is.

Not right now. This server is in production and I'm not sure if I will
be able to get it to an idle state or to find the time to do it due to
other projects.

>> BTW, do you guys do performance tests? All of the raid levels are
>
> We do, but we never test the external bitmap.

I wasn't referring to that, more to the fact that there is a huge
difference in performance between no bitmap and a bitmap, and that raid
(even "simple" levels like 0) does not scale with real-world workloads.

The contents of that pastebin, hopefully my email client won't mess up
the formatting:


5 disk RAID5, 64K chunk

Summary

Test                    BW          IOPS
bitmap internal 64M     700KiB/s    174
bitmap internal 128M    702KiB/s    175
bitmap internal 512M    1142KiB/s   285
bitmap internal 1024M   40.4MiB/s   10.3k
bitmap internal 2G      66.5MiB/s   17.0k
bitmap external 64M     67.8MiB/s   17.3k
bitmap external 1024M   76.5MiB/s   19.6k
bitmap none             80.6MiB/s   20.6k
Single disk 1K          54.1MiB/s   55.4k
Single disk 4K          269MiB/s    68.8k


AlmaLinux release 9.4 (Seafoam Ocelot)
5.14.0-427.20.1.el9_4

nvme list
Node          Generic     SN                Model                                  Namespace  Usage                  Format       FW Rev
/dev/nvme0n1  /dev/ng0n1  1460A0F9TSTJ      Dell DC NVMe CD8 U.2 960GB             0x1        122.33 GB / 960.20 GB  512 B + 0 B  2.0.0
/dev/nvme1n1  /dev/ng1n1  S6WRNJ0WA04045P   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme2n1  /dev/ng2n1  S6WRNJ0WA04048B   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme3n1  /dev/ng3n1  S6WRNJ0W810396H   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme4n1  /dev/ng4n1  S6WRNJ0W808149N   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme5n1  /dev/ng5n1  S6WRNJ0WA04043Z   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme6n1  /dev/ng6n1  PHBT909504AH016N  INTEL MEMPEK1J016GAL                   0x1        14.40 GB / 14.40 GB    512 B + 0 B  K4110420
/dev/nvme7n1  /dev/ng7n1  S6WRNJ0WA04036R   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7
/dev/nvme8n1  /dev/ng8n1  S6WRNJ0WA04050H   Samsung SSD 980 PRO with Heatsink 2TB  0x1        0.00 B / 2.00 TB       512 B + 0 B  5B2QGXA7


bitmap internal 64M
================================================================
mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=4718: Sun Jun 30 02:18:30 2024
  write: IOPS=174, BW=700KiB/s (717kB/s)(41.0MiB/60005msec); 0 zone resets
    slat (usec): min=4, max=18062, avg=11.28, stdev=176.21
    clat (usec): min=46, max=13308, avg=5700.08, stdev=1194.59
    lat (usec): min=53, max=22717, avg=5711.36, stdev=1206.03
    clat percentiles (usec):
     | 1.00th=[ 51], 5.00th=[ 5800], 10.00th=[ 5800], 20.00th=[ 5866],
     | 30.00th=[ 5866], 40.00th=[ 5866], 50.00th=[ 5866], 60.00th=[ 5932],
     | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5932], 95.00th=[ 5997],
     | 99.00th=[ 6194], 99.50th=[ 8586], 99.90th=[10290], 99.95th=[13042],
     | 99.99th=[13042]
   bw ( KiB/s): min= 608, max= 752, per=100.00%, avg=700.03, stdev=20.93, samples=119
   iops        : min= 152, max= 188, avg=175.01, stdev= 5.23, samples=119
  lat (usec)   : 50=0.68%, 100=3.23%
  lat (msec)   : 10=95.99%, 20=0.10%
  cpu          : usr=0.08%, sys=0.24%, ctx=10503, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10499,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=700KiB/s (717kB/s), 700KiB/s-700KiB/s (717kB/s-717kB/s), io=41.0MiB (43.0MB), run=60005-60005msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.07    0.00   0.00 99.93

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     175.47 0.69  0.00   0.00  5.68    4.00     0.00   0.00    1.00   100.00
nvme1n1  69.40  0.27   0.00   0.00  0.01    4.00     237.80 0.93  0.00   0.00  3.55    4.00     168.47 0.81    0.98   95.59
nvme2n1  69.20  0.27   0.00   0.00  0.01    4.00     237.60 0.93  0.00   0.00  3.55    4.00     168.47 0.81    0.98   95.61
nvme3n1  72.20  0.28   0.00   0.00  0.01    4.00     240.60 0.94  0.00   0.00  3.51    4.00     168.47 0.83    0.98   95.29
nvme4n1  68.07  0.27   0.00   0.00  0.02    4.00     236.53 0.92  0.00   0.00  3.57    4.00     168.47 0.81    0.98   95.65
nvme5n1  72.07  0.28   0.00   0.00  0.02    4.00     240.53 0.94  0.00   0.00  3.52    4.00     168.47 0.83    0.99   95.31

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 77fa1a1b:2f0dd646:adc85c8e:985513a8
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 1517 dirty (5.1%)


bitmap internal 128M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=128M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=716KiB/s][w=179 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6283: Sun Jun 30 02:49:06 2024
  write: IOPS=175, BW=702KiB/s (719kB/s)(41.1MiB/60002msec); 0 zone resets
    slat (usec): min=8, max=18200, avg=16.06, stdev=177.21
    clat (usec): min=61, max=20048, avg=5675.78, stdev=1968.88
    lat (usec): min=74, max=22975, avg=5691.84, stdev=1976.14
    clat percentiles (usec):
     | 1.00th=[ 68], 5.00th=[ 73], 10.00th=[ 5866], 20.00th=[ 5932],
     | 30.00th=[ 5932], 40.00th=[ 5932], 50.00th=[ 5932], 60.00th=[ 5997],
     | 70.00th=[ 5997], 80.00th=[ 5997], 90.00th=[ 5997], 95.00th=[ 6063],
     | 99.00th=[14615], 99.50th=[15008], 99.90th=[16188], 99.95th=[16319],
     | 99.99th=[16319]
   bw ( KiB/s): min= 384, max= 816, per=99.97%, avg=702.12, stdev=72.52, samples=119
   iops        : min= 96, max= 204, avg=175.53, stdev=18.13, samples=119
  lat (usec)   : 100=7.62%, 250=0.01%
  lat (msec)   : 10=90.80%, 20=1.56%, 50=0.01%
  cpu          : usr=0.11%, sys=0.34%, ctx=10539, majf=0, minf=8
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10534,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=702KiB/s (719kB/s), 702KiB/s-702KiB/s (719kB/s-719kB/s), io=41.1MiB (43.1MB), run=60002-60002msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.08    0.00   0.00 99.92

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     173.73 0.68  0.00   0.00  5.73    4.00     0.00   0.00    1.00   99.99
nvme1n1  65.87  0.26   0.00   0.00  0.01    4.00     226.07 0.65  0.00   0.00  3.60    2.94     160.20 0.81    0.94   92.46
nvme2n1  71.33  0.28   0.00   0.00  0.02    4.00     231.53 0.67  0.00   0.00  3.50    2.96     160.27 0.84    0.95   91.79
nvme3n1  68.60  0.27   0.00   0.00  0.02    4.00     228.80 0.66  0.00   0.00  3.68    2.95     160.27 0.93    0.99   94.37
nvme4n1  68.87  0.27   0.00   0.00  0.02    4.00     229.07 0.66  0.00   0.00  3.52    2.95     160.20 0.81    0.94   91.59
nvme5n1  72.80  0.28   0.00   0.00  0.02    4.00     233.00 0.68  0.00   0.00  3.53    2.97     160.27 0.87    0.96   92.29

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 93fdcd4b:ae61a1f8:4d809242:2cd4a4c7
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 128 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 14904 bits (chunks), 1617 dirty (10.8%)


bitmap internal 512M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=512M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=1232KiB/s][w=308 IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=6661: Sun Jun 30 02:58:11 2024
  write: IOPS=285, BW=1142KiB/s (1169kB/s)(66.9MiB/60006msec); 0 zone resets
    slat (usec): min=4, max=18130, avg=10.80, stdev=138.54
    clat (usec): min=42, max=13261, avg=3490.08, stdev=2945.95
    lat (usec): min=50, max=22827, avg=3500.88, stdev=2949.63
    clat percentiles (usec):
     | 1.00th=[ 49], 5.00th=[ 51], 10.00th=[ 52], 20.00th=[ 55],
     | 30.00th=[ 58], 40.00th=[ 72], 50.00th=[ 5866], 60.00th=[ 5932],
     | 70.00th=[ 5932], 80.00th=[ 5932], 90.00th=[ 5997], 95.00th=[ 5997],
     | 99.00th=[ 6128], 99.50th=[ 8586], 99.90th=[ 9896], 99.95th=[13042],
     | 99.99th=[13042]
   bw ( KiB/s): min= 600, max= 1648, per=99.68%, avg=1138.89, stdev=188.44, samples=119
   iops        : min= 150, max= 412, avg=284.72, stdev=47.11, samples=119
  lat (usec)   : 50=3.41%, 100=38.62%, 250=0.04%, 500=0.03%
  lat (msec)   : 10=57.83%, 20=0.07%
  cpu          : usr=0.09%, sys=0.40%, ctx=17130, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,17127,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1142KiB/s (1169kB/s), 1142KiB/s-1142KiB/s (1169kB/s-1169kB/s), io=66.9MiB (70.2MB), run=60006-60006msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.10    0.00   0.00 99.90

Device   r/s    rMB/s  rrqm/s %rrqm r_await rareq-sz w/s    wMB/s wrqm/s %wrqm w_await wareq-sz f/s    f_await aqu-sz %util
md127    0.00   0.00   0.00   0.00  0.00    0.00     307.13 1.20  0.00   0.00  3.24    4.00     0.00   0.00    1.00   100.00
nvme1n1  120.47 0.47   0.00   0.00  0.01    4.00     286.07 0.63  0.00   0.00  3.03    2.26     165.60 0.99    1.03   96.58
nvme2n1  123.87 0.48   0.00   0.00  0.01    4.00     289.47 0.65  0.00   0.00  3.00    2.28     165.60 1.00    1.04   96.63
nvme3n1  120.87 0.47   0.00   0.00  0.01    4.00     286.47 0.63  0.00   0.00  3.02    2.27     165.60 1.00    1.03   96.39
nvme4n1  125.00 0.49   0.00   0.00  0.02    4.00     290.60 0.65  0.00   0.00  3.00    2.29     165.60 1.02    1.04   96.54
nvme5n1  124.07 0.48   0.00   0.00  0.02    4.00     289.67 0.65  0.00   0.00  3.01    2.28     165.60 1.03    1.04   96.59

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 17eadc76:a367542a:feb6e24e:d650576c
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 512 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 3726 bits (chunks), 1977 dirty (53.1%)


bitmap internal 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=51.0MiB/s][w=13.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7120: Sun Jun 30 03:08:12 2024
  write: IOPS=10.3k, BW=40.4MiB/s (42.4MB/s)(2425MiB/60001msec); 0 zone resets
    slat (usec): min=6, max=18135, avg= 8.93, stdev=23.41
    clat (usec): min=3, max=10459, avg=86.97, stdev=342.95
    lat (usec): min=63, max=22927, avg=95.90, stdev=344.33
    clat percentiles (usec):
     | 1.00th=[ 62], 5.00th=[ 63], 10.00th=[ 64], 20.00th=[ 65],
     | 30.00th=[ 65], 40.00th=[ 66], 50.00th=[ 67], 60.00th=[ 67],
     | 70.00th=[ 68], 80.00th=[ 69], 90.00th=[ 70], 95.00th=[ 74],
     | 99.00th=[ 133], 99.50th=[ 155], 99.90th=[ 5997], 99.95th=[ 5997],
     | 99.99th=[ 6063]
   bw ( KiB/s): min= 616, max=52968, per=99.80%, avg=41305.95, stdev=20465.79, samples=119
   iops        : min= 154, max=13242, avg=10326.47, stdev=5116.44, samples=119
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 100=98.64%, 250=1.00%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.33%, 20=0.01%
  cpu          : usr=1.89%, sys=12.74%, ctx=620837, majf=0, minf=170751
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,620822,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=2425MiB (2543MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.27    0.00   0.00 98.70

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18216.93 71.16 0.00   0.00  0.05    4.00     0.00 0.00    0.88   100.00
nvme1n1  7256.20 28.34  0.00   0.00  0.01    4.00     7256.27  28.34 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.71
nvme2n1  7302.53 28.53  0.00   0.00  0.01    4.00     7302.53  28.53 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.73
nvme3n1  7278.47 28.43  0.00   0.00  0.01    4.00     7278.53  28.43 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.57
nvme4n1  7303.93 28.53  0.00   0.00  0.01    4.00     7303.93  28.53 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.74
nvme5n1  7292.67 28.49  0.00   0.00  0.02    4.00     7292.60  28.49 0.00   0.00  0.02    4.00     0.00 0.00    0.22   99.69

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : a0c7ad14:50689e41:e065a166:4935a186
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 1 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)


bitmap internal 2G
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=2G /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=74.7MiB/s][w=19.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7696: Sun Jun 30 03:30:40 2024
  write: IOPS=17.0k, BW=66.5MiB/s (69.8MB/s)(3993MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=18094, avg= 4.79, stdev=17.94
    clat (usec): min=5, max=10352, avg=53.37, stdev=181.29
    lat (usec): min=41, max=22883, avg=58.16, stdev=182.72
    clat percentiles (usec):
     | 1.00th=[ 43], 5.00th=[ 44], 10.00th=[ 45], 20.00th=[ 46],
     | 30.00th=[ 46], 40.00th=[ 47], 50.00th=[ 47], 60.00th=[ 48],
     | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52],
     | 99.00th=[ 90], 99.50th=[ 126], 99.90th=[ 873], 99.95th=[ 5997],
     | 99.99th=[ 6063]
   bw ( KiB/s): min= 640, max=80168, per=99.91%, avg=68080.94, stdev=21547.29, samples=119
   iops        : min= 160, max=20042, avg=17020.24, stdev=5386.82, samples=119
  lat (usec)   : 10=0.01%, 50=92.06%, 100=7.10%, 250=0.73%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01%
  cpu          : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.15    0.00   0.00 98.81

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18836.40 73.58 0.00   0.00  0.05    4.00     0.00 0.00    0.87   100.00
nvme1n1  7505.27 29.32  0.00   0.00  0.01    4.00     7505.40  29.32 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.93
nvme2n1  7510.00 29.34  0.00   0.00  0.01    4.00     7510.07  29.34 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.90
nvme3n1  7561.40 29.54  0.00   0.00  0.01    4.00     7561.47  29.54 0.00   0.00  0.01    4.00     0.00 0.00    0.19   100.00
nvme4n1  7543.07 29.47  0.00   0.00  0.01    4.00     7543.07  29.47 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.91
nvme5n1  7552.73 29.50  0.00   0.00  0.01    4.00     7552.80  29.50 0.00   0.00  0.01    4.00     0.00 0.00    0.22   99.91

mdadm -X /dev/nvme1n1
        Filename : /dev/nvme1n1
           Magic : 6d746962
         Version : 4
            UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 2 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 932 bits (chunks), 932 dirty (100.0%)


bitmap external 64M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024
  write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04
    clat (usec): min=3, max=8410, avg=50.79, stdev=27.03
    lat (usec): min=42, max=22140, avg=56.90, stdev=35.13
    clat percentiles (usec):
     | 1.00th=[ 41], 5.00th=[ 42], 10.00th=[ 44], 20.00th=[ 46],
     | 30.00th=[ 47], 40.00th=[ 48], 50.00th=[ 49], 60.00th=[ 50],
     | 70.00th=[ 51], 80.00th=[ 52], 90.00th=[ 56], 95.00th=[ 68],
     | 99.00th=[ 93], 99.50th=[ 124], 99.90th=[ 155], 99.95th=[ 237],
     | 99.99th=[ 1037]
   bw ( KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, stdev=7769.33, samples=119
   iops        : min= 9530, max=20644, avg=17350.76, stdev=1942.33, samples=119
  lat (usec)   : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.15    0.00   0.00 98.81

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     18428.60 71.99 0.00   0.00  0.05    4.00     0.00 0.00    0.87   99.99
nvme1n1  7399.40 28.90  0.00   0.00  0.01    4.00     7399.47  28.90 0.00   0.00  0.01    4.00     0.00 0.00    0.17   99.73
nvme2n1  7361.20 28.75  0.00   0.00  0.01    4.00     7361.27  28.75 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.63
nvme3n1  7376.67 28.82  0.00   0.00  0.01    4.00     7376.73  28.82 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.63
nvme4n1  7367.27 28.78  0.00   0.00  0.01    4.00     7367.20  28.78 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.65
nvme5n1  7352.47 28.72  0.00   0.00  0.01    4.00     7352.67  28.72 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.73
nvme8n1  0.47    0.00   0.00   0.00  0.00    4.00     293.40   1.15  0.00   0.00  0.02    4.00     0.00 0.00    0.01   24.24

mdadm -X /bitmap/bitmap.bin
        Filename : /bitmap/bitmap.bin
           Magic : 6d746962
         Version : 4
            UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 64 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 29807 bits (chunks), 29665 dirty (99.5%)


bitmap external 1024M
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024
  write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16
    clat (usec): min=22, max=3706, avg=46.37, stdev=20.38
    lat (usec): min=40, max=21951, avg=50.49, stdev=28.81
    clat percentiles (usec):
     | 1.00th=[ 40], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42],
     | 30.00th=[ 43], 40.00th=[ 44], 50.00th=[ 45], 60.00th=[ 47],
     | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52],
     | 99.00th=[ 86], 99.50th=[ 120], 99.90th=[ 157], 99.95th=[ 233],
     | 99.99th=[ 906]
   bw ( KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, stdev=5410.81, samples=119
   iops        : min=15404, max=21182, avg=19599.66, stdev=1352.70, samples=119
  lat (usec)   : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.03    0.00   0.00 98.93

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     20758.20 81.09 0.00   0.00  0.04    4.00     0.00 0.00    0.89   100.00
nvme1n1  8291.67 32.39  0.00   0.00  0.01    4.00     8291.73  32.39 0.00   0.00  0.01    4.00     0.00 0.00    0.22   99.87
nvme2n1  8270.93 32.31  0.00   0.00  0.01    4.00     8271.07  32.31 0.00   0.00  0.01    4.00     0.00 0.00    0.19   99.79
nvme3n1  8310.67 32.46  0.00   0.00  0.01    4.00     8310.80  32.46 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.83
nvme4n1  8300.67 32.42  0.00   0.00  0.01    4.00     8300.67  32.42 0.00   0.00  0.01    4.00     0.00 0.00    0.23   99.76
nvme5n1  8342.13 32.59  0.00   0.00  0.02    4.00     8342.13  32.59 0.00   0.00  0.01    4.00     0.00 0.00    0.25   99.85
nvme8n1  0.33    0.00   0.00   0.00  8.40    4.00     0.00     0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.33

mdadm -X /bitmap/bitmap.bin
        Filename : /bitmap/bitmap.bin
           Magic : 6d746962
         Version : 4
            UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee
          Events : 3
  Events Cleared : 3
           State : OK
       Chunksize : 1 GB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 1953382464 (1862.89 GiB 2000.26 GB)
          Bitmap : 1863 bits (chunks), 1863 dirty (100.0%)


bitmap none
================================================================
for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done

mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1

for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
blockdev --setra 1024 /dev/md/raid5

echo 8 > /sys/block/md127/md/group_thread_cnt
echo 8192 > /sys/block/md127/md/stripe_cache_size

fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s]
Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024
  write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46
    clat (usec): min=4, max=3694, avg=44.31, stdev=21.60
    lat (usec): min=39, max=13681, avg=47.81, stdev=24.98
    clat percentiles (usec):
     | 1.00th=[ 39], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 41],
     | 30.00th=[ 42], 40.00th=[ 43], 50.00th=[ 43], 60.00th=[ 44],
     | 70.00th=[ 45], 80.00th=[ 46], 90.00th=[ 48], 95.00th=[ 50],
     | 99.00th=[ 87], 99.50th=[ 117], 99.90th=[ 157], 99.95th=[ 229],
     | 99.99th=[ 963]
   bw ( KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, stdev=3696.94, samples=119
   iops        : min=18528, max=21678, avg=20621.59, stdev=924.23, samples=119
  lat (usec)   : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%
  cpu          : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    1.06    0.00   0.00 98.91

Device   r/s     rMB/s  rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
md127    0.00    0.00   0.00   0.00  0.00    0.00     20040.87 78.28 0.00   0.00  0.04    4.00     0.00 0.00    0.89   99.99
nvme1n1  8016.80 31.32  0.00   0.00  0.01    4.00     8016.93  31.32 0.00   0.00  0.01    4.00     0.00 0.00    0.21   99.68
nvme2n1  7983.20 31.18  0.00   0.00  0.01    4.00     7983.20  31.18 0.00   0.00  0.01    4.00     0.00 0.00    0.18   99.74
nvme3n1  8030.07 31.37  0.00   0.00  0.01    4.00     8030.20  31.37 0.00   0.00  0.01    4.00     0.00 0.00    0.20   99.62
nvme4n1  8016.40 31.31  0.00   0.00  0.01    4.00     8016.40  31.31 0.00   0.00  0.01    4.00     0.00 0.00    0.23   99.73
nvme5n1  8034.87 31.39  0.00   0.00  0.02    4.00     8035.00  31.39 0.00   0.00  0.01    4.00     0.00 0.00    0.24   99.71


single disk 1K RW
================================================================
fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single

Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024
  write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12
    clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57
    lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79
    clat percentiles (usec):
     | 1.00th=[ 15], 5.00th=[ 15], 10.00th=[ 15], 20.00th=[ 15],
     | 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 15],
     | 70.00th=[ 15], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16],
     | 99.00th=[ 18], 99.50th=[ 22], 99.90th=[ 32], 99.95th=[ 33],
     | 99.99th=[ 206]
   bw ( KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, stdev=561.60, samples=119
   iops        : min=51884, max=56778, avg=55394.44, stdev=561.62, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54%
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec

Disk stats (read/write):
  nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, util=99.71%

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.04  0.00    0.42    0.00   0.00 99.54

Device   r/s  rMB/s rrqm/s %rrqm r_await rareq-sz w/s      wMB/s wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
nvme8n1  0.00 0.00  0.00   0.00  0.00    0.00     55496.93 54.20 0.00   0.00  0.01    1.00     0.00 0.00    0.75   100.00


single disk 4K RW
================================================================
blockdev --setra 256 /dev/nvme8n1

fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Single

Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.35
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s]
Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024
  write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone resets
    slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59
    clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99
    lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36
    clat percentiles (usec):
     | 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 12],
     | 30.00th=[ 12], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 12],
     | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 12], 95.00th=[ 12],
     | 99.00th=[ 14], 99.50th=[ 17], 99.90th=[ 28], 99.95th=[ 34],
     | 99.99th=[ 204]
   bw ( KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, stdev=1902.45, samples=119
   iops        : min=66018, max=69392, avg=68907.43, stdev=475.55, samples=119
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38%
  lat (usec)   : 100=0.01%, 250=0.02%, 1000=0.01%
  lat (msec)   : 10=0.01%
  cpu          : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), io=15.8GiB (16.9GB), run=60001-60001msec

Disk stats (read/write):
  nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, util=99.89%

avg-cpu:  %user %nice %system %iowait %steal %idle
           0.08  0.00    0.57    0.00   0.00 99.35

Device   r/s  rMB/s rrqm/s %rrqm r_await rareq-sz w/s      wMB/s  wrqm/s %wrqm w_await wareq-sz f/s  f_await aqu-sz %util
nvme8n1  0.00 0.00  0.00   0.00  0.00    0.00     69041.33 269.69 0.00   0.00  0.01    4.00     0.00 0.00    0.68   100.00
Hi,

On 2024/11/11 19:59, Dragan Milivojević wrote:
> On 11/11/2024 03:04, Yu Kuai wrote:
>
>> Yes, as I said, please show me how you create the array and your test
>> script. I must know what you are testing, e.g. single-threaded or
>> high-concurrency. For example, your result shows bitmap none close to
>> bitmap external, which is inconsistent with our previous results. I
>> can only guess that you're testing single-threaded.
>
> All of that is included in the previously linked pastebin.

TBH, I don't know what that is. :(

> I will include the contents of that pastebin at the end of this email
> if that helps. Every test includes the mdadm create line, disk
> settings, md settings, the fio test line used, the results, and the
> typical iostat output during the test. I hope that is sufficient.
>
>> BTW, it would be great if you could provide some perf results for the
>> internal bitmap in your case; that would show us directly where the
>> bottleneck is.
>
> Not right now. This server is in production and I'm not sure if I will
> be able to get it to an idle state or to find the time to do it due to
> other projects.
>
>>> BTW, do you guys do performance tests? All of the raid levels are
>>
>> We do, but we never test the external bitmap.
>
> I wasn't referring to that, more to the fact that there is a huge
> difference in performance between no bitmap and a bitmap, and that
> raid (even "simple" levels like 0) does not scale with real-world
> workloads.

Yes, this is a known problem. The gap here is that I don't think an
external bitmap is much help, while your results disagree.

> [...]
> mdadm --verbose --create --assume-clean --bitmap=internal
> --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K
> --raid-devices=5 /dev/nvme{1..5}n1
>
> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
> blockdev --setra 1024 /dev/md/raid5
>
> echo 8 > /sys/block/md127/md/group_thread_cnt
> echo 8192 > /sys/block/md127/md/stripe_cache_size

The array setup is fine. And the following external bitmap test is
using /bitmap/bitmap.bin; is the back-end storage of this file the same
as the test device?

> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k
> --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1
> --group_reporting --time_based --name=Raid5

Then this is what I suspected: the above test is quite limited and
can't replace a real-world workload, 1 thread and 1 iodepth with 4k
randwrite.

I still can't believe your test result, and I can't figure out why the
internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
raid5 and used the same fio script to test (a sketch of that setup
follows below); the result is quite different from yours:

ram0:            981MiB/s
non-bitmap:      132MiB/s
internal-bitmap: 95.5MiB/s

> [...]
> bw ( KiB/s): min= 608, max= 752, per=100.00%, avg=700.03,
> stdev=20.93, samples=119

There is absolutely something wrong here; it doesn't make sense to me
that the internal bitmap is so slow. However, I have no idea until you
can provide the perf result.
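(A minimal sketch of the ramdisk setup mentioned above, assuming the brd
module; rd_size is in KiB, and the per-disk sizing and job name are
illustrative since the exact layout wasn't given in the thread:)

# create 5 ramdisks of 10GiB each (rd_size is in KiB)
modprobe brd rd_nr=5 rd_size=10485760

# same raid5 layout as the tests above, but on ramdisks
mdadm --create --assume-clean --bitmap=internal --bitmap-chunk=64M \
      /dev/md/raid5 --level=5 --chunk=64K --raid-devices=5 /dev/ram{0..4}

# same single-threaded 4k randwrite workload
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --numjobs=1 --runtime=60 \
    --time_based --group_reporting --name=Raid5-ram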
Thanks,
Kuai

> [...]
500=0.01% > lat (usec) : 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01%, 10=0.09%, 20=0.01% > cpu : usr=2.22%, sys=11.55%, ctx=1022167, majf=0, minf=14 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1022154,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=66.5MiB/s (69.8MB/s), 66.5MiB/s-66.5MiB/s > (69.8MB/s-69.8MB/s), io=3993MiB (4187MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.15 0.00 0.00 98.81 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 18836.40 > 73.58 0.00 0.00 0.05 4.00 0.00 0.00 0.87 100.00 > nvme1n1 7505.27 29.32 0.00 0.00 0.01 4.00 7505.40 > 29.32 0.00 0.00 0.01 4.00 0.00 0.00 0.19 99.93 > nvme2n1 7510.00 29.34 0.00 0.00 0.01 4.00 7510.07 > 29.34 0.00 0.00 0.01 4.00 0.00 0.00 0.17 99.90 > nvme3n1 7561.40 29.54 0.00 0.00 0.01 4.00 7561.47 > 29.54 0.00 0.00 0.01 4.00 0.00 0.00 0.19 100.00 > nvme4n1 7543.07 29.47 0.00 0.00 0.01 4.00 7543.07 > 29.47 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.91 > nvme5n1 7552.73 29.50 0.00 0.00 0.01 4.00 7552.80 > 29.50 0.00 0.00 0.01 4.00 0.00 0.00 0.22 99.91 > > > > mdadm -X /dev/nvme1n1 > Filename : /dev/nvme1n1 > Magic : 6d746962 > Version : 4 > UUID : 7d8ed7e8:4c2c4b17:723a22e5:4e9b5200 > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 2 GB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 932 bits (chunks), 932 dirty (100.0%) > > > > > > > > > > bitmap external 64M > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin > --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K > --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=67.3MiB/s][w=17.2k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=7912: Sun Jun 30 03:39:11 2024 > write: IOPS=17.3k, BW=67.8MiB/s (71.1MB/s)(4066MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=21987, avg= 6.11, stdev=22.04 > clat (usec): min=3, max=8410, avg=50.79, stdev=27.03 > lat (usec): min=42, max=22140, avg=56.90, stdev=35.13 > clat percentiles (usec): > | 1.00th=[ 41], 5.00th=[ 42], 10.00th=[ 44], 20.00th=[ 46], > | 30.00th=[ 47], 40.00th=[ 48], 50.00th=[ 49], 60.00th=[ 50], > | 70.00th=[ 51], 80.00th=[ 52], 90.00th=[ 56], 95.00th=[ 68], > | 99.00th=[ 93], 99.50th=[ 124], 99.90th=[ 155], 99.95th=[ 237], > | 99.99th=[ 1037] > bw ( KiB/s): min=38120, max=82576, per=100.00%, avg=69402.96, > stdev=7769.33, samples=119 > iops : min= 9530, max=20644, avg=17350.76, 
stdev=1942.33, > samples=119 > lat (usec) : 4=0.01%, 20=0.01%, 50=67.87%, 100=31.34%, 250=0.76% > lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% > cpu : usr=2.23%, sys=14.27%, ctx=1040947, majf=0, minf=233929 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1040925,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=67.8MiB/s (71.1MB/s), 67.8MiB/s-67.8MiB/s > (71.1MB/s-71.1MB/s), io=4066MiB (4264MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.15 0.00 0.00 98.81 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 18428.60 > 71.99 0.00 0.00 0.05 4.00 0.00 0.00 0.87 99.99 > nvme1n1 7399.40 28.90 0.00 0.00 0.01 4.00 7399.47 > 28.90 0.00 0.00 0.01 4.00 0.00 0.00 0.17 99.73 > nvme2n1 7361.20 28.75 0.00 0.00 0.01 4.00 7361.27 > 28.75 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.63 > nvme3n1 7376.67 28.82 0.00 0.00 0.01 4.00 7376.73 > 28.82 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.63 > nvme4n1 7367.27 28.78 0.00 0.00 0.01 4.00 7367.20 > 28.78 0.00 0.00 0.01 4.00 0.00 0.00 0.18 99.65 > nvme5n1 7352.47 28.72 0.00 0.00 0.01 4.00 7352.67 > 28.72 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.73 > nvme8n1 0.47 0.00 0.00 0.00 0.00 4.00 293.40 > 1.15 0.00 0.00 0.02 4.00 0.00 0.00 0.01 24.24 > > > > mdadm -X /bitmap/bitmap.bin > Filename : /bitmap/bitmap.bin > Magic : 6d746962 > Version : 4 > UUID : 1e3480e5:1f9d8b8a:53ebc6b7:279afb73 > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 64 MB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 29807 bits (chunks), 29665 dirty (99.5%) > > > > bitmap external 1024M > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=/bitmap/bitmap.bin > --bitmap-chunk=1024M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K > --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=70.6MiB/s][w=18.1k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=8592: Sun Jun 30 03:54:11 2024 > write: IOPS=19.6k, BW=76.5MiB/s (80.2MB/s)(4590MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=21819, avg= 4.12, stdev=20.16 > clat (usec): min=22, max=3706, avg=46.37, stdev=20.38 > lat (usec): min=40, max=21951, avg=50.49, stdev=28.81 > clat percentiles (usec): > | 1.00th=[ 40], 5.00th=[ 41], 10.00th=[ 42], 20.00th=[ 42], > | 30.00th=[ 43], 40.00th=[ 44], 50.00th=[ 45], 60.00th=[ 47], > | 70.00th=[ 48], 80.00th=[ 49], 90.00th=[ 50], 95.00th=[ 52], > | 99.00th=[ 86], 99.50th=[ 120], 99.90th=[ 
157], 99.95th=[ 233], > | 99.99th=[ 906] > bw ( KiB/s): min=61616, max=84728, per=100.00%, avg=78398.66, > stdev=5410.81, samples=119 > iops : min=15404, max=21182, avg=19599.66, stdev=1352.70, > samples=119 > lat (usec) : 50=90.10%, 100=9.16%, 250=0.72%, 500=0.01%, 750=0.01% > lat (usec) : 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01% > cpu : usr=2.35%, sys=11.88%, ctx=1175104, majf=0, minf=11 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1175086,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=76.5MiB/s (80.2MB/s), 76.5MiB/s-76.5MiB/s > (80.2MB/s-80.2MB/s), io=4590MiB (4813MB), run=60001-60001msec > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.03 0.00 0.00 98.93 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 20758.20 > 81.09 0.00 0.00 0.04 4.00 0.00 0.00 0.89 100.00 > nvme1n1 8291.67 32.39 0.00 0.00 0.01 4.00 8291.73 > 32.39 0.00 0.00 0.01 4.00 0.00 0.00 0.22 99.87 > nvme2n1 8270.93 32.31 0.00 0.00 0.01 4.00 8271.07 > 32.31 0.00 0.00 0.01 4.00 0.00 0.00 0.19 99.79 > nvme3n1 8310.67 32.46 0.00 0.00 0.01 4.00 8310.80 > 32.46 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.83 > nvme4n1 8300.67 32.42 0.00 0.00 0.01 4.00 8300.67 > 32.42 0.00 0.00 0.01 4.00 0.00 0.00 0.23 99.76 > nvme5n1 8342.13 32.59 0.00 0.00 0.02 4.00 8342.13 > 32.59 0.00 0.00 0.01 4.00 0.00 0.00 0.25 99.85 > nvme8n1 0.33 0.00 0.00 0.00 8.40 4.00 0.00 > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 > > > mdadm -X /bitmap/bitmap.bin > Filename : /bitmap/bitmap.bin > Magic : 6d746962 > Version : 4 > UUID : 30e6211d:31ac1204:e6cdadb3:9691d3ee > Events : 3 > Events Cleared : 3 > State : OK > Chunksize : 1 GB > Daemon : 5s flush period > Write Mode : Normal > Sync Size : 1953382464 (1862.89 GiB 2000.26 GB) > Bitmap : 1863 bits (chunks), 1863 dirty (100.0%) > > > > bitmap none > ================================================================ > for dev in /dev/nvme{1..5}n1; do nvme format --force $dev; done > > mdadm --verbose --create --assume-clean --bitmap=none /dev/md/raid5 > --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1 > > for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done > blockdev --setra 1024 /dev/md/raid5 > > echo 8 > /sys/block/md127/md/group_thread_cnt > echo 8192 > /sys/block/md127/md/stripe_cache_size > > > fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Raid5 > > Raid5: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=82.5MiB/s][w=21.1k IOPS][eta 00m:00s] > Raid5: (groupid=0, jobs=1): err= 0: pid=9158: Sun Jun 30 04:11:01 2024 > write: IOPS=20.6k, BW=80.6MiB/s (84.5MB/s)(4833MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=13598, avg= 3.50, stdev=12.46 > clat (usec): min=4, max=3694, avg=44.31, stdev=21.60 > lat (usec): min=39, max=13681, avg=47.81, stdev=24.98 > clat percentiles (usec): > | 1.00th=[ 39], 5.00th=[ 40], 10.00th=[ 41], 20.00th=[ 41], > | 30.00th=[ 42], 40.00th=[ 43], 50.00th=[ 43], 60.00th=[ 44], > | 70.00th=[ 
45], 80.00th=[ 46], 90.00th=[ 48], 95.00th=[ 50], > | 99.00th=[ 87], 99.50th=[ 117], 99.90th=[ 157], 99.95th=[ 229], > | 99.99th=[ 963] > bw ( KiB/s): min=74112, max=86712, per=100.00%, avg=82486.43, > stdev=3696.94, samples=119 > iops : min=18528, max=21678, avg=20621.59, stdev=924.23, > samples=119 > lat (usec) : 10=0.01%, 50=95.91%, 100=3.33%, 250=0.74%, 500=0.01% > lat (usec) : 750=0.01%, 1000=0.01% > lat (msec) : 2=0.01%, 4=0.01% > cpu : usr=2.30%, sys=10.74%, ctx=1237375, majf=0, minf=179597 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,1237359,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s > (84.5MB/s-84.5MB/s), io=4833MiB (5068MB), run=60001-60001msec > > > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 1.06 0.00 0.00 98.91 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > md127 0.00 0.00 0.00 0.00 0.00 0.00 20040.87 > 78.28 0.00 0.00 0.04 4.00 0.00 0.00 0.89 99.99 > nvme1n1 8016.80 31.32 0.00 0.00 0.01 4.00 8016.93 > 31.32 0.00 0.00 0.01 4.00 0.00 0.00 0.21 99.68 > nvme2n1 7983.20 31.18 0.00 0.00 0.01 4.00 7983.20 > 31.18 0.00 0.00 0.01 4.00 0.00 0.00 0.18 99.74 > nvme3n1 8030.07 31.37 0.00 0.00 0.01 4.00 8030.20 > 31.37 0.00 0.00 0.01 4.00 0.00 0.00 0.20 99.62 > nvme4n1 8016.40 31.31 0.00 0.00 0.01 4.00 8016.40 > 31.31 0.00 0.00 0.01 4.00 0.00 0.00 0.23 99.73 > nvme5n1 8034.87 31.39 0.00 0.00 0.02 4.00 8035.00 > 31.39 0.00 0.00 0.01 4.00 0.00 0.00 0.24 99.71 > > > > > single disk 1K RW > ================================================================ > fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=1k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Single > Single: (g=0): rw=randwrite, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) > 1024B-1024B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=54.3MiB/s][w=55.6k IOPS][eta 00m:00s] > Single: (groupid=0, jobs=1): err= 0: pid=4471: Sun Jun 30 18:31:56 2024 > write: IOPS=55.4k, BW=54.1MiB/s (56.7MB/s)(3244MiB/60001msec); 0 zone > resets > slat (usec): min=2, max=2792, avg= 2.71, stdev= 2.12 > clat (nsec): min=651, max=8350.9k, avg=14864.41, stdev=5360.57 > lat (usec): min=15, max=8403, avg=17.57, stdev= 5.79 > clat percentiles (usec): > | 1.00th=[ 15], 5.00th=[ 15], 10.00th=[ 15], 20.00th=[ 15], > | 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 15], > | 70.00th=[ 15], 80.00th=[ 16], 90.00th=[ 16], 95.00th=[ 16], > | 99.00th=[ 18], 99.50th=[ 22], 99.90th=[ 32], 99.95th=[ 33], > | 99.99th=[ 206] > bw ( KiB/s): min=51884, max=56778, per=100.00%, avg=55394.37, > stdev=561.60, samples=119 > iops : min=51884, max=56778, avg=55394.44, stdev=561.62, > samples=119 > lat (nsec) : 750=0.01%, 1000=0.01% > lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=99.43%, 50=0.54% > lat (usec) : 100=0.01%, 250=0.02%, 500=0.01% > lat (msec) : 10=0.01% > cpu : usr=3.57%, sys=16.41%, ctx=3321571, majf=0, minf=180742 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, > >=64=0.0% > issued rwts: total=0,3321653,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s > (56.7MB/s-56.7MB/s), io=3244MiB (3401MB), run=60001-60001msec > > Disk stats (read/write): > nvme8n1: ios=0/3309968, merge=0/0, ticks=0/44637, in_queue=44638, > util=99.71% > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.04 0.00 0.42 0.00 0.00 99.54 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s > wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > nvme8n1 0.00 0.00 0.00 0.00 0.00 0.00 55496.93 54.20 > 0.00 0.00 0.01 1.00 0.00 0.00 0.75 100.00 > > > > > > single disk 4K RW > ================================================================ > blockdev --setra 256 /dev/nvme8n1 > > fio --filename=/dev/nvme8n1 --direct=1 --rw=randwrite --bs=4k > --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting > --time_based --name=Single > > Single: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) > 4096B-4096B, ioengine=libaio, iodepth=1 > fio-3.35 > Starting 1 process > Jobs: 1 (f=1): [w(1)][100.0%][w=270MiB/s][w=69.0k IOPS][eta 00m:00s] > Single: (groupid=0, jobs=1): err= 0: pid=4396: Sun Jun 30 18:21:52 2024 > write: IOPS=68.8k, BW=269MiB/s (282MB/s)(15.8GiB/60001msec); 0 zone > resets > slat (usec): min=2, max=796, avg= 2.45, stdev= 1.59 > clat (nsec): min=652, max=8343.1k, avg=11616.73, stdev=5088.99 > lat (usec): min=11, max=8410, avg=14.06, stdev= 5.36 > clat percentiles (usec): > | 1.00th=[ 12], 5.00th=[ 12], 10.00th=[ 12], 20.00th=[ 12], > | 30.00th=[ 12], 40.00th=[ 12], 50.00th=[ 12], 60.00th=[ 12], > | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 12], 95.00th=[ 12], > | 99.00th=[ 14], 99.50th=[ 17], 99.90th=[ 28], 99.95th=[ 34], > | 99.99th=[ 204] > bw ( KiB/s): min=264072, max=277568, per=100.00%, avg=275629.71, > stdev=1902.45, samples=119 > iops : min=66018, max=69392, avg=68907.43, stdev=475.55, > samples=119 > lat (nsec) : 750=0.01%, 1000=0.01% > lat (usec) : 2=0.01%, 4=0.01%, 10=0.04%, 20=99.55%, 50=0.38% > lat (usec) : 100=0.01%, 250=0.02%, 1000=0.01% > lat (msec) : 10=0.01% > cpu : usr=5.20%, sys=21.28%, ctx=4129887, majf=0, minf=45204 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued rwts: total=0,4130258,0,0 short=0,0,0,0 dropped=0,0,0,0 > latency : target=0, window=0, percentile=100.00%, depth=1 > > Run status group 0 (all jobs): > WRITE: bw=269MiB/s (282MB/s), 269MiB/s-269MiB/s (282MB/s-282MB/s), > io=15.8GiB (16.9GB), run=60001-60001msec > > Disk stats (read/write): > nvme8n1: ios=0/4119593, merge=0/0, ticks=0/40922, in_queue=40923, > util=99.89% > > > avg-cpu: %user %nice %system %iowait %steal %idle > 0.08 0.00 0.57 0.00 0.00 99.35 > > Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s > wMB/s wrqm/s %wrqm w_await wareq-sz f/s f_await aqu-sz %util > nvme8n1 0.00 0.00 0.00 0.00 0.00 0.00 69041.33 > 269.69 0.00 0.00 0.01 4.00 0.00 0.00 0.68 100.00 > > . >
On 11/11/2024 14:02, Yu Kuai wrote:
> TBH, I don't know what this is. :(

It's just a website where you can post text content - notes, basically.
I use it with mailing lists that reject messages with attachments. I
prefer not to include long debug logs, test logs etc. in the body, as
they just get quoted an endless number of times and pollute the thread.
Old habit from the days when blockquoting etiquette was a thing and
kilobytes mattered.

> Yes, this is a known problem. The gap here is that I don't think an
> external bitmap is much helpful, while your results disagree.
>
>> bitmap internal 64M
>> ================================================================
>> mdadm --verbose --create --assume-clean --bitmap=internal --bitmap-chunk=64M /dev/md/raid5 --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/nvme{1..5}n1
>>
>> for dev in /dev/nvme{1..5}n1; do blockdev --setra 256 $dev; done
>> blockdev --setra 1024 /dev/md/raid5
>>
>> echo 8 > /sys/block/md127/md/group_thread_cnt
>> echo 8192 > /sys/block/md127/md/stripe_cache_size
>
> The array setup is fine. And the following external bitmap is using
> /bitmap/bitmap.bin; is the back-end storage of this file the same as
> the test devices?

No, I used one of the extra devices.

>> fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5
>
> Then this is what I suspected: the above test is quite limited and can't
> replace a real world workload - 1 thread, 1 iodepth, 4k randwrite.

That is true. I went down this rabbit hole because I was getting worse
results with a RAID5 array than with a single disk under a real world
workload, PostgreSQL in my case. I chose these test parameters as a
worst case scenario. I did test with other parameters as well - a whole
battery of tests with iodepth 1 and 8 and block sizes from 4K, 8K and
16K all the way up to 2048K. It shows similar behaviour.

For example: 5 disk RAID5, 64K chunk, default internal bitmap, iodepth 8

randread
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K      574    146993     53.58     21.63    0    7.50%    13.25    59.30
8K      1127   144268     54.75     19.48    0    4.39%    11.25    61.50
16K     2084   133387     59.32     16.53    0    2.87%    10.52    63.03
32K     3942   126151     62.67     21.59    1    1.30%    13.03    60.14
64K     7225   115606     68.64     19.31    1    1.03%     9.58    65.30
128K    7947    63580    124.73     22.66    1    1.91%     8.94    63.48
256K    9216    36867    216.49     26.47    1    0.51%     2.65    69.43
512K    8065    16130    494.82     42.43    1    1.25%     2.41    72.56
1024K   8130     8130    983.01     64.22    1    0.97%     0.92    73.38
2048K  10685     5342   1496.28    132.24    0    2.50%     0.75    68.89

randwrite
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K         1     375   21318.71   5059.72    0   41.06%     0.10     0.38
8K         2     354   22548.71   3084.57    0    4.90%     0.11     0.35
16K        5     346   23107.64   2517.95    0    9.77%     0.11     0.49
32K       13     420   19001.29   5500.62    0   34.75%     0.22     1.30
64K       33     530   15064.25   3916.28    0    8.07%     0.29     2.92
128K      79     637   12549.72   3249.85    0    3.99%     0.72     4.60
256K     184     739   10812.12   2576.32    0   34.02%     3.81     4.32
512K     307     615   12995.86   2891.70    0    2.99%     2.31     4.31
1024K    611     611   13071.85   3287.53    0    6.96%     3.60     8.42
2048K   1051     525   15209.81   3562.27    0   35.79%     8.67    20.12

Bitmap none, array with the same settings (previous array was shut down,
drives were "cleansed" with nvme format):

randread
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K      571    146399     53.80     25.07    0    5.17%    13.54    58.45
8K      1147   146866     53.87     17.48    0    3.10%    11.20    59.26
16K     1970   126136     62.70     20.11    0    2.64%    11.06    58.88
32K     3519   112637     70.36     23.60    1    1.98%    11.05    54.55
64K     6502   104037     76.27     21.71    1    1.52%     9.60    60.40
128K    7886    63093    126.05     21.88    1    1.19%     6.84    65.40
256K    9446    37787    211.05     27.00    1    0.77%     3.60    69.37
512K    8397    16794    475.58     42.16    1    1.45%     1.85    71.99
1024K   8510     8510    939.13     55.02    1    1.01%     1.00    72.60
2048K  11035     5517   1448.77     84.14    1    1.99%     0.74    73.49

randwrite
BS      BW     IOPS     LAT_µs    LAT_DEV   SS  SS_perc  USR_CPU  SYS_CPU
4K       195    50151    158.96     48.56    1    1.13%     5.74    34.68
8K       264    33897    235.39     77.11    1    1.32%     4.60    34.46
16K      343    22003    362.88    111.80    1    1.70%     5.34    37.17
32K      645    20642    386.83    145.86    0   33.84%     6.48    45.15
64K      917    14680    543.97    170.23    0    3.01%     6.05    53.27
128K    1416    11332    704.94    202.18    0    4.66%     9.69    57.63
256K    1394     5576   1433.60    375.88    1    1.52%     8.53    24.93
512K    1726     3452   2316.19    500.19    1    1.18%    12.38    30.54
1024K   2598     2598   3077.47    629.37    0    2.53%    18.74    47.02
2048K   2457     1228   6508.20   1825.67    0    3.32%    28.70    61.01

Reads are fine but writes are many times slower ...

> I still can't believe your test result, and I can't figure out why the
> internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
> raid5 and ran the same fio script to test; the result is quite
> different from yours:
>
> ram0: 981MiB/s
> non-bitmap: 132MiB/s
> internal-bitmap: 95.5MiB/s

I don't know. I can provide full fio test logs, including fio "tracing",
for these iodepth 8 tests if that would make any difference.

> There is absolutely something wrong here, it doesn't make sense to me
> that the internal bitmap is so slow. However, I have no idea until you
> can provide the perf result.

I may be able to find time to do that over the weekend, but don't hold
me to it. The test setup will not be the same, as the server is in
production ... I did leave some "spare" partitions on all drives to
investigate this issue further but did not find the time.

Please send me an example of how you would like me to run the perf tool;
I haven't used it much.

Thanks
Dragan
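[As a rough illustration, the block-size/iodepth sweep Dragan describes could be scripted like this. This is a sketch only - his actual script and its result parsing are not shown in the thread; the target device and per-point runtime are taken from the fio command lines above:]

for iodepth in 1 8; do
    for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k 2048k; do
        for rw in randread randwrite; do
            # one 60s data point per (rw, bs, iodepth) combination
            fio --filename=/dev/md/raid5 --direct=1 --rw=$rw --bs=$bs \
                --ioengine=libaio --iodepth=$iodepth --runtime=60 \
                --numjobs=1 --group_reporting --time_based \
                --name=Raid5-$rw-$bs-qd$iodepth
        done
    done
done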
On Thu, 7 Nov 2024 17:28:43 -0800 Song Liu <song@kernel.org> wrote:
> On Thu, Nov 7, 2024 at 5:03 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2024/11/08 7:41, Song Liu wrote:
> > > On Thu, Nov 7, 2024 at 5:02 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> > >>
> > >> From: Yu Kuai <yukuai3@huawei.com>
> > >>
> > >> The bitmap file has been marked as deprecated for more than a year now,
> > >> let's remove it, and we don't need to care about this case in the new
> > >> bitmap.
> > >>
> > >> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > >
> > > What happens when an old array with bitmap file boots into a kernel
> > > without bitmap file support?
> >
> > If mdadm is used with bitmap file support, then the kernel will just
> > ignore the bitmap, the same as having no bitmap. Perhaps it's better
> > to leave an error message?
>
> Yes, we should print some error message before assembling the array.
>
> > And if mdadm is updated, reassemble will fail.

It would be great if mdadm could just ignore it too. It comes from the
config file, so you can simply ignore the bitmap entry if it is anything
other than "internal" or "clustered". You can print an error, but you
must do it somewhere else (outside config.c); otherwise the user would
be prompted on every config read. We probably don't need to make it
that noisy, but maybe we should - the user may not notice the change if
we are not screaming it loudly. I have no opinion here.

The first rule is always data access - we should not break that if
possible. In this case I think it is possible to keep such arrays
assembled.

> I think we should ship this with 6.14 (not 6.13), so that we have
> more time testing different combinations of old/new mdadm
> and kernel. WDYT?

Later is better, because it decreases the possibility that someone hits
the case of a new kernel with an old mdadm, where some ioctl/sysfs write
failures will probably be observed. I would say that we should wait
around one year after removing it from mdadm; that is my preference.

I will merge Kuai's changes soon, before the release. I think it is
valuable to have it blocked in the new mdadm release.

Mariusz
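[For context, the config entry Mariusz refers to is the bitmap= option on an mdadm.conf ARRAY line. A hypothetical example follows - the UUID is reused from the test output above purely as a placeholder, and the skip-instead-of-fail behaviour is only proposed in this thread, not implemented:]

# Old array created with a deprecated file-based bitmap. Under the
# proposal above, an updated mdadm would ignore any bitmap= value other
# than "internal" or "clustered" instead of refusing to assemble.
ARRAY /dev/md/raid5 metadata=1.2 name=raid5 UUID=93fdcd4b:ae61a1f8:4d809242:2cd4a4c7 bitmap=/bitmap/bitmap.bin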
Hi,

On 2024/11/11 22:07, Dragan Milivojević wrote:
> Reads are fine but writes are many times slower ...
>
>> I still can't believe your test result, and I can't figure out why the
>> internal bitmap is so slow. Hence I used a ramdisk (10GB) to create a
>> raid5 and ran the same fio script to test; the result is quite
>> different from yours:
>>
>> ram0: 981MiB/s
>> non-bitmap: 132MiB/s
>> internal-bitmap: 95.5MiB/s

So I waited for Paul to have a chance to test this on real disks;
the results are still similar to the above.

> I don't know. I can provide full fio test logs, including fio
> "tracing", for these iodepth 8 tests if that would make any difference.

No, I don't need fio logs.

>> There is absolutely something wrong here, it doesn't make sense to me
>> that the internal bitmap is so slow. However, I have no idea until you
>> can provide the perf result.
>
> I may be able to find time to do that over the weekend, but don't hold
> me to it. The test setup will not be the same, as the server is in
> production ... I did leave some "spare" partitions on all drives to
> investigate this issue further but did not find the time.
>
> Please send me an example of how you would like me to run the perf
> tool; I haven't used it much.

You can see examples here:

https://github.com/brendangregg/FlameGraph

In short, while the test is running:

perf record -a -g -- sleep 10
perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl

BTW, you said that you're using a production environment; this will
probably make it hard to analyze performance.

Thanks,
Kuai
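[Spelled out end to end, the recipe above amounts to something like the following. The paths are assumptions - it presumes the FlameGraph repo is cloned into the working directory and that the fio job is already running when perf samples:]

git clone https://github.com/brendangregg/FlameGraph
perf record -a -g -- sleep 10        # system-wide sample with call graphs, 10s
perf script -i perf.data \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > md-bitmap.svg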
On 13/11/2024 02:18, Yu Kuai wrote:
>>> ram0: 981MiB/s
>>> non-bitmap: 132MiB/s
>>> internal-bitmap: 95.5MiB/s
>
> So I waited for Paul to have a chance to test this on real disks;
> the results are still similar to the above.

That is interesting. How are you running those tests? I should try them
on my hardware as well.

> You can see examples here:
>
> https://github.com/brendangregg/FlameGraph
>
> In short, while the test is running:
>
> perf record -a -g -- sleep 10
> perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl
>
> BTW, you said that you're using a production environment; this will
> probably make it hard to analyze performance.

I may be able to move things around for the weekend; we will see.

Thanks
Dragan
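[For anyone wanting to repeat the ramdisk comparison Yu Kuai mentions, one possible setup is sketched below. This is an assumption - the thread does not show his exact commands. brd's rd_size is in KiB, so five 2 GiB ramdisks stand in for the 10GB figure; the mdadm and fio parameters mirror the NVMe tests above:]

modprobe brd rd_nr=5 rd_size=2097152     # five 2GiB ram-backed block devices
mdadm --create --assume-clean --bitmap=internal /dev/md/raid5 \
      --name=raid5 --level=5 --chunk=64K --raid-devices=5 /dev/ram{0..4}
fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 \
    --group_reporting --time_based --name=Raid5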
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 29da10e6f703..6895883fc622 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -167,8 +167,6 @@ struct bitmap {
     int need_sync;
 
     struct bitmap_storage {
-        /* backing disk file */
-        struct file *file;
         /* cached copy of the bitmap file superblock */
         struct page *sb_page;
         unsigned long sb_index;
@@ -495,135 +493,6 @@ static void write_sb_page(struct bitmap *bitmap, unsigned long pg_index,
 
 static void md_bitmap_file_kick(struct bitmap *bitmap);
 
-#ifdef CONFIG_MD_BITMAP_FILE
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-    struct buffer_head *bh = page_buffers(page);
-
-    while (bh && bh->b_blocknr) {
-        atomic_inc(&bitmap->pending_writes);
-        set_buffer_locked(bh);
-        set_buffer_mapped(bh);
-        submit_bh(REQ_OP_WRITE | REQ_SYNC, bh);
-        bh = bh->b_this_page;
-    }
-
-    if (wait)
-        wait_event(bitmap->write_wait,
-               atomic_read(&bitmap->pending_writes) == 0);
-}
-
-static void end_bitmap_write(struct buffer_head *bh, int uptodate)
-{
-    struct bitmap *bitmap = bh->b_private;
-
-    if (!uptodate)
-        set_bit(BITMAP_WRITE_ERROR, &bitmap->flags);
-    if (atomic_dec_and_test(&bitmap->pending_writes))
-        wake_up(&bitmap->write_wait);
-}
-
-static void free_buffers(struct page *page)
-{
-    struct buffer_head *bh;
-
-    if (!PagePrivate(page))
-        return;
-
-    bh = page_buffers(page);
-    while (bh) {
-        struct buffer_head *next = bh->b_this_page;
-        free_buffer_head(bh);
-        bh = next;
-    }
-    detach_page_private(page);
-    put_page(page);
-}
-
-/* read a page from a file.
- * We both read the page, and attach buffers to the page to record the
- * address of each block (using bmap). These addresses will be used
- * to write the block later, completely bypassing the filesystem.
- * This usage is similar to how swap files are handled, and allows us
- * to write to a file with no concerns of memory allocation failing.
- */
-static int read_file_page(struct file *file, unsigned long index,
-        struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-    int ret = 0;
-    struct inode *inode = file_inode(file);
-    struct buffer_head *bh;
-    sector_t block, blk_cur;
-    unsigned long blocksize = i_blocksize(inode);
-
-    pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
-         (unsigned long long)index << PAGE_SHIFT);
-
-    bh = alloc_page_buffers(page, blocksize);
-    if (!bh) {
-        ret = -ENOMEM;
-        goto out;
-    }
-    attach_page_private(page, bh);
-    blk_cur = index << (PAGE_SHIFT - inode->i_blkbits);
-    while (bh) {
-        block = blk_cur;
-
-        if (count == 0)
-            bh->b_blocknr = 0;
-        else {
-            ret = bmap(inode, &block);
-            if (ret || !block) {
-                ret = -EINVAL;
-                bh->b_blocknr = 0;
-                goto out;
-            }
-
-            bh->b_blocknr = block;
-            bh->b_bdev = inode->i_sb->s_bdev;
-            if (count < blocksize)
-                count = 0;
-            else
-                count -= blocksize;
-
-            bh->b_end_io = end_bitmap_write;
-            bh->b_private = bitmap;
-            atomic_inc(&bitmap->pending_writes);
-            set_buffer_locked(bh);
-            set_buffer_mapped(bh);
-            submit_bh(REQ_OP_READ, bh);
-        }
-        blk_cur++;
-        bh = bh->b_this_page;
-    }
-
-    wait_event(bitmap->write_wait,
-           atomic_read(&bitmap->pending_writes)==0);
-    if (test_bit(BITMAP_WRITE_ERROR, &bitmap->flags))
-        ret = -EIO;
-out:
-    if (ret)
-        pr_err("md: bitmap read error: (%dB @ %llu): %d\n",
-               (int)PAGE_SIZE,
-               (unsigned long long)index << PAGE_SHIFT,
-               ret);
-    return ret;
-}
-#else /* CONFIG_MD_BITMAP_FILE */
-static void write_file_page(struct bitmap *bitmap, struct page *page, int wait)
-{
-}
-static int read_file_page(struct file *file, unsigned long index,
-        struct bitmap *bitmap, unsigned long count, struct page *page)
-{
-    return -EIO;
-}
-static void free_buffers(struct page *page)
-{
-    put_page(page);
-}
-#endif /* CONFIG_MD_BITMAP_FILE */
-
 /*
  * bitmap file superblock operations
  */
@@ -642,10 +511,7 @@ static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
         pg_index += store->sb_index;
     }
 
-    if (store->file)
-        write_file_page(bitmap, page, wait);
-    else
-        write_sb_page(bitmap, pg_index, page, wait);
+    write_sb_page(bitmap, pg_index, page, wait);
 }
 
 /*
@@ -655,18 +521,15 @@ static void filemap_write_page(struct bitmap *bitmap, unsigned long pg_index,
  */
 static void md_bitmap_wait_writes(struct bitmap *bitmap)
 {
-    if (bitmap->storage.file)
-        wait_event(bitmap->write_wait,
-               atomic_read(&bitmap->pending_writes)==0);
-    else
-        /* Note that we ignore the return value. The writes
-         * might have failed, but that would just mean that
-         * some bits which should be cleared haven't been,
-         * which is safe. The relevant bitmap blocks will
-         * probably get written again, but there is no great
-         * loss if they aren't.
-         */
-        md_super_wait(bitmap->mddev);
+    /*
+     * Note that we ignore the return value. The writes
+     * might have failed, but that would just mean that
+     * some bits which should be cleared haven't been,
+     * which is safe. The relevant bitmap blocks will
+     * probably get written again, but there is no great
+     * loss if they aren't.
+     */
+    md_super_wait(bitmap->mddev);
 }
 
 
@@ -704,11 +567,8 @@ static void bitmap_update_sb(void *data)
                    bitmap_info.space);
     kunmap_atomic(sb);
 
-    if (bitmap->storage.file)
-        write_file_page(bitmap, bitmap->storage.sb_page, 1);
-    else
-        write_sb_page(bitmap, bitmap->storage.sb_index,
-                  bitmap->storage.sb_page, 1);
+    write_sb_page(bitmap, bitmap->storage.sb_index, bitmap->storage.sb_page,
+              1);
 }
 
 static void bitmap_print_sb(struct bitmap *bitmap)
@@ -821,7 +681,7 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
     struct page *sb_page;
     loff_t offset = 0;
 
-    if (!bitmap->storage.file && !bitmap->mddev->bitmap_info.offset) {
+    if (!bitmap->mddev->bitmap_info.offset) {
         chunksize = 128 * 1024 * 1024;
         daemon_sleep = 5 * HZ;
         write_behind = 0;
@@ -851,16 +711,8 @@ static int md_bitmap_read_sb(struct bitmap *bitmap)
             bitmap->cluster_slot, offset);
     }
 
-    if (bitmap->storage.file) {
-        loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
-        int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
-
-        err = read_file_page(bitmap->storage.file, 0,
-                bitmap, bytes, sb_page);
-    } else {
-        err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
-                   sizeof(bitmap_super_t));
-    }
+    err = read_sb_page(bitmap->mddev, offset, sb_page, 0,
+               sizeof(bitmap_super_t));
 
     if (err)
         return err;
@@ -1062,25 +914,18 @@ static int md_bitmap_storage_alloc(struct bitmap_storage *store,
 
 static void md_bitmap_file_unmap(struct bitmap_storage *store)
 {
-    struct file *file = store->file;
     struct page *sb_page = store->sb_page;
     struct page **map = store->filemap;
     int pages = store->file_pages;
 
     while (pages--)
         if (map[pages] != sb_page) /* 0 is sb_page, release it below */
-            free_buffers(map[pages]);
+            put_page(map[pages]);
     kfree(map);
     kfree(store->filemap_attr);
 
     if (sb_page)
-        free_buffers(sb_page);
-
-    if (file) {
-        struct inode *inode = file_inode(file);
-
-        invalidate_mapping_pages(inode->i_mapping, 0, -1);
-        fput(file);
-    }
+        put_page(sb_page);
 }
 
 /*
@@ -1092,14 +937,8 @@ static void md_bitmap_file_kick(struct bitmap *bitmap)
 {
     if (!test_and_set_bit(BITMAP_STALE, &bitmap->flags)) {
         bitmap_update_sb(bitmap);
-
-        if (bitmap->storage.file) {
-            pr_warn("%s: kicking failed bitmap file %pD4 from array!\n",
-                bmname(bitmap), bitmap->storage.file);
-
-        } else
-            pr_warn("%s: disabling internal bitmap due to errors\n",
-                bmname(bitmap));
+        pr_warn("%s: disabling internal bitmap due to errors\n",
+            bmname(bitmap));
     }
 }
 
@@ -1319,13 +1158,12 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
     struct mddev *mddev = bitmap->mddev;
     unsigned long chunks = bitmap->counts.chunks;
     struct bitmap_storage *store = &bitmap->storage;
-    struct file *file = store->file;
     unsigned long node_offset = 0;
     unsigned long bit_cnt = 0;
     unsigned long i;
     int ret;
 
-    if (!file && !mddev->bitmap_info.offset) {
+    if (!mddev->bitmap_info.offset) {
         /* No permanent bitmap - fill with '1s'. */
         store->filemap = NULL;
         store->file_pages = 0;
@@ -1340,15 +1178,6 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
         return 0;
     }
 
-    if (file && i_size_read(file->f_mapping->host) < store->bytes) {
-        pr_warn("%s: bitmap file too short %lu < %lu\n",
-            bmname(bitmap),
-            (unsigned long) i_size_read(file->f_mapping->host),
-            store->bytes);
-        ret = -ENOSPC;
-        goto err;
-    }
-
     if (mddev_is_clustered(mddev))
         node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
@@ -1362,11 +1191,7 @@ static int md_bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
         else
             count = PAGE_SIZE;
 
-        if (file)
-            ret = read_file_page(file, i, bitmap, count, page);
-        else
-            ret = read_sb_page(mddev, 0, page, i + node_offset,
-                       count);
+        ret = read_sb_page(mddev, 0, page, i + node_offset, count);
 
         if (ret)
             goto err;
     }
@@ -1444,10 +1269,6 @@ static void bitmap_write_all(struct mddev *mddev)
     if (!bitmap || !bitmap->storage.filemap)
         return;
 
-    /* Only one copy, so nothing needed */
-    if (bitmap->storage.file)
-        return;
-
     for (i = 0; i < bitmap->storage.file_pages; i++)
         set_page_attr(bitmap, i, BITMAP_PAGE_NEEDWRITE);
     bitmap->allclean = 0;
@@ -2105,14 +1926,11 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
 {
     struct bitmap *bitmap;
     sector_t blocks = mddev->resync_max_sectors;
-    struct file *file = mddev->bitmap_info.file;
     int err;
     struct kernfs_node *bm = NULL;
 
     BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
 
-    BUG_ON(file && mddev->bitmap_info.offset);
-
     if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
         pr_notice("md/raid:%s: array with journal cannot have bitmap\n",
               mdname(mddev));
@@ -2140,15 +1958,6 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
     } else
         bitmap->sysfs_can_clear = NULL;
 
-    bitmap->storage.file = file;
-    if (file) {
-        get_file(file);
-        /* As future accesses to this file will use bmap,
-         * and bypass the page cache, we must sync the file
-         * first.
-         */
-        vfs_fsync(file, 1);
-    }
     /* read superblock from bitmap file (this sets mddev->bitmap_info.chunksize) */
     if (!mddev->bitmap_info.external) {
         /*
@@ -2352,7 +2161,6 @@ static int bitmap_get_stats(void *data, struct md_bitmap_stats *stats)
     storage = &bitmap->storage;
 
     stats->file_pages = storage->file_pages;
-    stats->file = storage->file;
 
     stats->behind_writes = atomic_read(&bitmap->behind_writes);
     stats->behind_wait = wq_has_sleeper(&bitmap->behind_wait);
@@ -2383,11 +2191,6 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     long pages;
     struct bitmap_page *new_bp;
 
-    if (bitmap->storage.file && !init) {
-        pr_info("md: cannot resize file-based bitmap\n");
-        return -EINVAL;
-    }
-
     if (chunksize == 0) {
         /* If there is enough space, leave the chunk size unchanged,
          * else increase by factor of two until there is enough space.
@@ -2421,7 +2224,7 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
 
     memset(&store, 0, sizeof(store));
-    if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
+    if (bitmap->mddev->bitmap_info.offset)
         ret = md_bitmap_storage_alloc(&store, chunks,
                       !bitmap->mddev->bitmap_info.external,
                       mddev_is_clustered(bitmap->mddev)
@@ -2443,9 +2246,6 @@ static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks,
     if (!init)
         bitmap->mddev->pers->quiesce(bitmap->mddev, 1);
 
-    store.file = bitmap->storage.file;
-    bitmap->storage.file = NULL;
-
     if (store.sb_page && bitmap->storage.sb_page)
         memcpy(page_address(store.sb_page),
                page_address(bitmap->storage.sb_page),
@@ -2582,9 +2382,7 @@ static ssize_t location_show(struct mddev *mddev, char *page)
 {
     ssize_t len;
 
-    if (mddev->bitmap_info.file)
-        len = sprintf(page, "file");
-    else if (mddev->bitmap_info.offset)
+    if (mddev->bitmap_info.offset)
         len = sprintf(page, "%+lld", (long long)mddev->bitmap_info.offset);
     else
         len = sprintf(page, "none");
@@ -2608,8 +2406,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
         }
     }
 
-    if (mddev->bitmap || mddev->bitmap_info.file ||
-        mddev->bitmap_info.offset) {
+    if (mddev->bitmap || mddev->bitmap_info.offset) {
         /* bitmap already configured. Only option is to clear it */
         if (strncmp(buf, "none", 4) != 0) {
             rv = -EBUSY;
@@ -2618,22 +2415,11 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
 
         bitmap_destroy(mddev);
         mddev->bitmap_info.offset = 0;
-        if (mddev->bitmap_info.file) {
-            struct file *f = mddev->bitmap_info.file;
-
-            mddev->bitmap_info.file = NULL;
-            fput(f);
-        }
     } else {
         /* No bitmap, OK to set a location */
         long long offset;
 
-        if (strncmp(buf, "none", 4) == 0)
-            /* nothing to be done */;
-        else if (strncmp(buf, "file:", 5) == 0) {
-            /* Not supported yet */
-            rv = -EINVAL;
-            goto out;
-        } else {
+        if (strncmp(buf, "none", 4) != 0) {
             if (buf[0] == '+')
                 rv = kstrtoll(buf+1, 10, &offset);
             else
@@ -2864,10 +2650,9 @@ static ssize_t metadata_show(struct mddev *mddev, char *page)
 static ssize_t metadata_store(struct mddev *mddev, const char *buf, size_t len)
 {
-    if (mddev->bitmap ||
-        mddev->bitmap_info.file ||
-        mddev->bitmap_info.offset)
+    if (mddev->bitmap || mddev->bitmap_info.offset)
         return -EBUSY;
+
     if (strncmp(buf, "external", 8) == 0)
         mddev->bitmap_info.external = 1;
     else if ((strncmp(buf, "internal", 8) == 0) ||
diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h
index 662e6fc141a7..4b386954f5f5 100644
--- a/drivers/md/md-bitmap.h
+++ b/drivers/md/md-bitmap.h
@@ -67,7 +67,6 @@ struct md_bitmap_stats {
     unsigned long file_pages;
     unsigned long sync_size;
     unsigned long pages;
-    struct file *file;
 };
 
 struct bitmap_operations {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 35c2e1e761aa..03f2a9fafea2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1155,7 +1155,7 @@ struct super_type {
  */
 int md_check_no_bitmap(struct mddev *mddev)
 {
-    if (!mddev->bitmap_info.file && !mddev->bitmap_info.offset)
+    if (!mddev->bitmap_info.offset)
         return 0;
     pr_warn("%s: bitmaps are not supported for %s\n",
         mdname(mddev), mddev->pers->name);
@@ -1349,8 +1349,7 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *freshest, stru
 
         mddev->max_disks = MD_SB_DISKS;
 
-        if (sb->state & (1<<MD_SB_BITMAP_PRESENT) &&
-            mddev->bitmap_info.file == NULL) {
+        if (sb->state & (1<<MD_SB_BITMAP_PRESENT)) {
             mddev->bitmap_info.offset =
                 mddev->bitmap_info.default_offset;
             mddev->bitmap_info.space =
@@ -1476,7 +1475,7 @@ static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
     sb->layout = mddev->layout;
     sb->chunk_size = mddev->chunk_sectors << 9;
 
-    if (mddev->bitmap && mddev->bitmap_info.file == NULL)
+    if (mddev->bitmap)
         sb->state |= (1<<MD_SB_BITMAP_PRESENT);
 
     sb->disks[0].state = (1<<MD_DISK_REMOVED);
@@ -1824,8 +1823,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
 
         mddev->max_disks = (4096-256)/2;
 
-        if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) &&
-            mddev->bitmap_info.file == NULL) {
+        if (le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
             mddev->bitmap_info.offset =
                 (__s32)le32_to_cpu(sb->bitmap_offset);
             /* Metadata doesn't record how much space is available.
@@ -2030,7 +2028,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
     sb->data_offset = cpu_to_le64(rdev->data_offset);
     sb->data_size = cpu_to_le64(rdev->sectors);
 
-    if (mddev->bitmap && mddev->bitmap_info.file == NULL) {
+    if (mddev->bitmap) {
         sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_info.offset);
         sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
     }
@@ -2227,6 +2225,10 @@ static int
 super_1_allow_new_offset(struct md_rdev *rdev,
              unsigned long long new_offset)
 {
+    struct mddev *mddev = rdev->mddev;
+    struct md_bitmap_stats stats;
+    int err;
+
     /* All necessary checks on new >= old have been done */
     if (new_offset >= rdev->data_offset)
         return 1;
@@ -2245,21 +2247,12 @@ super_1_allow_new_offset(struct md_rdev *rdev,
     if (rdev->sb_start + (32+4)*2 > new_offset)
         return 0;
 
-    if (!rdev->mddev->bitmap_info.file) {
-        struct mddev *mddev = rdev->mddev;
-        struct md_bitmap_stats stats;
-        int err;
-
-        err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
-        if (!err && rdev->sb_start + mddev->bitmap_info.offset +
-            stats.file_pages * (PAGE_SIZE >> 9) > new_offset)
-            return 0;
-    }
-
-    if (rdev->badblocks.sector + rdev->badblocks.size > new_offset)
-        return 0;
+    err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
+    if (err)
+        return 1;
 
-    return 1;
+    return rdev->sb_start + mddev->bitmap_info.offset +
+           stats.file_pages * (PAGE_SIZE >> 9) <= new_offset;
 }
 
 static struct super_type super_types[] = {
@@ -6150,8 +6143,7 @@ int md_run(struct mddev *mddev)
                 (unsigned long long)pers->size(mddev, 0, 0) / 2);
             err = -EINVAL;
         }
-        if (err == 0 && pers->sync_request &&
-            (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
+        if (err == 0 && pers->sync_request && mddev->bitmap_info.offset) {
             err = mddev->bitmap_ops->create(mddev, -1);
             if (err)
                 pr_warn("%s: failed to create bitmap (%d)\n",
@@ -6563,17 +6555,8 @@ static int do_md_stop(struct mddev *mddev, int mode)
 
     if (mode == 0) {
         pr_info("md: %s stopped.\n", mdname(mddev));
 
-        if (mddev->bitmap_info.file) {
-            struct file *f = mddev->bitmap_info.file;
-            spin_lock(&mddev->lock);
-            mddev->bitmap_info.file = NULL;
-            spin_unlock(&mddev->lock);
-            fput(f);
-        }
         mddev->bitmap_info.offset = 0;
-
         export_array(mddev);
-
         md_clean(mddev);
         if (mddev->hold_active == UNTIL_STOP)
             mddev->hold_active = 0;
@@ -6767,38 +6750,6 @@ static int get_array_info(struct mddev *mddev, void __user *arg)
     return 0;
 }
 
-static int get_bitmap_file(struct mddev *mddev, void __user * arg)
-{
-    mdu_bitmap_file_t *file = NULL; /* too big for stack allocation */
-    char *ptr;
-    int err;
-
-    file = kzalloc(sizeof(*file), GFP_NOIO);
-    if (!file)
-        return -ENOMEM;
-
-    err = 0;
-    spin_lock(&mddev->lock);
-    /* bitmap enabled */
-    if (mddev->bitmap_info.file) {
-        ptr = file_path(mddev->bitmap_info.file, file->pathname,
-                sizeof(file->pathname));
-        if (IS_ERR(ptr))
-            err = PTR_ERR(ptr);
-        else
-            memmove(file->pathname, ptr,
-                sizeof(file->pathname)-(ptr-file->pathname));
-    }
-    spin_unlock(&mddev->lock);
-
-    if (err == 0 &&
-        copy_to_user(arg, file, sizeof(*file)))
-        err = -EFAULT;
-
-    kfree(file);
-    return err;
-}
-
 static int get_disk_info(struct mddev *mddev, void __user * arg)
 {
     mdu_disk_info_t info;
@@ -7153,92 +7104,6 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev)
     return err;
 }
 
-static int set_bitmap_file(struct mddev *mddev, int fd)
-{
-    int err = 0;
-
-    if (mddev->pers) {
-        if (!mddev->pers->quiesce || !mddev->thread)
-            return -EBUSY;
-        if (mddev->recovery || mddev->sync_thread)
-            return -EBUSY;
-        /* we should be able to change the bitmap.. */
-    }
-
-    if (fd >= 0) {
-        struct inode *inode;
-        struct file *f;
-
-        if (mddev->bitmap || mddev->bitmap_info.file)
-            return -EEXIST; /* cannot add when bitmap is present */
-
-        if (!IS_ENABLED(CONFIG_MD_BITMAP_FILE)) {
-            pr_warn("%s: bitmap files not supported by this kernel\n",
-                mdname(mddev));
-            return -EINVAL;
-        }
-        pr_warn("%s: using deprecated bitmap file support\n",
-            mdname(mddev));
-
-        f = fget(fd);
-
-        if (f == NULL) {
-            pr_warn("%s: error: failed to get bitmap file\n",
-                mdname(mddev));
-            return -EBADF;
-        }
-
-        inode = f->f_mapping->host;
-        if (!S_ISREG(inode->i_mode)) {
-            pr_warn("%s: error: bitmap file must be a regular file\n",
-                mdname(mddev));
-            err = -EBADF;
-        } else if (!(f->f_mode & FMODE_WRITE)) {
-            pr_warn("%s: error: bitmap file must open for write\n",
-                mdname(mddev));
-            err = -EBADF;
-        } else if (atomic_read(&inode->i_writecount) != 1) {
-            pr_warn("%s: error: bitmap file is already in use\n",
-                mdname(mddev));
-            err = -EBUSY;
-        }
-        if (err) {
-            fput(f);
-            return err;
-        }
-        mddev->bitmap_info.file = f;
-        mddev->bitmap_info.offset = 0; /* file overrides offset */
-    } else if (mddev->bitmap == NULL)
-        return -ENOENT; /* cannot remove what isn't there */
-    err = 0;
-    if (mddev->pers) {
-        if (fd >= 0) {
-            err = mddev->bitmap_ops->create(mddev, -1);
-            if (!err)
-                err = mddev->bitmap_ops->load(mddev);
-
-            if (err) {
-                mddev->bitmap_ops->destroy(mddev);
-                fd = -1;
-            }
-        } else if (fd < 0) {
-            mddev->bitmap_ops->destroy(mddev);
-        }
-    }
-
-    if (fd < 0) {
-        struct file *f = mddev->bitmap_info.file;
-        if (f) {
-            spin_lock(&mddev->lock);
-            mddev->bitmap_info.file = NULL;
-            spin_unlock(&mddev->lock);
-            fput(f);
-        }
-    }
-
-    return err;
-}
-
 /*
  * md_set_array_info is used two different ways
  * The original usage is when creating a new array.
@@ -7520,11 +7385,6 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
             if (rv)
                 goto err;
 
-            if (stats.file) {
-                rv = -EINVAL;
-                goto err;
-            }
-
             if (mddev->bitmap_info.nodes) {
                 /* hold PW on all the bitmap lock */
                 if (md_cluster_ops->lock_all_bitmaps(mddev) <= 0) {
@@ -7589,18 +7449,19 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 static inline int md_ioctl_valid(unsigned int cmd)
 {
     switch (cmd) {
+    case GET_BITMAP_FILE:
+    case SET_BITMAP_FILE:
+        return -EOPNOTSUPP;
     case GET_ARRAY_INFO:
     case GET_DISK_INFO:
     case RAID_VERSION:
         return 0;
     case ADD_NEW_DISK:
-    case GET_BITMAP_FILE:
    case HOT_ADD_DISK:
     case HOT_REMOVE_DISK:
     case RESTART_ARRAY_RW:
     case RUN_ARRAY:
     case SET_ARRAY_INFO:
-    case SET_BITMAP_FILE:
     case SET_DISK_FAULTY:
     case STOP_ARRAY:
     case STOP_ARRAY_RO:
@@ -7619,7 +7480,6 @@ static bool md_ioctl_need_suspend(unsigned int cmd)
     case ADD_NEW_DISK:
     case HOT_ADD_DISK:
     case HOT_REMOVE_DISK:
-    case SET_BITMAP_FILE:
     case SET_ARRAY_INFO:
         return true;
     default:
@@ -7699,9 +7559,6 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 
     case SET_DISK_FAULTY:
         return set_disk_faulty(mddev, new_decode_dev(arg));
-
-    case GET_BITMAP_FILE:
-        return get_bitmap_file(mddev, argp);
     }
 
     if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) {
@@ -7734,10 +7591,8 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
      */
     /* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
      * RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
-    if ((!mddev->raid_disks && !mddev->external)
-        && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
-        && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
-        && cmd != GET_BITMAP_FILE) {
+    if (!mddev->raid_disks && !mddev->external && cmd != ADD_NEW_DISK &&
+        cmd != STOP_ARRAY && cmd != RUN_ARRAY) {
         err = -ENODEV;
         goto unlock;
     }
@@ -7833,10 +7688,6 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
         err = do_md_run(mddev);
         goto unlock;
 
-    case SET_BITMAP_FILE:
-        err = set_bitmap_file(mddev, (int)arg);
-        goto unlock;
-
     default:
         err = -EINVAL;
         goto unlock;
@@ -7855,6 +7706,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
     clear_bit(MD_CLOSING, &mddev->flags);
     return err;
 }
+
 #ifdef CONFIG_COMPAT
 static int md_compat_ioctl(struct block_device *bdev, blk_mode_t mode,
             unsigned int cmd, unsigned long arg)
@@ -8328,11 +8180,6 @@ static void md_bitmap_status(struct seq_file *seq, struct mddev *mddev)
            chunk_kb ? chunk_kb : mddev->bitmap_info.chunksize,
            chunk_kb ? "KB" : "B");
 
-    if (stats.file) {
-        seq_puts(seq, ", file: ");
-        seq_file_path(seq, stats.file, " \t\n");
-    }
-
     seq_putc(seq, '\n');
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4ba93af36126..bae257bc630c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -360,6 +360,34 @@ enum {
     MD_RESYNC_ACTIVE = 3,
 };
 
+struct bitmap_info {
+    /*
+     * offset from superblock of start of bitmap. May be negative, but not
+     * '0' For external metadata, offset from start of device.
+     */
+    loff_t offset;
+    /* space available at this offset */
+    unsigned long space;
+    /*
+     * this is the offset to use when hot-adding a bitmap. It should
+     * eventually be settable by sysfs.
+     */
+    loff_t default_offset;
+    /* space available at default offset */
+    unsigned long default_space;
+    struct mutex mutex;
+    unsigned long chunksize;
+    /* how many jiffies between updates? */
+    unsigned long daemon_sleep;
+    /* write-behind mode */
+    unsigned long max_write_behind;
+    int external;
+    /* Maximum number of nodes in the cluster */
+    int nodes;
+    /* Name of the cluster */
+    char cluster_name[64];
+};
+
 struct mddev {
     void *private;
     struct md_personality *pers;
@@ -519,7 +547,6 @@ struct mddev {
      * in_sync - and related safemode and MD_CHANGE changes
      * pers (also protected by reconfig_mutex and pending IO).
      * clearing ->bitmap
-     * clearing ->bitmap_info.file
      * changing ->resync_{min,max}
      * setting MD_RECOVERY_RUNNING (which interacts with resync_{min,max})
      */
@@ -537,29 +564,7 @@ struct mddev {
 
     void *bitmap; /* the bitmap for the device */
     struct bitmap_operations *bitmap_ops;
-    struct {
-        struct file *file; /* the bitmap file */
-        loff_t offset; /* offset from superblock of
-                        * start of bitmap. May be
-                        * negative, but not '0'
-                        * For external metadata, offset
-                        * from start of device.
-                        */
-        unsigned long space; /* space available at this offset */
-        loff_t default_offset; /* this is the offset to use when
-                                * hot-adding a bitmap. It should
-                                * eventually be settable by sysfs.
-                                */
-        unsigned long default_space; /* space available at
-                                      * default offset */
-        struct mutex mutex;
-        unsigned long chunksize;
-        unsigned long daemon_sleep; /* how many jiffies between updates? */
-        unsigned long max_write_behind; /* write-behind mode */
-        int external;
-        int nodes; /* Maximum number of nodes in the cluster */
-        char cluster_name[64]; /* Name of the cluster */
-    } bitmap_info;
+    struct bitmap_info bitmap_info;
 
     atomic_t max_corr_read_errors; /* max read retries */
     struct list_head all_mddevs;
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 37c4da5311ca..6a1c8d6e1849 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1332,7 +1332,7 @@ int ppl_init_log(struct r5conf *conf)
         return -EINVAL;
     }
 
-    if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
+    if (mddev->bitmap_info.offset) {
         pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
             mdname(mddev));
         return -EINVAL;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f5ac81dd21b2..296501838a60 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7811,7 +7811,7 @@ static int raid5_run(struct mddev *mddev)
     }
 
     if ((test_bit(MD_HAS_JOURNAL, &mddev->flags) || journal_dev) &&
-        (mddev->bitmap_info.offset || mddev->bitmap_info.file)) {
+        (mddev->bitmap_info.offset)) {
         pr_notice("md/raid:%s: array cannot have both journal and bitmap\n",
               mdname(mddev));
         return -EINVAL;