Message ID | 20230217183139.106-1-jonathan.derrick@linux.dev
---|---
State | Superseded, archived
Series | [v2] md: Use optimal I/O size for last bitmap page
On Sat, Feb 18, 2023 at 2:36 AM Jonathan Derrick <jonathan.derrick@linux.dev> wrote:
>
> From: Jon Derrick <jonathan.derrick@linux.dev>
>
> If the bitmap space has enough room, size the I/O for the last bitmap
> page write to the optimal I/O size for the storage device. The expanded
> write is checked that it won't overrun the data or metadata.
>
> This change helps increase performance by preventing unnecessary
> device-side read-mod-writes due to non-atomic write unit sizes.

Hi Jonathan

Thanks for your patch.

Could you explain more about how the optimal I/O size can affect
the device-side read-mod-writes?

Regards
Xiao

>
> Example Intel/Solidigm P5520
> Raid10, Chunk-size 64M, bitmap-size 57228 bits
>
> $ mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/nvme{0,1,2,3}n1 --assume-clean --bitmap=internal --bitmap-chunk=64M
> $ fio --name=test --direct=1 --filename=/dev/md0 --rw=randwrite --bs=4k --runtime=60
>
> Without patch:
>   write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets
>
> With patch:
>   write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets
>
> Biosnoop:
> Without patch:
> Time      Process      PID   Device      LBA         Size  Lat
> 1.410377  md0_raid10   6900  nvme0n1  W  16          4096  0.02
> 1.410387  md0_raid10   6900  nvme2n1  W  16          4096  0.02
> 1.410374  md0_raid10   6900  nvme3n1  W  16          4096  0.01
> 1.410381  md0_raid10   6900  nvme1n1  W  16          4096  0.02
> 1.410411  md0_raid10   6900  nvme1n1  W  115346512   4096  0.01
> 1.410418  md0_raid10   6900  nvme0n1  W  115346512   4096  0.02
> 1.410915  md0_raid10   6900  nvme2n1  W  24          3584  0.43
> 1.410935  md0_raid10   6900  nvme3n1  W  24          3584  0.45
> 1.411124  md0_raid10   6900  nvme1n1  W  24          3584  0.64
> 1.411147  md0_raid10   6900  nvme0n1  W  24          3584  0.66
> 1.411176  md0_raid10   6900  nvme3n1  W  2019022184  4096  0.01
> 1.411189  md0_raid10   6900  nvme2n1  W  2019022184  4096  0.02
>
> With patch:
> Time      Process      PID   Device      LBA         Size  Lat
> 5.747193  md0_raid10   727   nvme0n1  W  16          4096  0.01
> 5.747192  md0_raid10   727   nvme1n1  W  16          4096  0.02
> 5.747195  md0_raid10   727   nvme3n1  W  16          4096  0.01
> 5.747202  md0_raid10   727   nvme2n1  W  16          4096  0.02
> 5.747229  md0_raid10   727   nvme3n1  W  1196223704  4096  0.02
> 5.747224  md0_raid10   727   nvme0n1  W  1196223704  4096  0.01
> 5.747279  md0_raid10   727   nvme0n1  W  24          4096  0.01
> 5.747279  md0_raid10   727   nvme1n1  W  24          4096  0.02
> 5.747284  md0_raid10   727   nvme3n1  W  24          4096  0.02
> 5.747291  md0_raid10   727   nvme2n1  W  24          4096  0.02
> 5.747314  md0_raid10   727   nvme2n1  W  2234636712  4096  0.01
> 5.747317  md0_raid10   727   nvme1n1  W  2234636712  4096  0.02
>
> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
> ---
> v1->v2: Add more information to commit message
>
>  drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>  1 file changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index e7cc6ba1b657..569297ea9b99 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  	rdev = NULL;
>  	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>  		int size = PAGE_SIZE;
> +		int optimal_size = PAGE_SIZE;
>  		loff_t offset = mddev->bitmap_info.offset;
>
>  		bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  			int last_page_size = store->bytes & (PAGE_SIZE-1);
>  			if (last_page_size == 0)
>  				last_page_size = PAGE_SIZE;
> -			size = roundup(last_page_size,
> -				       bdev_logical_block_size(bdev));
> +			size = roundup(last_page_size, bdev_logical_block_size(bdev));
> +			if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
> +				optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
> +			else
> +				optimal_size = size;
>  		}
> +
> +
>  		/* Just make sure we aren't corrupting data or
>  		 * metadata
>  		 */
> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  				goto bad_alignment;
>  		} else if (offset < 0) {
>  			/* DATA BITMAP METADATA */
> -			if (offset
> -			    + (long)(page->index * (PAGE_SIZE/512))
> -			    + size/512 > 0)
> +			loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
> +			if (size != optimal_size &&
> +			    off + optimal_size/512 <= 0)
> +				size = optimal_size;
> +			else if (off + size/512 > 0)
>  				/* bitmap runs in to metadata */
>  				goto bad_alignment;
>  			if (rdev->data_offset + mddev->dev_sectors
> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  				goto bad_alignment;
>  		} else if (rdev->sb_start < rdev->data_offset) {
>  			/* METADATA BITMAP DATA */
> -			if (rdev->sb_start
> -			    + offset
> -			    + page->index*(PAGE_SIZE/512) + size/512
> -			    > rdev->data_offset)
> +			loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
> +			if (size != optimal_size &&
> +			    off + optimal_size/512 <= rdev->data_offset)
> +				size = optimal_size;
> +			else if (off + size/512 > rdev->data_offset)
>  				/* bitmap runs in to data */
>  				goto bad_alignment;
>  		} else {
> --
> 2.27.0
>
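As an aside on the size arithmetic: the biosnoop traces above show the last bitmap page going out as a 3584-byte write without the patch and a 4096-byte write with it. The standalone sketch below works through the two roundings; the roundup() macro is reproduced so it compiles outside the kernel, and the 3584/512/4096 values are assumptions matching a 512-byte logical block size and a 4 KiB optimal I/O size.

```c
#include <stdio.h>

/* Same rounding the kernel's roundup() macro performs. */
#define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned int last_page_size = 3584;  /* bytes of bitmap in the last page (assumed) */
	unsigned int lbs = 512;              /* bdev_logical_block_size() (assumed)        */
	unsigned int io_opt = 4096;          /* bdev_io_opt(), e.g. derived from NOWS      */

	/* Without the patch: round up only to the logical block size. */
	printf("size         = %u\n", roundup(last_page_size, lbs));    /* prints 3584 */

	/* With the patch: round up to the optimal I/O size when there is room. */
	printf("optimal_size = %u\n", roundup(last_page_size, io_opt)); /* prints 4096 */
	return 0;
}
```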
On 2/20/2023 7:38 PM, Xiao Ni wrote:
> On Sat, Feb 18, 2023 at 2:36 AM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
>>
>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>
>> If the bitmap space has enough room, size the I/O for the last bitmap
>> page write to the optimal I/O size for the storage device. The expanded
>> write is checked that it won't overrun the data or metadata.
>>
>> This change helps increase performance by preventing unnecessary
>> device-side read-mod-writes due to non-atomic write unit sizes.
>
> Hi Jonathan
>
> Thanks for your patch.
>
> Could you explain more about how the optimal I/O size can affect
> the device-side read-mod-writes?

The effects of device-side read-mod-writes are a device-specific
implementation detail. This is not something expected to be universal,
as it's an abstracted detail.

However, the NVMe spec allows the storage namespace to provide
implementation hints about its underlying media.

Both NPWA and NOWS are used in the NVMe driver, where NOWS provides the
optimal I/O size hint:

Per NVMe Command Set 1.0b, Figure 97:
  Namespace Preferred Write Alignment (NPWA): This field indicates the
  recommended write alignment in logical blocks for this namespace.

  Namespace Optimal Write Size (NOWS): This field indicates the size in
  logical blocks for optimal write performance for this namespace.

Per 5.8.2.1 Improved I/O examples (non-normative):
  When constructing write operations, the host should minimally construct
  writes that meet the recommendations of NPWG and NPWA, but may achieve
  optimal write performance by constructing writes that meet the
  recommendation of NOWS.

It then makes sense that an NVMe drive would provide optimal I/O size
hints that match its underlying media unit size, such that sub-4k writes
might invoke a device-side read-mod-write whereas 4k writes become atomic.

>
> Regards
> Xiao

[...]
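For reference, the namespace hints discussed above can also be read directly with nvme-cli, assuming a reasonably recent version is installed; the field names in its `id-ns` output follow the spec identifiers quoted above, and the values are 0's based, hence the `1 +` in the driver code quoted later in this thread.

```sh
# Show the write-size hints a namespace reports (values are in logical blocks).
nvme id-ns /dev/nvme0n1 | grep -Ei 'npwg|npwa|nows'
```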
On Tue, Feb 21, 2023 at 11:31 AM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
> On 2/20/2023 7:38 PM, Xiao Ni wrote:
> > Could you explain more about how the optimal I/O size can affect
> > the device-side read-mod-writes?
>
[...]
>
> It then makes sense that an NVMe drive would provide optimal I/O size
> hints that match its underlying media unit size, such that sub-4k writes
> might invoke a device-side read-mod-write whereas 4k writes become atomic.

Thanks very much for the detailed explanation.

If I want to try it myself, I must use NVMe disks that have a 4k
optimal_io_size, right?
I tried to check in many environments with this command:
cat /sys/block/nvme4n1/queue/optimal_io_size (it shows 0)

So I need to find a machine whose optimal_io_size is 4096, right?

Regards
Xiao

[...]
On 2/20/2023 10:35 PM, Xiao Ni wrote:
> On Tue, Feb 21, 2023 at 11:31 AM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
[...]
> Thanks very much for the detailed explanation.
>
> If I want to try it myself, I must use NVMe disks that have a 4k
> optimal_io_size, right?
> I tried to check in many environments with this command:
> cat /sys/block/nvme4n1/queue/optimal_io_size (it shows 0)
>
> So I need to find a machine whose optimal_io_size is 4096, right?

You can force it with something like this:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d307ae4d8a57..9c349ad54352 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1901,6 +1901,8 @@ static void nvme_update_disk_info(struct gendisk *disk,
 		/* NOWS = Namespace Optimal Write Size */
 		io_opt = bs * (1 + le16_to_cpu(id->nows));
 	}
+	phys_bs = 4096;
+	io_opt = 4096;
 
 	blk_queue_logical_block_size(disk->queue, bs);
 	/*

> Regards
> Xiao

[...]
On Tue, Feb 21, 2023 at 12:34:51AM -0700, Jonathan Derrick wrote:
> You can force it with something like this:
Also scsi_debug lets you set the optimal I/O size using the
opt_blks parameter.
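For a software-only test setup, a scsi_debug device can advertise an arbitrary optimal I/O size via the opt_blks parameter mentioned above; a possible invocation (the sizes are illustrative, and sdX is a placeholder for whatever name the new disk gets):

```sh
# RAM-backed test disk: 1 GiB, 512-byte logical blocks, and an optimal
# transfer length of 8 blocks, i.e. a 4 KiB optimal I/O size.
modprobe scsi_debug dev_size_mb=1024 sector_size=512 opt_blks=8

# Verify what the block layer picked up (in bytes).
cat /sys/block/sdX/queue/optimal_io_size
```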
This becomes pretty unreadable with all the long lines. Can you do a
little refactoring, with:

 1) a prep patch that splits out a helper to write the bitmap to one
    device (i.e. the main loop body)
 2) add a helper that returns the optimal I/O size in your new patch.

Please also use unsigned int instead of plain int for the I/O size
everywhere as it can't be signed.

> 			/* DATA BITMAP METADATA */
> +			loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));

And all the arithmetic here looks odd, to say it nicely. page->index is
a pgoff_t, and multiplying it could cause overflows on 32-bit
architectures, so you need to cast it to a 64-bit type before
multiplying it. It also will be unsigned, and I'm not quite sure if
loff_t is the right type for something that is in Linux block number
units; that would usually be a sector_t. Last but not least it should be
using SECTOR_SIZE and have whitespace around the operators. I know at
least some of this is pre-existing in the code, but I think this is a
good time to get it right.
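A minimal sketch of what the helper from point 2 could look like under the review's constraints (unsigned sizes, fall back to the logical-block rounding); the function name, the byte-based boundary argument, and the exact clamping policy are hypothetical and not taken from the posted patch:

```c
#include <linux/blkdev.h>

/*
 * Hypothetical helper: pick the write size for the last bitmap page.
 * Round up to the device's optimal I/O size when the expanded write
 * still fits below @boundary (bytes of room before data/metadata);
 * otherwise keep the logical-block-size rounding.
 */
static unsigned int bitmap_io_size(struct block_device *bdev,
				   unsigned int last_page_size,
				   loff_t boundary)
{
	unsigned int lbs = bdev_logical_block_size(bdev);
	unsigned int opt = bdev_io_opt(bdev);
	unsigned int size = roundup(last_page_size, lbs);

	if (opt > lbs) {
		unsigned int opt_size = roundup(last_page_size, opt);

		if (opt_size <= boundary)
			return opt_size;
	}
	return size;
}
```

The per-device helper from point 1 would then compute the remaining room once and pass it in, which also keeps the overflow-prone page->index arithmetic in a single place.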
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index e7cc6ba1b657..569297ea9b99 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 	rdev = NULL;
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 		int size = PAGE_SIZE;
+		int optimal_size = PAGE_SIZE;
 		loff_t offset = mddev->bitmap_info.offset;
 
 		bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
@@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 			int last_page_size = store->bytes & (PAGE_SIZE-1);
 			if (last_page_size == 0)
 				last_page_size = PAGE_SIZE;
-			size = roundup(last_page_size,
-				       bdev_logical_block_size(bdev));
+			size = roundup(last_page_size, bdev_logical_block_size(bdev));
+			if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
+				optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
+			else
+				optimal_size = size;
 		}
+
+
 		/* Just make sure we aren't corrupting data or
 		 * metadata
 		 */
@@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				goto bad_alignment;
 		} else if (offset < 0) {
 			/* DATA BITMAP METADATA */
-			if (offset
-			    + (long)(page->index * (PAGE_SIZE/512))
-			    + size/512 > 0)
+			loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
+			if (size != optimal_size &&
+			    off + optimal_size/512 <= 0)
+				size = optimal_size;
+			else if (off + size/512 > 0)
 				/* bitmap runs in to metadata */
 				goto bad_alignment;
 			if (rdev->data_offset + mddev->dev_sectors
@@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				goto bad_alignment;
 		} else if (rdev->sb_start < rdev->data_offset) {
 			/* METADATA BITMAP DATA */
-			if (rdev->sb_start
-			    + offset
-			    + page->index*(PAGE_SIZE/512) + size/512
-			    > rdev->data_offset)
+			loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
+			if (size != optimal_size &&
+			    off + optimal_size/512 <= rdev->data_offset)
+				size = optimal_size;
+			else if (off + size/512 > rdev->data_offset)
 				/* bitmap runs in to data */
 				goto bad_alignment;
 		} else {