
md: Use optimal I/O size for last bitmap page

Message ID 20230118005319.147-1-jonathan.derrick@linux.dev (mailing list archive)
State Accepted, archived
Delegated to: Song Liu
Series md: Use optimal I/O size for last bitmap page

Commit Message

Jonathan Derrick Jan. 18, 2023, 12:53 a.m. UTC
From: Jon Derrick <jonathan.derrick@linux.dev>

If the bitmap space has enough room, size the I/O for the last bitmap
page write to the optimal I/O size for the storage device. The expanded
write is checked that it won't overrun the data or metadata.

This change helps increase performance by preventing unnecessary
device-side read-mod-writes due to non-atomic write unit sizes.

Ex biosnoop log. Device lba size 512, optimal size 4k:
Before:
Time        Process        PID     Device      LBA        Size      Lat
0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02

After:
Time        Process        PID     Device      LBA        Size      Lat
18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02

Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
---
 drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)
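
For reference, a minimal userspace sketch (not part of the patch; the device
path and the 3584-byte write size are assumptions taken from the biosnoop log
above) that queries the same two queue limits the patch consults, via the
BLKSSZGET and BLKIOOPT ioctls (also exposed as
/sys/block/<dev>/queue/logical_block_size and optimal_io_size), and prints the
resulting last-page write size before and after the change:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
        const char *dev = "/dev/nvme0n1";       /* assumed device, adjust as needed */
        unsigned int last_page_size = 3584;     /* pre-patch write size from the log above */
        unsigned int io_opt = 0, old_size, new_size;
        int fd, lbs = 0;

        fd = open(dev, O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKSSZGET, &lbs) || ioctl(fd, BLKIOOPT, &io_opt))
                return 1;

        old_size = roundup(last_page_size, (unsigned int)lbs);
        new_size = io_opt > (unsigned int)lbs ?
                   roundup(last_page_size, io_opt) : old_size;

        printf("%s: logical block size %d, optimal I/O size %u\n", dev, lbs, io_opt);
        printf("last bitmap page write: %u bytes before, %u bytes after\n",
               old_size, new_size);
        close(fd);
        return 0;
}

On the device from the log above it should print 3584 bytes before and 4096
bytes after.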

Comments

Jonathan Derrick Feb. 9, 2023, 8:37 p.m. UTC | #1
Hi Song,

Any thoughts on this?

On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> From: Jon Derrick <jonathan.derrick@linux.dev>
> 
> If the bitmap space has enough room, size the I/O for the last bitmap
> page write to the optimal I/O size for the storage device. The expanded
> write is checked that it won't overrun the data or metadata.
> 
> This change helps increase performance by preventing unnecessary
> device-side read-mod-writes due to non-atomic write unit sizes.
> 
> Ex biosnoop log. Device lba size 512, optimal size 4k:
> Before:
> Time        Process        PID     Device      LBA        Size      Lat
> 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
> 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
> 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
> 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
> 
> After:
> Time        Process        PID     Device      LBA        Size      Lat
> 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
> 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
> 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
> 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02
> 
> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
> ---
>  drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>  1 file changed, 18 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> index e7cc6ba1b657..569297ea9b99 100644
> --- a/drivers/md/md-bitmap.c
> +++ b/drivers/md/md-bitmap.c
> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  	rdev = NULL;
>  	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>  		int size = PAGE_SIZE;
> +		int optimal_size = PAGE_SIZE;
>  		loff_t offset = mddev->bitmap_info.offset;
>  
>  		bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  			int last_page_size = store->bytes & (PAGE_SIZE-1);
>  			if (last_page_size == 0)
>  				last_page_size = PAGE_SIZE;
> -			size = roundup(last_page_size,
> -				       bdev_logical_block_size(bdev));
> +			size = roundup(last_page_size, bdev_logical_block_size(bdev));
> +			if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
> +				optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
> +			else
> +				optimal_size = size;
>  		}
> +
> +
>  		/* Just make sure we aren't corrupting data or
>  		 * metadata
>  		 */
> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  				goto bad_alignment;
>  		} else if (offset < 0) {
>  			/* DATA  BITMAP METADATA  */
> -			if (offset
> -			    + (long)(page->index * (PAGE_SIZE/512))
> -			    + size/512 > 0)
> +			loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
> +			if (size != optimal_size &&
> +			    off + optimal_size/512 <= 0)
> +				size = optimal_size;
> +			else if (off + size/512 > 0)
>  				/* bitmap runs in to metadata */
>  				goto bad_alignment;
>  			if (rdev->data_offset + mddev->dev_sectors
> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>  				goto bad_alignment;
>  		} else if (rdev->sb_start < rdev->data_offset) {
>  			/* METADATA BITMAP DATA */
> -			if (rdev->sb_start
> -			    + offset
> -			    + page->index*(PAGE_SIZE/512) + size/512
> -			    > rdev->data_offset)
> +			loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
> +			if (size != optimal_size &&
> +			    off + optimal_size/512 <= rdev->data_offset)
> +				size = optimal_size;
> +			else if (off + size/512 > rdev->data_offset)
>  				/* bitmap runs in to data */
>  				goto bad_alignment;
>  		} else {
Song Liu Feb. 10, 2023, 5:32 p.m. UTC | #2
Hi Jonathan,

On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
> Hi Song,
>
> Any thoughts on this?

I am really sorry that I missed this patch.

>
> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> > From: Jon Derrick <jonathan.derrick@linux.dev>
> >
> > If the bitmap space has enough room, size the I/O for the last bitmap
> > page write to the optimal I/O size for the storage device. The expanded
> > write is checked that it won't overrun the data or metadata.
> >
> > This change helps increase performance by preventing unnecessary
> > device-side read-mod-writes due to non-atomic write unit sizes.
> >
> > Ex biosnoop log. Device lba size 512, optimal size 4k:
> > Before:
> > Time        Process        PID     Device      LBA        Size      Lat
> > 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
> > 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
> > 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
> > 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
> >
> > After:
> > Time        Process        PID     Device      LBA        Size      Lat
> > 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
> > 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
> > 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
> > 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02

Do we see significant improvements from io benchmarks?

IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
such optimizations in the future.

Thanks,
Song

> >
> > Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
> > ---
> >  drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
> >  1 file changed, 18 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
> > index e7cc6ba1b657..569297ea9b99 100644
> > --- a/drivers/md/md-bitmap.c
> > +++ b/drivers/md/md-bitmap.c
> > @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> >       rdev = NULL;
> >       while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
> >               int size = PAGE_SIZE;
> > +             int optimal_size = PAGE_SIZE;
> >               loff_t offset = mddev->bitmap_info.offset;
> >
> >               bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
> > @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> >                       int last_page_size = store->bytes & (PAGE_SIZE-1);
> >                       if (last_page_size == 0)
> >                               last_page_size = PAGE_SIZE;
> > -                     size = roundup(last_page_size,
> > -                                    bdev_logical_block_size(bdev));
> > +                     size = roundup(last_page_size, bdev_logical_block_size(bdev));
> > +                     if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
> > +                             optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
> > +                     else
> > +                             optimal_size = size;
> >               }
> > +
> > +
> >               /* Just make sure we aren't corrupting data or
> >                * metadata
> >                */
> > @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> >                               goto bad_alignment;
> >               } else if (offset < 0) {
> >                       /* DATA  BITMAP METADATA  */
> > -                     if (offset
> > -                         + (long)(page->index * (PAGE_SIZE/512))
> > -                         + size/512 > 0)
> > +                     loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
> > +                     if (size != optimal_size &&
> > +                         off + optimal_size/512 <= 0)
> > +                             size = optimal_size;
> > +                     else if (off + size/512 > 0)
> >                               /* bitmap runs in to metadata */
> >                               goto bad_alignment;
> >                       if (rdev->data_offset + mddev->dev_sectors
> > @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
> >                               goto bad_alignment;
> >               } else if (rdev->sb_start < rdev->data_offset) {
> >                       /* METADATA BITMAP DATA */
> > -                     if (rdev->sb_start
> > -                         + offset
> > -                         + page->index*(PAGE_SIZE/512) + size/512
> > -                         > rdev->data_offset)
> > +                     loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
> > +                     if (size != optimal_size &&
> > +                         off + optimal_size/512 <= rdev->data_offset)
> > +                             size = optimal_size;
> > +                     else if (off + size/512 > rdev->data_offset)
> >                               /* bitmap runs in to data */
> >                               goto bad_alignment;
> >               } else {
Jonathan Derrick Feb. 16, 2023, 11:52 p.m. UTC | #3
On 2/10/2023 10:32 AM, Song Liu wrote:
> Hi Jonathan,
> 
> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
> <jonathan.derrick@linux.dev> wrote:
>>
>> Hi Song,
>>
>> Any thoughts on this?
> 
> I am really sorry that I missed this patch.
> 
>>
>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>
>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>> page write to the optimal I/O size for the storage device. The expanded
>>> write is checked that it won't overrun the data or metadata.
>>>
>>> This change helps increase performance by preventing unnecessary
>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>
>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>> Before:
>>> Time        Process        PID     Device      LBA        Size      Lat
>>> 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
>>> 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
>>> 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
>>> 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
>>>
>>> After:
>>> Time        Process        PID     Device      LBA        Size      Lat
>>> 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
>>> 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
>>> 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
>>> 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02
> 
> Do we see significant improvements from io benchmarks?
Yes. With lbaf=512, optimal i/o size=4k:

Without patch:
  write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
With patch:
  write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets

It's going to be different for different drives, but this was a drive where a
drive-side read-mod-write has a huge penalty.
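
If anyone wants to see that device-side penalty in isolation, below is a rough
standalone sketch (not from this thread; every parameter in it is an
assumption, and it overwrites the first 4 KiB of whatever path it is given, so
point it only at a scratch device or file). On a 512-byte-LBA drive it times
O_DIRECT writes of 3584 bytes against 4096 bytes at offset 0, which is roughly
the write pattern the last bitmap page sees before and after the patch:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double time_writes(int fd, const void *buf, size_t len, int iters)
{
        struct timespec t0, t1;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++) {
                if (pwrite(fd, buf, len, 0) != (ssize_t)len) {
                        perror("pwrite");
                        exit(1);
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
        const int iters = 10000;
        void *buf;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <scratch device or file>\n", argv[0]);
                return 1;
        }

        /* O_DIRECT needs an aligned buffer; 4096 covers both write sizes */
        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        memset(buf, 0xa5, 4096);

        /* WARNING: overwrites offset 0 of the given path */
        fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        printf("3584-byte writes: %.2fs for %d iterations\n",
               time_writes(fd, buf, 3584, iters), iters);
        printf("4096-byte writes: %.2fs for %d iterations\n",
               time_writes(fd, buf, 4096, iters), iters);

        close(fd);
        free(buf);
        return 0;
}

Build with something like gcc -O2 -o rmw-test rmw-test.c. A queue-depth-1 loop
like this only measures latency rather than IOPS, but the gap between the two
sizes should still show up on drives with an expensive internal read-mod-write.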

> 
> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
> such optimizations in the future.
Maybe. But many drives still ship with 512B formatting as default.

> 
> Thanks,
> Song
> 
>>>
>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>> ---
>>>  drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>>  1 file changed, 18 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>> index e7cc6ba1b657..569297ea9b99 100644
>>> --- a/drivers/md/md-bitmap.c
>>> +++ b/drivers/md/md-bitmap.c
>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>       rdev = NULL;
>>>       while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>>               int size = PAGE_SIZE;
>>> +             int optimal_size = PAGE_SIZE;
>>>               loff_t offset = mddev->bitmap_info.offset;
>>>
>>>               bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>                       int last_page_size = store->bytes & (PAGE_SIZE-1);
>>>                       if (last_page_size == 0)
>>>                               last_page_size = PAGE_SIZE;
>>> -                     size = roundup(last_page_size,
>>> -                                    bdev_logical_block_size(bdev));
>>> +                     size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>> +                     if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>> +                             optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>> +                     else
>>> +                             optimal_size = size;
>>>               }
>>> +
>>> +
>>>               /* Just make sure we aren't corrupting data or
>>>                * metadata
>>>                */
>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>                               goto bad_alignment;
>>>               } else if (offset < 0) {
>>>                       /* DATA  BITMAP METADATA  */
>>> -                     if (offset
>>> -                         + (long)(page->index * (PAGE_SIZE/512))
>>> -                         + size/512 > 0)
>>> +                     loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>> +                     if (size != optimal_size &&
>>> +                         off + optimal_size/512 <= 0)
>>> +                             size = optimal_size;
>>> +                     else if (off + size/512 > 0)
>>>                               /* bitmap runs in to metadata */
>>>                               goto bad_alignment;
>>>                       if (rdev->data_offset + mddev->dev_sectors
>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>                               goto bad_alignment;
>>>               } else if (rdev->sb_start < rdev->data_offset) {
>>>                       /* METADATA BITMAP DATA */
>>> -                     if (rdev->sb_start
>>> -                         + offset
>>> -                         + page->index*(PAGE_SIZE/512) + size/512
>>> -                         > rdev->data_offset)
>>> +                     loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>> +                     if (size != optimal_size &&
>>> +                         off + optimal_size/512 <= rdev->data_offset)
>>> +                             size = optimal_size;
>>> +                     else if (off + size/512 > rdev->data_offset)
>>>                               /* bitmap runs in to data */
>>>                               goto bad_alignment;
>>>               } else {
Paul Menzel Feb. 17, 2023, 1:21 p.m. UTC | #4
Dear Jonathan,


Thank you for your patch.

Am 17.02.23 um 00:52 schrieb Jonathan Derrick:

> On 2/10/2023 10:32 AM, Song Liu wrote:

>> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick wrote:

>>> Any thoughts on this?
>>
>> I am really sorry that I missed this patch.
>>
>>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>>
>>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>>> page write to the optimal I/O size for the storage device. The expanded
>>>> write is checked that it won't overrun the data or metadata.
>>>>
>>>> This change helps increase performance by preventing unnecessary
>>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>>
>>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>>> Before:
>>>> Time        Process        PID     Device      LBA        Size      Lat
>>>> 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
>>>> 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
>>>> 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
>>>> 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
>>>>
>>>> After:
>>>> Time        Process        PID     Device      LBA        Size      Lat
>>>> 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
>>>> 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
>>>> 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
>>>> 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02
>>
>> Do we see significant improvements from io benchmarks?
> 
> Yes. With lbaf=512, optimal i/o size=4k:
> 
> Without patch:
>    write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
> With patch:
>    write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
> 
> It's going to be different for different drives, but this was a drive where a
> drive-side read-mod-write has a huge penalty.

Nice. Can you share the drive model used for benchmarking? Also, maybe
add that to the commit message?

>> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
>> such optimizations in the future.
> 
> Maybe. But many drives still ship with 512B formatting as default.

Is that the problem on “desktop” HDDs, where `PHY-SEC` and `LOG-SEC` are 
different?

     $ lsblk -o NAME,MODEL,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME -d /dev/sda
     NAME MODEL              ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
     sda  TOSHIBA DT01ACA050         0   4096      0    4096     512    1 bfq        64 128    0B

(Though these are not used in a RAID here.)


Kind regards,

Paul


[1]: 
https://toshiba.semicon-storage.com/content/dam/toshiba-ss-v3/master/en/storage/product/internal-specialty/cHDD-DT01ACAxxx-Product-Overview.pdf


>>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>>> ---
>>>>   drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>>>   1 file changed, 18 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>>> index e7cc6ba1b657..569297ea9b99 100644
>>>> --- a/drivers/md/md-bitmap.c
>>>> +++ b/drivers/md/md-bitmap.c
>>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>        rdev = NULL;
>>>>        while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>>>                int size = PAGE_SIZE;
>>>> +             int optimal_size = PAGE_SIZE;
>>>>                loff_t offset = mddev->bitmap_info.offset;
>>>>
>>>>                bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>                        int last_page_size = store->bytes & (PAGE_SIZE-1);
>>>>                        if (last_page_size == 0)
>>>>                                last_page_size = PAGE_SIZE;
>>>> -                     size = roundup(last_page_size,
>>>> -                                    bdev_logical_block_size(bdev));
>>>> +                     size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>>> +                     if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>>> +                             optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>>> +                     else
>>>> +                             optimal_size = size;
>>>>                }
>>>> +
>>>> +
>>>>                /* Just make sure we aren't corrupting data or
>>>>                 * metadata
>>>>                 */
>>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>                                goto bad_alignment;
>>>>                } else if (offset < 0) {
>>>>                        /* DATA  BITMAP METADATA  */
>>>> -                     if (offset
>>>> -                         + (long)(page->index * (PAGE_SIZE/512))
>>>> -                         + size/512 > 0)
>>>> +                     loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>>> +                     if (size != optimal_size &&
>>>> +                         off + optimal_size/512 <= 0)
>>>> +                             size = optimal_size;
>>>> +                     else if (off + size/512 > 0)
>>>>                                /* bitmap runs in to metadata */
>>>>                                goto bad_alignment;
>>>>                        if (rdev->data_offset + mddev->dev_sectors
>>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>                                goto bad_alignment;
>>>>                } else if (rdev->sb_start < rdev->data_offset) {
>>>>                        /* METADATA BITMAP DATA */
>>>> -                     if (rdev->sb_start
>>>> -                         + offset
>>>> -                         + page->index*(PAGE_SIZE/512) + size/512
>>>> -                         > rdev->data_offset)
>>>> +                     loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>>> +                     if (size != optimal_size &&
>>>> +                         off + optimal_size/512 <= rdev->data_offset)
>>>> +                             size = optimal_size;
>>>> +                     else if (off + size/512 > rdev->data_offset)
>>>>                                /* bitmap runs in to data */
>>>>                                goto bad_alignment;
>>>>                } else {
Jonathan Derrick Feb. 17, 2023, 6:22 p.m. UTC | #5
On 2/17/2023 6:21 AM, Paul Menzel wrote:
> Dear Jonathan,
> 
> 
> Thank you for your patch.
> 
> Am 17.02.23 um 00:52 schrieb Jonathan Derrick:
> 
>> On 2/10/2023 10:32 AM, Song Liu wrote:
> 
>>> On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick wrote:
> 
>>>> Any thoughts on this?
>>>
>>> I am really sorry that I missed this patch.
>>>
>>>> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
>>>>> From: Jon Derrick <jonathan.derrick@linux.dev>
>>>>>
>>>>> If the bitmap space has enough room, size the I/O for the last bitmap
>>>>> page write to the optimal I/O size for the storage device. The expanded
>>>>> write is checked that it won't overrun the data or metadata.
>>>>>
>>>>> This change helps increase performance by preventing unnecessary
>>>>> device-side read-mod-writes due to non-atomic write unit sizes.
>>>>>
>>>>> Ex biosnoop log. Device lba size 512, optimal size 4k:
>>>>> Before:
>>>>> Time        Process        PID     Device      LBA        Size      Lat
>>>>> 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
>>>>> 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
>>>>> 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
>>>>> 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
>>>>>
>>>>> After:
>>>>> Time        Process        PID     Device      LBA        Size      Lat
>>>>> 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
>>>>> 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
>>>>> 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
>>>>> 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02
>>>
>>> Do we see significant improvements from io benchmarks?
>>
>> Yes. With lbaf=512, optimal i/o size=4k:
>>
>> Without patch:
>>    write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
>> With patch:
>>    write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets
>>
>> It's going to be different for different drives, but this was a drive where a
>> drive-side read-mod-write has a huge penalty.
> 
> Nice. Can you share the drive model used for benchmarking? Also, maybe add that to the commit message?
> 
It's an Intel/Solidigm P5520, but other drives show similar results.
The mdraid configuration ended up with 2 bitmap pages, which meant that the 'last' bitmap
page was exercised roughly 50% of the time in the random workload, leading to this
high-latency result.


Here's a test on another system.
Raid10, Chunk-size 64M, bitmap-size 57228 bits
Without patch:
  write: IOPS=1676, BW=6708KiB/s (6869kB/s)(393MiB/60001msec); 0 zone resets

With patch:
  write: IOPS=15.7k, BW=61.4MiB/s (64.4MB/s)(3683MiB/60001msec); 0 zone resets


Biosnoop:
Without patch:
Time        Process        PID     Device      LBA        Size      Lat
1.410377    md0_raid10     6900    nvme0n1   W 16         4096      0.02
1.410387    md0_raid10     6900    nvme2n1   W 16         4096      0.02
1.410374    md0_raid10     6900    nvme3n1   W 16         4096      0.01
1.410381    md0_raid10     6900    nvme1n1   W 16         4096      0.02
1.410411    md0_raid10     6900    nvme1n1   W 115346512  4096      0.01
1.410418    md0_raid10     6900    nvme0n1   W 115346512  4096      0.02
1.410915    md0_raid10     6900    nvme2n1   W 24         3584      0.43
1.410935    md0_raid10     6900    nvme3n1   W 24         3584      0.45
1.411124    md0_raid10     6900    nvme1n1   W 24         3584      0.64
1.411147    md0_raid10     6900    nvme0n1   W 24         3584      0.66
1.411176    md0_raid10     6900    nvme3n1   W 2019022184 4096      0.01
1.411189    md0_raid10     6900    nvme2n1   W 2019022184 4096      0.02

With patch:
Time        Process        PID     Device      LBA        Size      Lat
5.747193    md0_raid10     727     nvme0n1   W 16         4096      0.01
5.747192    md0_raid10     727     nvme1n1   W 16         4096      0.02
5.747195    md0_raid10     727     nvme3n1   W 16         4096      0.01
5.747202    md0_raid10     727     nvme2n1   W 16         4096      0.02
5.747229    md0_raid10     727     nvme3n1   W 1196223704 4096      0.02
5.747224    md0_raid10     727     nvme0n1   W 1196223704 4096      0.01
5.747279    md0_raid10     727     nvme0n1   W 24         4096      0.01
5.747279    md0_raid10     727     nvme1n1   W 24         4096      0.02
5.747284    md0_raid10     727     nvme3n1   W 24         4096      0.02
5.747291    md0_raid10     727     nvme2n1   W 24         4096      0.02
5.747314    md0_raid10     727     nvme2n1   W 2234636712 4096      0.01
5.747317    md0_raid10     727     nvme1n1   W 2234636712 4096      0.02
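
(For reference: 57228 bits is ceil(57228 / 8) = 7154 bytes of bitmap, plus the
256-byte bitmap superblock if it is stored in-line as usual, i.e. 7410 bytes.
That is two bitmap pages with a 3314-byte tail, which rounds up to the
3584-byte writes in the 'without patch' trace on a 512-byte logical block, and
to the 4096-byte writes in the 'with patch' trace with a 4k optimal I/O size.
Treat the superblock size as an assumption; the point is only that the tail
page is well under 4k in this configuration.)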



>>> IIUC, fewer future HDDs will use 512B LBA size. We probably don't need
>>> such optimizations in the future.
>>
>> Maybe. But many drives still ship with 512B formatting as default.
> 
> Is that the problem on “desktop” HDDs, where `PHY-SEC` and `LOG-SEC` are different?
> 
>     $ lsblk -o NAME,MODEL,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME -d /dev/sda
>     NAME MODEL              ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
>     sda  TOSHIBA DT01ACA050         0   4096      0    4096     512 1 bfq        64 128    0B
> 
> (Though these are not used in a RAID here.)
In the test scenario, both PHY-SEC and LOG-SEC are 512; however, MIN-IO and OPT-IO are 4k.
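
Note that for a drive like the one above that reports OPT-IO as 0, bdev_io_opt()
returns 0, which is not greater than the logical block size, so the patch keeps
optimal_size equal to size and the last-page write is unchanged.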

> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [1]: https://toshiba.semicon-storage.com/content/dam/toshiba-ss-v3/master/en/storage/product/internal-specialty/cHDD-DT01ACAxxx-Product-Overview.pdf
> 
> 
>>>>> Signed-off-by: Jon Derrick <jonathan.derrick@linux.dev>
>>>>> ---
>>>>>   drivers/md/md-bitmap.c | 27 ++++++++++++++++++---------
>>>>>   1 file changed, 18 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
>>>>> index e7cc6ba1b657..569297ea9b99 100644
>>>>> --- a/drivers/md/md-bitmap.c
>>>>> +++ b/drivers/md/md-bitmap.c
>>>>> @@ -220,6 +220,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>>        rdev = NULL;
>>>>>        while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
>>>>>                int size = PAGE_SIZE;
>>>>> +             int optimal_size = PAGE_SIZE;
>>>>>                loff_t offset = mddev->bitmap_info.offset;
>>>>>
>>>>>                bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
>>>>> @@ -228,9 +229,14 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>>                        int last_page_size = store->bytes & (PAGE_SIZE-1);
>>>>>                        if (last_page_size == 0)
>>>>>                                last_page_size = PAGE_SIZE;
>>>>> -                     size = roundup(last_page_size,
>>>>> -                                    bdev_logical_block_size(bdev));
>>>>> +                     size = roundup(last_page_size, bdev_logical_block_size(bdev));
>>>>> +                     if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
>>>>> +                             optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
>>>>> +                     else
>>>>> +                             optimal_size = size;
>>>>>                }
>>>>> +
>>>>> +
>>>>>                /* Just make sure we aren't corrupting data or
>>>>>                 * metadata
>>>>>                 */
>>>>> @@ -246,9 +252,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>>                                goto bad_alignment;
>>>>>                } else if (offset < 0) {
>>>>>                        /* DATA  BITMAP METADATA  */
>>>>> -                     if (offset
>>>>> -                         + (long)(page->index * (PAGE_SIZE/512))
>>>>> -                         + size/512 > 0)
>>>>> +                     loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
>>>>> +                     if (size != optimal_size &&
>>>>> +                         off + optimal_size/512 <= 0)
>>>>> +                             size = optimal_size;
>>>>> +                     else if (off + size/512 > 0)
>>>>>                                /* bitmap runs in to metadata */
>>>>>                                goto bad_alignment;
>>>>>                        if (rdev->data_offset + mddev->dev_sectors
>>>>> @@ -257,10 +265,11 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
>>>>>                                goto bad_alignment;
>>>>>                } else if (rdev->sb_start < rdev->data_offset) {
>>>>>                        /* METADATA BITMAP DATA */
>>>>> -                     if (rdev->sb_start
>>>>> -                         + offset
>>>>> -                         + page->index*(PAGE_SIZE/512) + size/512
>>>>> -                         > rdev->data_offset)
>>>>> +                     loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
>>>>> +                     if (size != optimal_size &&
>>>>> +                         off + optimal_size/512 <= rdev->data_offset)
>>>>> +                             size = optimal_size;
>>>>> +                     else if (off + size/512 > rdev->data_offset)
>>>>>                                /* bitmap runs in to data */
>>>>>                                goto bad_alignment;
>>>>>                } else {
Song Liu Feb. 21, 2023, 5:27 a.m. UTC | #6
On Thu, Feb 16, 2023 at 3:52 PM Jonathan Derrick
<jonathan.derrick@linux.dev> wrote:
>
>
>
> On 2/10/2023 10:32 AM, Song Liu wrote:
> > Hi Jonathan,
> >
> > On Thu, Feb 9, 2023 at 12:38 PM Jonathan Derrick
> > <jonathan.derrick@linux.dev> wrote:
> >>
> >> Hi Song,
> >>
> >> Any thoughts on this?
> >
> > I am really sorry that I missed this patch.
> >
> >>
> >> On 1/17/2023 5:53 PM, Jonathan Derrick wrote:
> >>> From: Jon Derrick <jonathan.derrick@linux.dev>
> >>>
> >>> If the bitmap space has enough room, size the I/O for the last bitmap
> >>> page write to the optimal I/O size for the storage device. The expanded
> >>> write is checked that it won't overrun the data or metadata.
> >>>
> >>> This change helps increase performance by preventing unnecessary
> >>> device-side read-mod-writes due to non-atomic write unit sizes.
> >>>
> >>> Ex biosnoop log. Device lba size 512, optimal size 4k:
> >>> Before:
> >>> Time        Process        PID     Device      LBA        Size      Lat
> >>> 0.843734    md0_raid10     5267    nvme0n1   W 24         3584      1.17
> >>> 0.843933    md0_raid10     5267    nvme1n1   W 24         3584      1.36
> >>> 0.843968    md0_raid10     5267    nvme1n1   W 14207939968 4096      0.01
> >>> 0.843979    md0_raid10     5267    nvme0n1   W 14207939968 4096      0.02
> >>>
> >>> After:
> >>> Time        Process        PID     Device      LBA        Size      Lat
> >>> 18.374244   md0_raid10     6559    nvme0n1   W 24         4096      0.01
> >>> 18.374253   md0_raid10     6559    nvme1n1   W 24         4096      0.01
> >>> 18.374300   md0_raid10     6559    nvme0n1   W 11020272296 4096      0.01
> >>> 18.374306   md0_raid10     6559    nvme1n1   W 11020272296 4096      0.02
> >
> > Do we see significant improvements from io benchmarks?
> Yes. With lbaf=512, optimal i/o size=4k:
>
> Without patch:
>   write: IOPS=1570, BW=6283KiB/s (6434kB/s)(368MiB/60001msec); 0 zone resets
> With patch:
>   write: IOPS=59.7k, BW=233MiB/s (245MB/s)(13.7GiB/60001msec); 0 zone resets

The difference is much bigger than I expected. Given this big improvement, I
think we should ship it. Unfortunately, we are too late for 6.3.
Let's plan to ship this in the 6.4 release. I am on vacation this
week. I will work on this afterwards.

Thanks,
Song

Patch

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index e7cc6ba1b657..569297ea9b99 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -220,6 +220,7 @@  static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 	rdev = NULL;
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 		int size = PAGE_SIZE;
+		int optimal_size = PAGE_SIZE;
 		loff_t offset = mddev->bitmap_info.offset;
 
 		bdev = (rdev->meta_bdev) ? rdev->meta_bdev : rdev->bdev;
@@ -228,9 +229,14 @@  static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 			int last_page_size = store->bytes & (PAGE_SIZE-1);
 			if (last_page_size == 0)
 				last_page_size = PAGE_SIZE;
-			size = roundup(last_page_size,
-				       bdev_logical_block_size(bdev));
+			size = roundup(last_page_size, bdev_logical_block_size(bdev));
+			if (bdev_io_opt(bdev) > bdev_logical_block_size(bdev))
+				optimal_size = roundup(last_page_size, bdev_io_opt(bdev));
+			else
+				optimal_size = size;
 		}
+
+
 		/* Just make sure we aren't corrupting data or
 		 * metadata
 		 */
@@ -246,9 +252,11 @@  static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				goto bad_alignment;
 		} else if (offset < 0) {
 			/* DATA  BITMAP METADATA  */
-			if (offset
-			    + (long)(page->index * (PAGE_SIZE/512))
-			    + size/512 > 0)
+			loff_t off = offset + (long)(page->index * (PAGE_SIZE/512));
+			if (size != optimal_size &&
+			    off + optimal_size/512 <= 0)
+				size = optimal_size;
+			else if (off + size/512 > 0)
 				/* bitmap runs in to metadata */
 				goto bad_alignment;
 			if (rdev->data_offset + mddev->dev_sectors
@@ -257,10 +265,11 @@  static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				goto bad_alignment;
 		} else if (rdev->sb_start < rdev->data_offset) {
 			/* METADATA BITMAP DATA */
-			if (rdev->sb_start
-			    + offset
-			    + page->index*(PAGE_SIZE/512) + size/512
-			    > rdev->data_offset)
+			loff_t off = rdev->sb_start + offset + page->index*(PAGE_SIZE/512);
+			if (size != optimal_size &&
+			    off + optimal_size/512 <= rdev->data_offset)
+				size = optimal_size;
+			else if (off + size/512 > rdev->data_offset)
 				/* bitmap runs in to data */
 				goto bad_alignment;
 		} else {