md/raid5: Improve performance for sequential IO

Message ID: 20230417171537.17899-1-jack@suse.cz (mailing list archive)
State: New, archived
Delegated to: Song Liu
Series: md/raid5: Improve performance for sequential IO

Commit Message

Jan Kara April 17, 2023, 5:15 p.m. UTC
Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
order in which requests for the underlying disks are created. For large
sequential IO, adding requests frequently races with the md_raid5 thread
submitting bios to the underlying disks, so the intermediate states of the
new creation order change the IO pattern and produce more, smaller,
discontiguous requests. For RAID5 on top of three rotational disks, our
performance testing revealed that this causes a regression in write
throughput:

iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R

before 7e55c60acfbb:
              KB  reclen   write rewrite    read    reread
       131072000       4  493670  525964   524575   513384
       131072000       8  540467  532880   512028   513703

after 7e55c60acfbb:
              KB  reclen   write rewrite    read    reread
       131072000       4  421785  456184   531278   509248
       131072000       8  459283  456354   528449   543834

To reduce the number of discontiguous requests, we can start generating
requests with the stripe that has the lowest chunk offset, as it has the
best chance of being adjacent to previously queued IO. This improves the
performance to:
              KB  reclen   write rewrite    read    reread
       131072000       4  497682  506317   518043   514559
       131072000       8  514048  501886   506453   504319

restoring a big part of the regression.

Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

I'm by no means a raid5 expert, but this is what I was able to come up with. Any
opinions or ideas on how to fix the problem in a better way?
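
To illustrate the intent of the change, here is a rough userspace sketch of
the chunk arithmetic (an assumption-laden model, not the kernel code: parity
rotation is ignored, so data chunks simply map round-robin to the data disks,
and the helper name is made up for this example; the real code below uses
raid5_compute_sector() and skips parity slots):

/* Simplified model of "start issuing from the lowest chunk offset". */
#include <stdio.h>

typedef unsigned long long sector_t;

static sector_t lowest_chunk_sector(sector_t bio_sector, unsigned int bio_sectors,
				    unsigned int sectors_per_chunk,
				    unsigned int data_disks)
{
	/* offset of the bio's first sector within its data chunk */
	unsigned int chunk_offset = bio_sector % sectors_per_chunk;
	/* which data disk of the stripe row holds that chunk (no parity rotation) */
	unsigned int dd_idx = (bio_sector / sectors_per_chunk) % data_disks;

	/* bio stays within one chunk: nothing to reorder */
	if (sectors_per_chunk - chunk_offset >= bio_sectors)
		return bio_sector;
	/* bio does not reach another data disk in this stripe row */
	if (dd_idx + 1 >= data_disks)
		return bio_sector;
	/* otherwise start with the next chunk, which begins at chunk offset 0 */
	return bio_sector + sectors_per_chunk - chunk_offset;
}

int main(void)
{
	/* 3-disk RAID5 => 2 data disks per row; 512 KiB chunks = 1024 sectors */
	printf("start from sector %llu\n",
	       lowest_chunk_sector(1000, 2048, 1024, 2));	/* prints 1024 */
	return 0;
}

In this example the bio starts 1000 sectors into a 1024-sector chunk, so only
24 sectors belong to the current chunk while everything from sector 1024 on
starts a fresh chunk at offset 0; issuing stripes from sector 1024 first gives
the best chance that the resulting per-disk requests line up with IO generated
from the previous sequential bio.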

Comments

Logan Gunthorpe April 19, 2023, 7:04 p.m. UTC | #1
On 2023-04-17 11:15, Jan Kara wrote:
> [...]
>
> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good to me. I ran it through some of the functional tests I used
to develop the patch in question and found no issues.

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

> ---
>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
> 
> I'm by no means a raid5 expert, but this is what I was able to come up with. Any
> opinions or ideas on how to fix the problem in a better way?

The other option would be to revert to the old method for spinning disks
and use the pivot option only on SSDs. The pivot optimization really
only applies at IO speeds that can be achieved by fast solid state disks
anyway, and there isn't likely enough IOPS possible on spinning disks to
get enough lock contention that the optimization would provide any benefit.

So it could make sense to just fall back to the old method of preparing
the stripes in logical block order if we are running on spinning disks.
Though, that might be a bit more involved than what this patch does. So
I think this is probably a good approach, unless we want to recover more
of the lost throughput.

Logan
Song Liu April 20, 2023, 5:26 a.m. UTC | #2
On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2023-04-17 11:15, Jan Kara wrote:
> > [...]
>
> Looks good to me. I ran it through some of the functional tests I used
> to develop the patch in question and found no issues.
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Thanks Jan and Logan! I will apply this to md-next. But it may not make
6.4 release, as we are already at rc7.

>
> > ---
> >  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 44 insertions(+), 1 deletion(-)
> >
> > I'm by no means a raid5 expert, but this is what I was able to come up with. Any
> > opinions or ideas on how to fix the problem in a better way?
>
> The other option would be to revert to the old method for spinning disks
> and use the pivot option only on SSDs. The pivot optimization really
> only applies at IO speeds that can be achieved by fast solid state disks
> anyway, and there isn't likely enough IOPS possible on spinning disks to
> get enough lock contention that the optimization would provide any benefit.
>
> So it could make sense to just fall back to the old method of preparing
> the stripes in logical block order if we are running on spinning disks.
> Though, that might be a bit more involved than what this patch does. So
> I think this is probably a good approach, unless we want to recover more
> of the lost throughput.

How about we only do the optimization in this patch for spinning disks?
Something like:

        if (likely(conf->reshape_progress == MaxSector) &&
            !blk_queue_nonrot(mddev->queue))
                logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);

Thanks,
Song
Jan Kara April 20, 2023, 11:26 a.m. UTC | #3
On Wed 19-04-23 22:26:07, Song Liu wrote:
> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> > On 2023-04-17 11:15, Jan Kara wrote:
> > > [...]
> >
> > Looks good to me. I ran it through some of the functional tests I used
> > to develop the patch in question and found no issues.
> >
> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> 
> Thanks Jan and Logan! I will apply this to md-next. But it may not make
> 6.4 release, as we are already at rc7.

OK, sure, this is not a critical issue.

> > > ---
> > >  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 44 insertions(+), 1 deletion(-)
> > >
> > > I'm by no means a raid5 expert, but this is what I was able to come up with. Any
> > > opinions or ideas on how to fix the problem in a better way?
> >
> > The other option would be to revert to the old method for spinning disks
> > and use the pivot option only on SSDs. The pivot optimization really
> > only applies at IO speeds that can be achieved by fast solid state disks
> > anyway, and there isn't likely enough IOPS possible on spinning disks to
> > get enough lock contention that the optimization would provide any benefit.
> >
> > So it could make sense to just fall back to the old method of preparing
> > the stripes in logical block order if we are running on spinning disks.
> > Though, that might be a bit more involved than what this patch does. So
> > I think this is probably a good approach, unless we want to recover more
> > of the lost throughput.
> 
> How about we only do the optimization in this patch for spinning disks?
> Something like:
> 
>         if (likely(conf->reshape_progress == MaxSector) &&
>             !blk_queue_nonrot(mddev->queue))
>                 logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);

Yeah, makes sense. On SSD disks I was not able to observe any adverse
effects of the different stripe order. Will you update the patch or should
I respin it?

								Honza
Logan Gunthorpe April 20, 2023, 3:54 p.m. UTC | #4
On 2023-04-20 05:26, Jan Kara wrote:
> On Wed 19-04-23 22:26:07, Song Liu wrote:
>> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>>> On 2023-04-17 11:15, Jan Kara wrote:
>>>> [...]
>>>
>>> Looks good to me. I ran it through some of the functional tests I used
>>> to develop the patch in question and found no issues.
>>>
>>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>
>> Thanks Jan and Logan! I will apply this to md-next. But it may not make
>> 6.4 release, as we are already at rc7.
> 
> OK, sure, this is not a critical issue.
> 
>>>> ---
>>>>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>  1 file changed, 44 insertions(+), 1 deletion(-)
>>>>
>>>> I'm by no means a raid5 expert, but this is what I was able to come up with. Any
>>>> opinions or ideas on how to fix the problem in a better way?
>>>
>>> The other option would be to revert to the old method for spinning disks
>>> and use the pivot option only on SSDs. The pivot optimization really
>>> only applies at IO speeds that can be achieved by fast solid state disks
>>> anyway, and there isn't likely enough IOPS possible on spinning disks to
>>> get enough lock contention that the optimization would provide any benefit.
>>>
>>> So it could make sense to just fall back to the old method of preparing
>>> the stripes in logical block order if we are running on spinning disks.
>>> Though, that might be a bit more involved than what this patch does. So
>>> I think this is probably a good approach, unless we want to recover more
>>> of the lost throughput.
>>
>> How about we only do the optimization in this patch for spinning disks?
>> Something like:
>>
>>         if (likely(conf->reshape_progress == MaxSector) &&
>>             !blk_queue_nonrot(mddev->queue))
>>                 logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
> 
> Yeah, makes sense. On SSD disks I was not able to observe any adverse
> effects of the different stripe order. Will you update the patch or should
> I respin it?

Does it make sense? If SSDs work fine with the new stripe order, why do
things differently for them? So I don't see a benefit in making the fix
apply only to rotational devices. It's my original change that could be
made conditional on non-rotational devices, but that's much more involved.

Logan
Song Liu April 20, 2023, 8:07 p.m. UTC | #5
On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2023-04-20 05:26, Jan Kara wrote:
> > On Wed 19-04-23 22:26:07, Song Liu wrote:
> >> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >>> On 2023-04-17 11:15, Jan Kara wrote:
> >>>> [...]
> >>>
> >>> Looks good to me. I ran it through some of the functional tests I used
> >>> to develop the patch in question and found no issues.
> >>>
> >>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> >>
> >> Thanks Jan and Logan! I will apply this to md-next. But it may not make
> >> 6.4 release, as we are already at rc7.
> >
> > OK, sure, this is not a critical issue.
> >
> >>>> ---
> >>>>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
> >>>>  1 file changed, 44 insertions(+), 1 deletion(-)
> >>>>
> >>>> I'm by no means a raid5 expert, but this is what I was able to come up with. Any
> >>>> opinions or ideas on how to fix the problem in a better way?
> >>>
> >>> The other option would be to revert to the old method for spinning disks
> >>> and use the pivot option only on SSDs. The pivot optimization really
> >>> only applies at IO speeds that can be achieved by fast solid state disks
> >>> anyway, and there isn't likely enough IOPS possible on spinning disks to
> >>> get enough lock contention that the optimization would provide any benefit.
> >>>
> >>> So it could make sense to just fall back to the old method of preparing
> >>> the stripes in logical block order if we are running on spinning disks.
> >>> Though, that might be a bit more involved than what this patch does. So
> >>> I think this is probably a good approach, unless we want to recover more
> >>> of the lost throughput.
> >>
> >> How about we only do the optimization in this patch for spinning disks?
> >> Something like:
> >>
> >>         if (likely(conf->reshape_progress == MaxSector) &&
> >>             !blk_queue_nonrot(mddev->queue))
> >>                 logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
> >
> > Yeah, makes sense. On SSD disks I was not able to observe any adverse
> > effects of the different stripe order. Will you update the patch or should
> > I respin it?
>
> Does it make sense? If SSDs work fine with the new stripe order, why do
> things differently for them? So I don't see a benefit in making the fix
> apply only to rotational devices. It's my original change that could be
> made conditional on non-rotational devices, but that's much more involved.

I am hoping to make raid5_make_request() a little faster for non-rotational
devices. We may not easily observe a difference in performance, but things
add up. Does this make sense?

Thanks,
Song
Logan Gunthorpe April 20, 2023, 8:10 p.m. UTC | #6
On 2023-04-20 14:07, Song Liu wrote:
> On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>>
>> On 2023-04-20 05:26, Jan Kara wrote:
>>> On Wed 19-04-23 22:26:07, Song Liu wrote:
>>>> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>>>>> On 2023-04-17 11:15, Jan Kara wrote:
>>>>>> [...]
>>>>>
>>>>> Looks good to me. I ran it through some of the functional tests I used
>>>>> to develop the patch in question and found no issues.
>>>>>
>>>>> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
>>>>
>>>> Thanks Jan and Logan! I will apply this to md-next. But it may not make
>>>> 6.4 release, as we are already at rc7.
>>>
>>> OK, sure, this is not a critical issue.
>>>
>>>>>> ---
>>>>>>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>>>>>>  1 file changed, 44 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> I'm by no means a raid5 expert, but this is what I was able to come up with. Any
>>>>>> opinions or ideas on how to fix the problem in a better way?
>>>>>
>>>>> The other option would be to revert to the old method for spinning disks
>>>>> and use the pivot option only on SSDs. The pivot optimization really
>>>>> only applies at IO speeds that can be achieved by fast solid state disks
>>>>> anyway, and there isn't likely enough IOPS possible on spinning disks to
>>>>> get enough lock contention that the optimization would provide any benefit.
>>>>>
>>>>> So it could make sense to just fall back to the old method of preparing
>>>>> the stripes in logical block order if we are running on spinning disks.
>>>>> Though, that might be a bit more involved than what this patch does. So
>>>>> I think this is probably a good approach, unless we want to recover more
>>>>> of the lost throughput.
>>>>
>>>> How about we only do the optimization in this patch for spinning disks?
>>>> Something like:
>>>>
>>>>         if (likely(conf->reshape_progress == MaxSector) &&
>>>>             !blk_queue_nonrot(mddev->queue))
>>>>                 logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
>>>
>>> Yeah, makes sense. On SSD disks I was not able to observe any adverse
>>> effects of the different stripe order. Will you update the patch or should
>>> I respin it?
>>
>> Does it make sense? If SSDs work fine with the new stripe order, why do
>> things differently for them? So I don't see a benefit in making the fix
>> apply only to rotational devices. It's my original change that could be
>> made conditional on non-rotational devices, but that's much more involved.
> 
> I am hoping to make raid5_make_request() a little faster for non-rotational
> devices. We may not easily observe a difference in performance, but things
> add up. Does this make sense?

I guess. But without a performance test that shows that it makes an
improvement, I'm hesitant about that. It could also be that it helps a
tiny bit for non-rotational disks, but we just don't know.

Unfortunately, I don't have the time right now to do these performance
tests.

Logan
Christoph Hellwig April 24, 2023, 6:45 a.m. UTC | #7
On Thu, Apr 20, 2023 at 02:10:02PM -0600, Logan Gunthorpe wrote:
> > I am hoping to make raid5_make_request() a little faster for non-rotational
> > devices. We may not easily observe a difference in performance, but things
> > add up. Does this make sense?
> 
> I guess. But without a performance test that shows that it makes an
> improvement, I'm hesitant about that. It could also be that it helps a
> tiny bit for non-rotational disks, but we just don't know.
> 
> Unfortunately, I don't have the time right now to do these performance
> tests.

FYI, SSDs in general do prefer sequential write streams. For most of them
you won't see a difference in write performance itself, but it will help
with reducing GC overhead later on.
Logan Gunthorpe April 24, 2023, 3:40 p.m. UTC | #8
On 2023-04-24 00:45, Christoph Hellwig wrote:
> On Thu, Apr 20, 2023 at 02:10:02PM -0600, Logan Gunthorpe wrote:
>>> I am hoping to make raid5_make_request() a little faster for non-rotational
>>> devices. We may not easily observe a difference in performance, but things
>>> add up. Does this make sense?
>>
>> I guess. But without a performance test that shows that it makes an
>> improvement, I'm hesitant about that. It could also be that it helps a
>> tiny bit for non-rotational disks, but we just don't know.
>>
>> Unfortunately, I don't have the time right now to do these performance
>> tests.
> 
> FYI, SSDs in general do prefer sequential write streams. For most of them
> you won't see a difference in write performance itself, but it will help
> with reducing GC overhead later on.

Thanks. Yes, my colleague was able to run performance testing on this
patch and didn't find any degradation with Jan's optimization turned on.
So I don't think it's worth doing this only for rotational disks and
Jan's original patch makes sense.

Logan
Carlos Carvalho April 24, 2023, 8:24 p.m. UTC | #9
Jan Kara (jack@suse.cz) wrote on Mon, Apr 17, 2023 at 02:15:37PM -03:
> Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> order in which requests for underlying disks are created.

In which version was this first applied?

We've observed a large drop in disk performance since 4.* to the point that
6.1.* is almost unusable for some tasks. For example, we have 2 backup
machines, one using raid6/ext4 and another using zfs. The backup of a third
machine with a filesystem with ~25M inodes takes 6x-8x longer on the one using
raid6/ext4 than on the one using zfs... This is during the rsync phase.
Afterwards it removes old trees and the rm with raid6 takes eons even though
the disks are not at all busy (as shown by sar). Running several rm in parallel
speeds things up a lot, showing the problem is not in the disks.
Song Liu April 24, 2023, 10:30 p.m. UTC | #10
On Mon, Apr 24, 2023 at 8:41 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2023-04-24 00:45, Christoph Hellwig wrote:
> > On Thu, Apr 20, 2023 at 02:10:02PM -0600, Logan Gunthorpe wrote:
> >>> I am hoping to make raid5_make_request() a little faster for non-rotational
> >>> devices. We may not easily observe a difference in performance, but things
> >>> add up. Does this make sense?
> >>
> >> I guess. But without a performance test that shows that it makes an
> >> improvement, I'm hesitant about that. It could also be that it helps a
> >> tiny bit for non-rotational disks, but we just don't know.
> >>
> >> Unfortunately, I don't have the time right now to do these performance
> >> tests.
> >
> > FYI, SSDs in general do prefer sequential write streams. For most of them
> > you won't see a difference in write performance itself, but it will help
> > with reducing GC overhead later on.

Yeah, this makes sense.

>
> Thanks. Yes, my colleague was able to run performance testing on this
> patch and didn't find any degradation with Jan's optimization turned on.
> So I don't think it's worth doing this only for rotational disks and
> Jan's original patch makes sense.

I will ship Jan's original patch.

Thanks,
Song
Song Liu April 24, 2023, 10:36 p.m. UTC | #11
On Mon, Apr 24, 2023 at 1:25 PM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> Jan Kara (jack@suse.cz) wrote on Mon, Apr 17, 2023 at 02:15:37PM -03:
> > Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> > order in which requests for underlying disks are created.
>
> In which version was this first applied?

7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") is in 6.0+ kernels.

>
> We've observed a large drop in disk performance since 4.* to the point that
> 6.1.* is almost unusable for some tasks. For example, we have 2 backup
> machines, one using raid6/ext4 and another using zfs. The backup of a third
> machine with a filesystem with ~25M inodes takes 6x-8x longer on the one using
> raid6/ext4 than on the one using zfs... This is during the rsync phase.
> Afterwards it removes old trees and the rm with raid6 takes eons even though
> the disks are not at all busy (as shown by sar). Running several rm in parallel
> speeds things up a lot, showing the problem is not in the disks.

I am hoping to look into raid performance soon. But I cannot promise anything
at the moment.

Thanks,
Song

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b820b81d8c2..f787c9e5b10e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6079,6 +6079,38 @@  static enum stripe_result make_stripe_request(struct mddev *mddev,
 	return ret;
 }
 
+/*
+ * If the bio covers multiple data disks, find sector within the bio that has
+ * the lowest chunk offset in the first chunk.
+ */
+static sector_t raid5_bio_lowest_chunk_sector(struct r5conf *conf,
+					      struct bio *bi)
+{
+	int sectors_per_chunk = conf->chunk_sectors;
+	int raid_disks = conf->raid_disks;
+	int dd_idx;
+	struct stripe_head sh;
+	unsigned int chunk_offset;
+	sector_t r_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
+	sector_t sector;
+
+	/* We pass in fake stripe_head to get back parity disk numbers */
+	sector = raid5_compute_sector(conf, r_sector, 0, &dd_idx, &sh);
+	chunk_offset = sector_div(sector, sectors_per_chunk);
+	if (sectors_per_chunk - chunk_offset >= bio_sectors(bi))
+		return r_sector;
+	/*
+	 * Bio crosses to the next data disk. Check whether it's in the same
+	 * chunk.
+	 */
+	dd_idx++;
+	while (dd_idx == sh.pd_idx || dd_idx == sh.qd_idx)
+		dd_idx++;
+	if (dd_idx >= raid_disks)
+		return r_sector;
+	return r_sector + sectors_per_chunk - chunk_offset;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	DEFINE_WAIT_FUNC(wait, woken_wake_function);
@@ -6150,6 +6182,17 @@  static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	md_account_bio(mddev, &bi);
 
+	/*
+	 * Lets start with the stripe with the lowest chunk offset in the first
+	 * chunk. That has the best chances of creating IOs adjacent to
+	 * previous IOs in case of sequential IO and thus creates the most
+	 * sequential IO pattern. We don't bother with the optimization when
+	 * reshaping as the performance benefit is not worth the complexity.
+	 */
+	if (likely(conf->reshape_progress == MaxSector))
+		logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
+	s = (logical_sector - ctx.first_sector) >> RAID5_STRIPE_SHIFT(conf);
+
 	add_wait_queue(&conf->wait_for_overlap, &wait);
 	while (1) {
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
@@ -6178,7 +6221,7 @@  static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			continue;
 		}
 
-		s = find_first_bit(ctx.sectors_to_do, stripe_cnt);
+		s = find_next_bit_wrap(ctx.sectors_to_do, stripe_cnt, s);
 		if (s == stripe_cnt)
 			break;
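
A note on the last hunk: find_next_bit_wrap() continues the scan of
ctx.sectors_to_do from the chosen starting stripe instead of always rescanning
from bit 0, and wraps around so stripes before the start point are still
visited. A minimal userspace model of those semantics (an illustration of the
behavior only, not the kernel implementation):

#include <stdbool.h>

/* Return the index of the next set bit at or after 'offset', wrapping to the
 * start of the map if necessary, or 'size' if no bit is set at all. */
static unsigned long next_bit_wrap(const bool *map, unsigned long size,
				   unsigned long offset)
{
	unsigned long i;

	for (i = offset; i < size; i++)		/* offset .. end */
		if (map[i])
			return i;
	for (i = 0; i < offset; i++)		/* wrap: start .. offset */
		if (map[i])
			return i;
	return size;
}

Starting the scan from the lowest-chunk-offset stripe therefore changes only
the order in which pending stripes are visited, not which ones get handled,
and the existing "s == stripe_cnt" termination check still applies.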