| Message ID | 20230417171537.17899-1-jack@suse.cz (mailing list archive) |
|---|---|
| State | New, archived |
| Delegated to | Song Liu |
| Series | md/raid5: Improve performance for sequential IO |
On 2023-04-17 11:15, Jan Kara wrote:
> Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> order in which requests for underlying disks are created.
[...]
> Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good to me. I ran it through some of the functional tests I used
to develop the patch in question and found no issues.

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

> ---
>  drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
>
> I'm by no means a raid5 expert but this is what I was able to come up with.
> Any opinions or ideas on how to fix the problem in a better way?

The other option would be to revert to the old method for spinning disks
and use the pivot option only on SSDs. The pivot optimization really
only applies at IO speeds that can be achieved by fast solid state disks
anyway, and there isn't likely enough IOPS possible on spinning disks to
get enough lock contention that the optimization would provide any benefit.

So it could make sense to just fall back to the old method of preparing
the stripes in logical block order if we are running on spinning disks.
Though, that might be a bit more involved than what this patch does. So
I think this is probably a good approach, unless we want to recover more
of the lost throughput.

Logan
On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> On 2023-04-17 11:15, Jan Kara wrote:
> > Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> > order in which requests for underlying disks are created.
[...]
> > Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
> > Signed-off-by: Jan Kara <jack@suse.cz>
>
> Looks good to me. I ran it through some of the functional tests I used
> to develop the patch in question and found no issues.
>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>

Thanks Jan and Logan! I will apply this to md-next. But it may not make
6.4 release, as we are already at rc7.

[...]

> The other option would be to revert to the old method for spinning disks
> and use the pivot option only on SSDs. The pivot optimization really
> only applies at IO speeds that can be achieved by fast solid state disks
> anyway, and there isn't likely enough IOPS possible on spinning disks to
> get enough lock contention that the optimization would provide any benefit.
>
> So it could make sense to just fall back to the old method of preparing
> the stripes in logical block order if we are running on spinning disks.
> Though, that might be a bit more involved than what this patch does. So
> I think this is probably a good approach, unless we want to recover more
> of the lost throughput.

How about we only do the optimization in this patch for spinning disks?
Something like:

    if (likely(conf->reshape_progress == MaxSector) &&
        !blk_queue_nonrot(mddev->queue))
            logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);

Thanks,
Song
On Wed 19-04-23 22:26:07, Song Liu wrote:
> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> > On 2023-04-17 11:15, Jan Kara wrote:
[...]
> Thanks Jan and Logan! I will apply this to md-next. But it may not make
> 6.4 release, as we are already at rc7.

OK, sure, this is not a critical issue.

[...]

> How about we only do the optimization in this patch for spinning disks?
> Something like:
>
>     if (likely(conf->reshape_progress == MaxSector) &&
>         !blk_queue_nonrot(mddev->queue))
>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);

Yeah, makes sense. On SSD disks I was not able to observe any adverse
effects of the different stripe order. Will you update the patch or should
I respin it?

								Honza
On 2023-04-20 05:26, Jan Kara wrote:
> On Wed 19-04-23 22:26:07, Song Liu wrote:
>> On Wed, Apr 19, 2023 at 12:04 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>>> On 2023-04-17 11:15, Jan Kara wrote:
[...]
>> How about we only do the optimization in this patch for spinning disks?
>> Something like:
>>
>>     if (likely(conf->reshape_progress == MaxSector) &&
>>         !blk_queue_nonrot(mddev->queue))
>>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
>
> Yeah, makes sense. On SSD disks I was not able to observe any adverse
> effects of the different stripe order. Will you update the patch or should
> I respin it?

Does it make sense? If SSDs work fine with the new stripe order, why do
things differently for them? So I don't see a benefit of making the fix
only work with non-rotational devices. It's my original change which
could be made non-rotational only, but that's much more involved.

Logan
On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@deltatee.com> wrote:
> On 2023-04-20 05:26, Jan Kara wrote:
> > On Wed 19-04-23 22:26:07, Song Liu wrote:
[...]
> >> How about we only do the optimization in this patch for spinning disks?
> >> Something like:
> >>
> >>     if (likely(conf->reshape_progress == MaxSector) &&
> >>         !blk_queue_nonrot(mddev->queue))
> >>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
> >
> > Yeah, makes sense. On SSD disks I was not able to observe any adverse
> > effects of the different stripe order. Will you update the patch or should
> > I respin it?
>
> Does it make sense? If SSDs work fine with the new stripe order, why do
> things differently for them? So I don't see a benefit of making the fix
> only work with non-rotational devices. It's my original change which
> could be made non-rotational only, but that's much more involved.

I am hoping to make raid5_make_request() a little faster for non-rotational
devices. We may not easily observe a difference in performance, but things
add up. Does this make sense?

Thanks,
Song
On 2023-04-20 14:07, Song Liu wrote:
> On Thu, Apr 20, 2023 at 8:54 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>> On 2023-04-20 05:26, Jan Kara wrote:
>>> On Wed 19-04-23 22:26:07, Song Liu wrote:
[...]
>>>> How about we only do the optimization in this patch for spinning disks?
>>>> Something like:
>>>>
>>>>     if (likely(conf->reshape_progress == MaxSector) &&
>>>>         !blk_queue_nonrot(mddev->queue))
>>>>             logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
>>>
>>> Yeah, makes sense. On SSD disks I was not able to observe any adverse
>>> effects of the different stripe order. Will you update the patch or should
>>> I respin it?
>>
>> Does it make sense? If SSDs work fine with the new stripe order, why do
>> things differently for them? So I don't see a benefit of making the fix
>> only work with non-rotational devices. It's my original change which
>> could be made non-rotational only, but that's much more involved.
>
> I am hoping to make raid5_make_request() a little faster for non-rotational
> devices. We may not easily observe a difference in performance, but things
> add up. Does this make sense?

I guess. But without a performance test that shows that it makes an
improvement, I'm hesitant about that. It could also be that it helps a
tiny bit for non-rotational disks, but we just don't know.

Unfortunately, I don't have the time right now to do these performance
tests.

Logan
On Thu, Apr 20, 2023 at 02:10:02PM -0600, Logan Gunthorpe wrote:
> > I am hoping to make raid5_make_request() a little faster for non-rotational
> > devices. We may not easily observe a difference in performance, but things
> > add up. Does this make sense?
>
> I guess. But without a performance test that shows that it makes an
> improvement, I'm hesitant about that. It could also be that it helps a
> tiny bit for non-rotational disks, but we just don't know.
>
> Unfortunately, I don't have the time right now to do these performance
> tests.

FYI, SSDs in general do prefer sequential write streams. For most you
won't see a difference in write performance itself, but it will help with
reducing GC overhead later on.
On 2023-04-24 00:45, Christoph Hellwig wrote:
> On Thu, Apr 20, 2023 at 02:10:02PM -0600, Logan Gunthorpe wrote:
[...]
>
> FYI, SSDs in general do prefer sequential write streams. For most you
> won't see a difference in write performance itself, but it will help with
> reducing GC overhead later on.

Thanks. Yes, my colleague was able to run performance testing on this
patch and didn't find any degradation with Jan's optimization turned on.
So I don't think it's worth doing this only for rotational disks and
Jan's original patch makes sense.

Logan
Jan Kara (jack@suse.cz) wrote on Mon, Apr 17, 2023 at 02:15:37PM -03:
> Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> order in which requests for underlying disks are created.

In which version was this first applied?

We've observed a large drop in disk performance since 4.* to the point that
6.1.* is almost unusable for some tasks. For example, we have 2 backup
machines, one using raid6/ext4 and another using zfs. The backup of a third
machine with a filesystem with ~25M inodes takes 6x-8x longer on the one
using raid6/ext4 than on the one using zfs... This is during the rsync
phase. Afterwards it removes old trees and the rm with raid6 takes eons
even though the disks are not at all busy (as shown by sar). Running several
rm in parallel speeds things up a lot, showing the problem is not in the
disks.
On Mon, Apr 24, 2023 at 8:41 AM Logan Gunthorpe <logang@deltatee.com> wrote:
> On 2023-04-24 00:45, Christoph Hellwig wrote:
> > FYI, SSDs in general do prefer sequential write streams. For most you
> > won't see a difference in write performance itself, but it will help with
> > reducing GC overhead later on.

Yeah, this makes sense.

> Thanks. Yes, my colleague was able to run performance testing on this
> patch and didn't find any degradation with Jan's optimization turned on.
> So I don't think it's worth doing this only for rotational disks and
> Jan's original patch makes sense.

I will ship Jan's original patch.

Thanks,
Song
On Mon, Apr 24, 2023 at 1:25 PM Carlos Carvalho <carlos@fisica.ufpr.br> wrote:
>
> Jan Kara (jack@suse.cz) wrote on Mon, Apr 17, 2023 at 02:15:37PM -03:
> > Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
> > order in which requests for underlying disks are created.
>
> In which version was this first applied?

7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") is in 6.0+ kernels.

> We've observed a large drop in disk performance since 4.* to the point that
> 6.1.* is almost unusable for some tasks. For example, we have 2 backup
> machines, one using raid6/ext4 and another using zfs. The backup of a third
> machine with a filesystem with ~25M inodes takes 6x-8x longer on the one
> using raid6/ext4 than on the one using zfs... This is during the rsync
> phase. Afterwards it removes old trees and the rm with raid6 takes eons
> even though the disks are not at all busy (as shown by sar). Running several
> rm in parallel speeds things up a lot, showing the problem is not in the
> disks.

I am hoping to look into raid performance soon. But I cannot promise
anything at the moment.

Thanks,
Song
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b820b81d8c2..f787c9e5b10e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6079,6 +6079,38 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
         return ret;
 }
 
+/*
+ * If the bio covers multiple data disks, find sector within the bio that has
+ * the lowest chunk offset in the first chunk.
+ */
+static sector_t raid5_bio_lowest_chunk_sector(struct r5conf *conf,
+                                              struct bio *bi)
+{
+        int sectors_per_chunk = conf->chunk_sectors;
+        int raid_disks = conf->raid_disks;
+        int dd_idx;
+        struct stripe_head sh;
+        unsigned int chunk_offset;
+        sector_t r_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
+        sector_t sector;
+
+        /* We pass in fake stripe_head to get back parity disk numbers */
+        sector = raid5_compute_sector(conf, r_sector, 0, &dd_idx, &sh);
+        chunk_offset = sector_div(sector, sectors_per_chunk);
+        if (sectors_per_chunk - chunk_offset >= bio_sectors(bi))
+                return r_sector;
+        /*
+         * Bio crosses to the next data disk. Check whether it's in the same
+         * chunk.
+         */
+        dd_idx++;
+        while (dd_idx == sh.pd_idx || dd_idx == sh.qd_idx)
+                dd_idx++;
+        if (dd_idx >= raid_disks)
+                return r_sector;
+        return r_sector + sectors_per_chunk - chunk_offset;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
         DEFINE_WAIT_FUNC(wait, woken_wake_function);
@@ -6150,6 +6182,17 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
         }
         md_account_bio(mddev, &bi);
 
+        /*
+         * Lets start with the stripe with the lowest chunk offset in the first
+         * chunk. That has the best chances of creating IOs adjacent to
+         * previous IOs in case of sequential IO and thus creates the most
+         * sequential IO pattern. We don't bother with the optimization when
+         * reshaping as the performance benefit is not worth the complexity.
+         */
+        if (likely(conf->reshape_progress == MaxSector))
+                logical_sector = raid5_bio_lowest_chunk_sector(conf, bi);
+        s = (logical_sector - ctx.first_sector) >> RAID5_STRIPE_SHIFT(conf);
+
         add_wait_queue(&conf->wait_for_overlap, &wait);
         while (1) {
                 res = make_stripe_request(mddev, conf, &ctx, logical_sector,
@@ -6178,7 +6221,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
                         continue;
                 }
 
-                s = find_first_bit(ctx.sectors_to_do, stripe_cnt);
+                s = find_next_bit_wrap(ctx.sectors_to_do, stripe_cnt, s);
                 if (s == stripe_cnt)
                         break;
 
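To see what the switch from find_first_bit() to find_next_bit_wrap() in the
last hunk does to the processing order, here is a small userspace model. It
is only a sketch: the one-word bitmap, its contents and the pivot bit are
invented for the example, and find_next_bit_wrap() below is a stand-in with
the same wrap-around semantics as the kernel helper, not its implementation.
Bits are cleared as each stripe is handled, mimicking how ctx.sectors_to_do
shrinks in the driver.

/*
 * Userspace model of the wrap-around scan. The bitmap width, its contents
 * and the pivot bit are invented for this example; find_next_bit_wrap()
 * below is a one-word stand-in for the kernel helper of the same name.
 */
#include <stdio.h>

#define NBITS 16

static int find_next_bit_wrap(unsigned long map, int nbits, int start)
{
        for (int i = 0; i < nbits; i++) {
                int bit = (start + i) % nbits;

                if (map & (1UL << bit))
                        return bit;
        }
        return nbits;                           /* no bit set */
}

int main(void)
{
        unsigned long sectors_to_do = 0x0ff0;   /* stripes 4..11 still pending */
        int s = 6;              /* pretend bit 6 is the lowest-chunk-offset stripe */

        printf("submission order:");
        while (sectors_to_do) {
                s = find_next_bit_wrap(sectors_to_do, NBITS, s);
                printf(" %d", s);
                sectors_to_do &= ~(1UL << s);   /* stripe handled */
        }
        printf("\n");           /* prints: 6 7 8 9 10 11 4 5 */
        return 0;
}

With find_first_bit() the loop would always pick stripe 4 first; starting at
the pivot and wrapping keeps the first submissions adjacent to the IO
generated for the previous bio, which is the point of the patch.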
Commit 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()") changed the
order in which requests for underlying disks are created. Since for
large sequential IO the adding of requests frequently races with the
md_raid5 thread submitting bios to the underlying disks, this results in a
changed IO pattern: intermediate states of the new request-creation order
produce more, smaller, discontiguous requests. For RAID5 on top of three
rotational disks our performance testing revealed that this causes a
regression in write throughput:

iozone -a -s 131072000 -y 4 -q 8 -i 0 -i 1 -R

before 7e55c60acfbb:
            KB  reclen   write  rewrite    read   reread
     131072000       4  493670   525964  524575   513384
     131072000       8  540467   532880  512028   513703

after 7e55c60acfbb:
            KB  reclen   write  rewrite    read   reread
     131072000       4  421785   456184  531278   509248
     131072000       8  459283   456354  528449   543834

To reduce the number of discontiguous requests we can start generating
requests with the stripe that has the lowest chunk offset, as that has the
best chance of being adjacent to IO queued previously. This improves the
performance to:

            KB  reclen   write  rewrite    read   reread
     131072000       4  497682   506317  518043   514559
     131072000       8  514048   501886  506453   504319

restoring a big part of the regression.

Fixes: 7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/md/raid5.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

I'm by no means a raid5 expert but this is what I was able to come up with.
Any opinions or ideas on how to fix the problem in a better way?
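For readers who want the arithmetic of raid5_bio_lowest_chunk_sector() in
isolation, below is a minimal userspace sketch. The geometry (128-sector
chunks, 8-sector RAID5 stripes) and the example bio are invented, and the
parity-slot handling that the real helper gets from raid5_compute_sector()
is deliberately left out.

/*
 * Minimal sketch of the lowest-chunk-offset computation; not the kernel
 * implementation. Geometry values are made up, and the parity-disk checks
 * of the real raid5_bio_lowest_chunk_sector() are omitted.
 */
#include <stdio.h>

typedef unsigned long long sector_t;

static sector_t lowest_chunk_sector(sector_t bio_sector, unsigned int bio_sectors,
                                    unsigned int sectors_per_chunk,
                                    unsigned int stripe_sectors)
{
        /* Round down to the per-stripe granularity, as the driver does. */
        sector_t r_sector = bio_sector & ~((sector_t)stripe_sectors - 1);
        unsigned int chunk_offset = r_sector % sectors_per_chunk;

        /* The whole bio stays within one chunk: keep the original start. */
        if (sectors_per_chunk - chunk_offset >= bio_sectors)
                return r_sector;

        /* Otherwise start at the part that begins on the next chunk boundary. */
        return r_sector + sectors_per_chunk - chunk_offset;
}

int main(void)
{
        /* A 512-sector write starting 32 sectors into a 128-sector chunk. */
        sector_t start = lowest_chunk_sector(4128, 512, 128, 8);

        printf("start processing at sector %llu\n", start);    /* prints 4224 */
        return 0;
}

In the example the 512-sector bio starts 32 sectors into a chunk, so
processing begins at the chunk-aligned part of the bio (sector 4224), which
is the portion most likely to be contiguous with the requests created for
the previous bio.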