Message ID | 20250418010941.667138-5-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Series | md: fix is_mddev_idle() |
On Fri 18 Apr 2025 at 09:09, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflight sync_io
> will be limited if the array is not idle.
>
> However, while running mkfs.ext4 on a large raid5 array while recovery
> is in progress, it's found that sync_speed is already above speed_min
> while lots of stripes are used for sync IO, causing long delay for
> mkfs.ext4.
>
> Root cause is the following checking from is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> As a consequence, the more sync IO issued, the less likely the check
> will pass. And when completed normal IO is more than issued sync IO,
> the condition will finally pass and is_mddev_idle() will return false;
> however, last_events will be updated, hence is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the checking as follows:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disk is a partition, all other partitions don't have
>    IO completed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
>  drivers/md/md.h | 3 +-
>  2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 52cadfce7e8d..dfd85a5d6112 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
>
> [...]
>
> +static bool is_mddev_idle(struct mddev *mddev, int init)
> +{
> +	unsigned long last_events = mddev->last_events;
> +	struct gendisk *disk;
>  	struct md_rdev *rdev;
> -	int idle;
> -	int curr_events;
> +	bool idle = true;
>
> -	idle = 1;
> -	rcu_read_lock();
> -	rdev_for_each_rcu(rdev, mddev) {
> -		struct gendisk *disk = rdev->bdev->bd_disk;
> +	disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
> +	if (!disk)
> +		return true;
>
> -		if (!init && !blk_queue_io_stat(disk->queue))
> -			continue;
> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
> +	if (!init && (mddev->last_events > last_events ||
> +		      bdev_count_inflight(disk->part0)))
> +		idle = false;
>

Forgot return or goto here?

-- 
Su

> [...]
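For readers following the arithmetic in the quoted commit message, a minimal userspace sketch of the old heuristic, not kernel code: completed_io and issued_sync_io are made-up stand-ins for the disk_stats and sync_io counters, and the numbers are chosen only to show the shape of the failure.

#include <stdio.h>

int main(void)
{
	long completed_io = 0, issued_sync_io = 0;
	long last_events = 0;

	for (int tick = 1; tick <= 10; tick++) {
		completed_io += 100;	/* normal IO keeps completing */
		issued_sync_io += 90;	/* heavy sync IO is issued as well */

		/* old model: events = completed IO - issued sync IO */
		long events = completed_io - issued_sync_io;
		/* old condition: "not idle" only if events grew by > 64 */
		int busy = (events - last_events > 64);

		printf("tick %2d: events=%4ld delta=%3ld -> %s\n",
		       tick, events, events - last_events,
		       busy ? "not idle" : "reported idle");
		if (busy)
			last_events = events;
	}
	return 0;
}

With these numbers the per-tick delta is only 10, so the test first passes at tick 7 (once the accumulated delta exceeds 64), last_events is refreshed, and the test goes quiet again even though 100 sectors of normal IO complete on every tick. That is the "can only return false once in a while" behaviour the commit message describes.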
Hi,

On 2025/04/19 9:42, Su Yue wrote:
> On Fri 18 Apr 2025 at 09:09, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> [...]
>
>> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
>> +	if (!init && (mddev->last_events > last_events ||
>> +		      bdev_count_inflight(disk->part0)))
>> +		idle = false;
>>
>
> Forgot return or goto here?

No, the following still needs to be executed to init or update
rdev->last_events.

Thanks,
Kuai

> [...]
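To make Kuai's point concrete, a toy model in userspace C: other_part_sectors is a hypothetical stand-in for the "whole member disk minus our partition" accounting, and the sketch shows what an early return would break. If the rdev loop is skipped, the per-rdev baseline goes stale and a later call reports "not idle" for IO that completed long ago.

#include <stdbool.h>
#include <stdio.h>

static unsigned long other_part_sectors;	/* IO completed on sibling partitions */
static unsigned long rdev_last_events;

/* Same shape as is_rdev_holder_idle(): refresh the baseline, then compare. */
static bool holder_idle(bool init)
{
	unsigned long last = rdev_last_events;

	rdev_last_events = other_part_sectors;
	return init || rdev_last_events <= last;
}

int main(void)
{
	holder_idle(true);		/* init pass sets the baseline */

	other_part_sectors += 8;	/* sibling partition IO completes... */
	/* ...but suppose is_mddev_idle() returned early here, skipping the
	 * rdev loop, so holder_idle() never ran and the baseline is stale. */

	/* Much later, with no new sibling IO at all: */
	printf("%s\n", holder_idle(false) ? "idle" : "not idle");
	return 0;
}

This prints "not idle" even though the sibling IO is long finished, which is why the loop must always run to init or update rdev->last_events.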
On Sat 19 Apr 2025 at 10:00, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi,
>
> On 2025/04/19 9:42, Su Yue wrote:
>
> [...]
>
>>> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
>>> +	if (!init && (mddev->last_events > last_events ||
>>> +		      bdev_count_inflight(disk->part0)))
>>> +		idle = false;
>>>
>>
>> Forgot return or goto here?
>
> No, the following still needs to be executed to init or update
> rdev->last_events.

Okay, I see. is_rdev_holder_idle() does the work.

-- 
Su

> Thanks,
> Kuai
>
> [...]
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 52cadfce7e8d..dfd85a5d6112 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
 	put_cluster_ops(mddev);
 }
 
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
 {
+	unsigned long last_events = rdev->last_events;
+
+	if (!bdev_is_partition(rdev->bdev))
+		return true;
+
+	/*
+	 * If rdev is partition, and user doesn't issue IO to the array, the
+	 * array is still not idle if user issues IO to other partitions.
+	 */
+	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+						 sectors) -
+			    part_stat_read_accum(rdev->bdev, sectors);
+
+	if (!init && rdev->last_events > last_events)
+		return false;
+
+	return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions doesn't have IO
+ *    completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
+{
+	unsigned long last_events = mddev->last_events;
+	struct gendisk *disk;
 	struct md_rdev *rdev;
-	int idle;
-	int curr_events;
+	bool idle = true;
 
-	idle = 1;
-	rcu_read_lock();
-	rdev_for_each_rcu(rdev, mddev) {
-		struct gendisk *disk = rdev->bdev->bd_disk;
+	disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
+	if (!disk)
+		return true;
 
-		if (!init && !blk_queue_io_stat(disk->queue))
-			continue;
+	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
+	if (!init && (mddev->last_events > last_events ||
+		      bdev_count_inflight(disk->part0)))
+		idle = false;
 
-		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
-			      atomic_read(&disk->sync_io);
-		/* sync IO will cause sync_io to increase before the disk_stats
-		 * as sync_io is counted when a request starts, and
-		 * disk_stats is counted when it completes.
-		 * So resync activity will cause curr_events to be smaller than
-		 * when there was no such activity.
-		 * non-sync IO will cause disk_stat to increase without
-		 * increasing sync_io so curr_events will (eventually)
-		 * be larger than it was before.  Once it becomes
-		 * substantially larger, the test below will cause
-		 * the array to appear non-idle, and resync will slow
-		 * down.
-		 * If there is a lot of outstanding resync activity when
-		 * we set last_event to curr_events, then all that activity
-		 * completing might cause the array to appear non-idle
-		 * and resync will be slowed down even though there might
-		 * not have been non-resync activity.  This will only
-		 * happen once though.  'last_events' will soon reflect
-		 * the state where there is little or no outstanding
-		 * resync requests, and further resync activity will
-		 * always make curr_events less than last_events.
-		 *
-		 */
-		if (init || curr_events - rdev->last_events > 64) {
-			rdev->last_events = curr_events;
-			idle = 0;
-		}
-	}
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
+		if (!is_rdev_holder_idle(rdev, init))
+			idle = false;
 	rcu_read_unlock();
+
 	return idle;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b57842188f18..1d51c2405d3d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
 
 	sector_t sectors;		/* Device size (in 512bytes sectors) */
 	struct mddev *mddev;		/* RAID array if running */
-	int last_events;		/* IO event timestamp */
+	unsigned long last_events;	/* IO event timestamp */
 
 	/*
 	 * If meta_bdev is non-NULL, it means that a separate device is
@@ -520,6 +520,7 @@ struct mddev {
 					 * adding a spare
 					 */
 
+	unsigned long last_events;	/* IO event timestamp */
 	atomic_t recovery_active;	/* blocks scheduled, but not written */
 	wait_queue_head_t recovery_wait;
 	sector_t recovery_cp;
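As a companion to the patch, a minimal userspace sketch of the new two-level heuristic. The kernel accounting reads are replaced with plain counters (array_sectors, array_inflight, disk_sectors and part_sectors are hypothetical test-harness variables, and the bdev_is_partition() test is omitted: both members are assumed to be partitions); only the control flow mirrors is_mddev_idle() and is_rdev_holder_idle().

#include <stdbool.h>
#include <stdio.h>

/*
 * In the kernel these values come from part_stat_read_accum() and
 * bdev_count_inflight(); here a harness would bump them by hand.
 */
static unsigned long array_sectors;	/* completed IO on the md gendisk */
static unsigned int array_inflight;	/* inflight IO on the md gendisk */
static unsigned long disk_sectors[2];	/* whole member disk (part0) */
static unsigned long part_sectors[2];	/* the md member partition itself */

static unsigned long mddev_last_events;
static unsigned long rdev_last_events[2];

/* Condition 3: did other partitions of this member disk complete IO? */
static bool rdev_holder_idle(int i, bool init)
{
	unsigned long last = rdev_last_events[i];

	rdev_last_events[i] = disk_sectors[i] - part_sectors[i];
	return init || rdev_last_events[i] <= last;
}

/* Conditions 1 and 2 on the array, then condition 3 for each member. */
static bool array_idle(bool init)
{
	unsigned long last = mddev_last_events;
	bool idle = true;

	mddev_last_events = array_sectors;
	if (!init && (mddev_last_events > last || array_inflight))
		idle = false;

	/* No early return: every baseline is refreshed, as in the patch. */
	for (int i = 0; i < 2; i++)
		if (!rdev_holder_idle(i, init))
			idle = false;

	return idle;
}

int main(void)
{
	array_idle(true);			/* init pass */
	printf("%s\n", array_idle(false) ? "idle" : "not idle");

	disk_sectors[0] += 8;			/* IO to a sibling partition */
	printf("%s\n", array_idle(false) ? "idle" : "not idle");
	return 0;
}

Running this prints "idle" then "not idle": with no counter movement the array is idle, and IO completed on a sibling partition of a member disk makes it non-idle via condition 3, without any of the sync_io bookkeeping the old heuristic relied on.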