Message ID | 20250418010941.667138-5-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Series | md: fix is_mddev_idle() |
On Fri 18 Apr 2025 at 09:09, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> From: Yu Kuai <yukuai3@huawei.com>
>
> If sync_speed is above speed_min, then is_mddev_idle() will be called
> for each sync IO to check if the array is idle, and inflight sync_io
> will be limited if the array is not idle.
>
> However, while running mkfs.ext4 on a large raid5 array while recovery
> is in progress, it's found that sync_speed is already above speed_min
> while lots of stripes are used for sync IO, causing long delay for
> mkfs.ext4.
>
> Root cause is the following checking from is_mddev_idle():
>
> t1: submit sync IO: events1 = completed IO - issued sync IO
> t2: submit next sync IO: events2 = completed IO - issued sync IO
> if (events2 - events1 > 64)
>
> As a consequence, the more sync IO issued, the less likely the check
> will pass. And when completed normal IO is more than issued sync IO,
> the condition will finally pass and is_mddev_idle() will return false;
> however, last_events will be updated, hence is_mddev_idle() can only
> return false once in a while.
>
> Fix this problem by changing the checking as follows:
>
> 1) mddev doesn't have normal IO completed;
> 2) mddev doesn't have normal IO inflight;
> 3) if any member disk is a partition, all other partitions don't have
>    IO completed.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  drivers/md/md.c | 84 +++++++++++++++++++++++++++----------------------
>  drivers/md/md.h | 3 +-
>  2 files changed, 48 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 52cadfce7e8d..dfd85a5d6112 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
>
> [...]
>
> +static bool is_mddev_idle(struct mddev *mddev, int init)
> +{
> +	unsigned long last_events = mddev->last_events;
> +	struct gendisk *disk;
>  	struct md_rdev *rdev;
> -	int idle;
> -	int curr_events;
> +	bool idle = true;
>
> -	idle = 1;
> -	rcu_read_lock();
> -	rdev_for_each_rcu(rdev, mddev) {
> -		struct gendisk *disk = rdev->bdev->bd_disk;
> +	disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
> +	if (!disk)
> +		return true;
>
> -		if (!init && !blk_queue_io_stat(disk->queue))
> -			continue;
> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
> +	if (!init && (mddev->last_events > last_events ||
> +		      bdev_count_inflight(disk->part0)))
> +		idle = false;
>

Forgot return or goto here?

-- 
Su

> [...]
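For readers following the arithmetic in the quoted commit message, a minimal userspace sketch of the old heuristic, not kernel code: completed_io and issued_sync_io are made-up stand-ins for the disk_stats and sync_io counters, and the numbers are chosen only to show the shape of the failure.

#include <stdio.h>

int main(void)
{
	long completed_io = 0, issued_sync_io = 0;
	long last_events = 0;

	for (int tick = 1; tick <= 10; tick++) {
		completed_io += 100;	/* normal IO keeps completing */
		issued_sync_io += 90;	/* heavy sync IO is issued as well */

		/* old model: events = completed IO - issued sync IO */
		long events = completed_io - issued_sync_io;
		/* old condition: "not idle" only if events grew by > 64 */
		int busy = (events - last_events > 64);

		printf("tick %2d: events=%4ld delta=%3ld -> %s\n",
		       tick, events, events - last_events,
		       busy ? "not idle" : "reported idle");
		if (busy)
			last_events = events;
	}
	return 0;
}

With these numbers the per-tick delta is only 10, so the test first passes at tick 7 (once the accumulated delta exceeds 64), last_events is refreshed, and the test goes quiet again even though 100 sectors of normal IO complete on every tick. That is the "can only return false once in a while" behaviour the commit message describes.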
Hi,

On 2025/04/19 9:42, Su Yue wrote:
> On Fri 18 Apr 2025 at 09:09, Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> [...]
>
>> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
>> +	if (!init && (mddev->last_events > last_events ||
>> +		      bdev_count_inflight(disk->part0)))
>> +		idle = false;
>>
>
> Forgot return or goto here?

No, the following still needs to be executed to init or update
rdev->last_events.

Thanks,
Kuai

> [...]
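To make Kuai's point concrete, a toy model in userspace C: other_part_sectors is a hypothetical stand-in for the "whole member disk minus our partition" accounting, and the sketch shows what an early return would break. If the rdev loop is skipped, the per-rdev baseline goes stale and a later call reports "not idle" for IO that completed long ago.

#include <stdbool.h>
#include <stdio.h>

static unsigned long other_part_sectors;	/* IO completed on sibling partitions */
static unsigned long rdev_last_events;

/* Same shape as is_rdev_holder_idle(): refresh the baseline, then compare. */
static bool holder_idle(bool init)
{
	unsigned long last = rdev_last_events;

	rdev_last_events = other_part_sectors;
	return init || rdev_last_events <= last;
}

int main(void)
{
	holder_idle(true);		/* init pass sets the baseline */

	other_part_sectors += 8;	/* sibling partition IO completes... */
	/* ...but suppose is_mddev_idle() returned early here, skipping the
	 * rdev loop, so holder_idle() never ran and the baseline is stale. */

	/* Much later, with no new sibling IO at all: */
	printf("%s\n", holder_idle(false) ? "idle" : "not idle");
	return 0;
}

This prints "not idle" even though the sibling IO is long finished, which is why the loop must always run to init or update rdev->last_events.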
On Sat 19 Apr 2025 at 10:00, Yu Kuai <yukuai1@huaweicloud.com> wrote:

> Hi,
>
> On 2025/04/19 9:42, Su Yue wrote:
>
> [...]
>
>>> +	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
>>> +	if (!init && (mddev->last_events > last_events ||
>>> +		      bdev_count_inflight(disk->part0)))
>>> +		idle = false;
>>>
>>
>> Forgot return or goto here?
>
> No, the following still needs to be executed to init or update
> rdev->last_events.

Okay, I see. is_rdev_holder_idle() does the work.

-- 
Su

> Thanks,
> Kuai
>
> [...]
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 52cadfce7e8d..dfd85a5d6112 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8625,50 +8625,58 @@ void md_cluster_stop(struct mddev *mddev)
 	put_cluster_ops(mddev);
 }
 
-static int is_mddev_idle(struct mddev *mddev, int init)
+static bool is_rdev_holder_idle(struct md_rdev *rdev, bool init)
 {
+	unsigned long last_events = rdev->last_events;
+
+	if (!bdev_is_partition(rdev->bdev))
+		return true;
+
+	/*
+	 * If rdev is partition, and user doesn't issue IO to the array, the
+	 * array is still not idle if user issues IO to other partitions.
+	 */
+	rdev->last_events = part_stat_read_accum(rdev->bdev->bd_disk->part0,
+						 sectors) -
+			    part_stat_read_accum(rdev->bdev, sectors);
+
+	if (!init && rdev->last_events > last_events)
+		return false;
+
+	return true;
+}
+
+/*
+ * mddev is idle if following conditions are match since last check:
+ * 1) mddev doesn't have normal IO completed;
+ * 2) mddev doesn't have inflight normal IO;
+ * 3) if any member disk is partition, and other partitions doesn't have IO
+ *    completed;
+ *
+ * Noted this checking rely on IO accounting is enabled.
+ */
+static bool is_mddev_idle(struct mddev *mddev, int init)
+{
+	unsigned long last_events = mddev->last_events;
+	struct gendisk *disk;
 	struct md_rdev *rdev;
-	int idle;
-	int curr_events;
+	bool idle = true;
 
-	idle = 1;
-	rcu_read_lock();
-	rdev_for_each_rcu(rdev, mddev) {
-		struct gendisk *disk = rdev->bdev->bd_disk;
+	disk = mddev_is_dm(mddev) ? mddev->dm_gendisk : mddev->gendisk;
+	if (!disk)
+		return true;
 
-		if (!init && !blk_queue_io_stat(disk->queue))
-			continue;
+	mddev->last_events = part_stat_read_accum(disk->part0, sectors);
+	if (!init && (mddev->last_events > last_events ||
+		      bdev_count_inflight(disk->part0)))
+		idle = false;
 
-		curr_events = (int)part_stat_read_accum(disk->part0, sectors) -
-			      atomic_read(&disk->sync_io);
-		/* sync IO will cause sync_io to increase before the disk_stats
-		 * as sync_io is counted when a request starts, and
-		 * disk_stats is counted when it completes.
-		 * So resync activity will cause curr_events to be smaller than
-		 * when there was no such activity.
-		 * non-sync IO will cause disk_stat to increase without
-		 * increasing sync_io so curr_events will (eventually)
-		 * be larger than it was before.  Once it becomes
-		 * substantially larger, the test below will cause
-		 * the array to appear non-idle, and resync will slow
-		 * down.
-		 * If there is a lot of outstanding resync activity when
-		 * we set last_event to curr_events, then all that activity
-		 * completing might cause the array to appear non-idle
-		 * and resync will be slowed down even though there might
-		 * not have been non-resync activity.  This will only
-		 * happen once though.  'last_events' will soon reflect
-		 * the state where there is little or no outstanding
-		 * resync requests, and further resync activity will
-		 * always make curr_events less than last_events.
-		 *
-		 */
-		if (init || curr_events - rdev->last_events > 64) {
-			rdev->last_events = curr_events;
-			idle = 0;
-		}
-	}
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
+		if (!is_rdev_holder_idle(rdev, init))
+			idle = false;
 	rcu_read_unlock();
+
 	return idle;
 }
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b57842188f18..1d51c2405d3d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -132,7 +132,7 @@ struct md_rdev {
 
 	sector_t sectors;		/* Device size (in 512bytes sectors) */
 	struct mddev *mddev;		/* RAID array if running */
-	int last_events;		/* IO event timestamp */
+	unsigned long last_events;	/* IO event timestamp */
 
 	/*
 	 * If meta_bdev is non-NULL, it means that a separate device is
@@ -520,6 +520,7 @@ struct mddev {
 					 * adding a spare
 					 */
 
+	unsigned long last_events;	/* IO event timestamp */
 	atomic_t recovery_active;	/* blocks scheduled, but not written */
 	wait_queue_head_t recovery_wait;
 	sector_t recovery_cp;
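As a companion to the patch, a minimal userspace sketch of the new two-level heuristic. The kernel accounting reads are replaced with plain counters (array_sectors, array_inflight, disk_sectors and part_sectors are hypothetical test-harness variables, and the bdev_is_partition() test is omitted: both members are assumed to be partitions); only the control flow mirrors is_mddev_idle() and is_rdev_holder_idle().

#include <stdbool.h>
#include <stdio.h>

/*
 * In the kernel these values come from part_stat_read_accum() and
 * bdev_count_inflight(); here a harness would bump them by hand.
 */
static unsigned long array_sectors;	/* completed IO on the md gendisk */
static unsigned int array_inflight;	/* inflight IO on the md gendisk */
static unsigned long disk_sectors[2];	/* whole member disk (part0) */
static unsigned long part_sectors[2];	/* the md member partition itself */

static unsigned long mddev_last_events;
static unsigned long rdev_last_events[2];

/* Condition 3: did other partitions of this member disk complete IO? */
static bool rdev_holder_idle(int i, bool init)
{
	unsigned long last = rdev_last_events[i];

	rdev_last_events[i] = disk_sectors[i] - part_sectors[i];
	return init || rdev_last_events[i] <= last;
}

/* Conditions 1 and 2 on the array, then condition 3 for each member. */
static bool array_idle(bool init)
{
	unsigned long last = mddev_last_events;
	bool idle = true;

	mddev_last_events = array_sectors;
	if (!init && (mddev_last_events > last || array_inflight))
		idle = false;

	/* No early return: every baseline is refreshed, as in the patch. */
	for (int i = 0; i < 2; i++)
		if (!rdev_holder_idle(i, init))
			idle = false;

	return idle;
}

int main(void)
{
	array_idle(true);			/* init pass */
	printf("%s\n", array_idle(false) ? "idle" : "not idle");

	disk_sectors[0] += 8;			/* IO to a sibling partition */
	printf("%s\n", array_idle(false) ? "idle" : "not idle");
	return 0;
}

Running this prints "idle" then "not idle": with no counter movement the array is idle, and IO completed on a sibling partition of a member disk makes it non-idle via condition 3, without any of the sync_io bookkeeping the old heuristic relied on.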