diff mbox series

md: only unlock mddev from action_store

Message ID 20220602134509.16662-1-guoqing.jiang@linux.dev (mailing list archive)
State New, archived
Headers show
Series md: only unlock mddev from action_store | expand

Commit Message

Guoqing Jiang June 2, 2022, 1:45 p.m. UTC
The 07reshape5intr test is broke because of below path.

    md_reap_sync_thread
            -> mddev_unlock
            -> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by,

mddev_unlock -> md_wakeup_thread(mddev->thread)

then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.

Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed, let's

1. only unlock mddev in md_reap_sync_thread if caller is action_store,
   so the parameter is renamed to reflect the change.
2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
   could be changed by other processes, then restore them after get lock
   again.

Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
I suppose the previous bug still can be fixed with the change, but it is
better to verify it. Donald, could you help to test the new code?

Thanks,
Guoqing

 drivers/md/md.c | 24 ++++++++++++++++++------
 drivers/md/md.h |  2 +-
 2 files changed, 19 insertions(+), 7 deletions(-)

Comments

Logan Gunthorpe June 2, 2022, 6:03 p.m. UTC | #1
On 2022-06-02 07:45, Guoqing Jiang wrote:
> The 07reshape5intr test is broke because of below path.
> 
>     md_reap_sync_thread
>             -> mddev_unlock
>             -> md_unregister_thread(&mddev->sync_thread)
> 
> And md_check_recovery is triggered by,
> 
> mddev_unlock -> md_wakeup_thread(mddev->thread)
> 
> then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
> since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
> feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
> reshape_position can't be updated accordingly.
> 
> Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
> with reconfig_mutex held") fixed is related with action_store path, other
> callers which reap sync_thread didn't need to be changed, let's
> 
> 1. only unlock mddev in md_reap_sync_thread if caller is action_store,
>    so the parameter is renamed to reflect the change.
> 2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
>    could be changed by other processes, then restore them after get lock
>    again.
> 
> Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
> I suppose the previous bug still can be fixed with the change, but it is
> better to verify it. Donald, could you help to test the new code?
> 
> Thanks,
> Guoqing
> 
>  drivers/md/md.c | 24 ++++++++++++++++++------
>  drivers/md/md.h |  2 +-
>  2 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 5c8efef13881..3387260dd55b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
>  		flush_workqueue(md_misc_wq);
>  	if (mddev->sync_thread) {
>  		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -		md_reap_sync_thread(mddev, true);
> +		md_reap_sync_thread(mddev, false);
>  	}
>  
>  	del_timer_sync(&mddev->safemode_timer);
> @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
>  			 * ->spare_active and clear saved_raid_disk
>  			 */
>  			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -			md_reap_sync_thread(mddev, true);
> +			md_reap_sync_thread(mddev, false);
>  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
> @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
>  			goto unlock;
>  		}
>  		if (mddev->sync_thread) {
> -			md_reap_sync_thread(mddev, true);
> +			md_reap_sync_thread(mddev, false);
>  			goto unlock;
>  		}
>  		/* Set RUNNING before clearing NEEDED to avoid
> @@ -9411,18 +9411,30 @@ void md_check_recovery(struct mddev *mddev)
>  }
>  EXPORT_SYMBOL(md_check_recovery);
>  
> -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
> +void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
>  {
>  	struct md_rdev *rdev;
>  	sector_t old_dev_sectors = mddev->dev_sectors;
> +	sector_t old_reshape_position;
>  	bool is_reshaped = false;
> +	bool is_interrupted = false;
>  
> -	if (reconfig_mutex_held)
> +	if (unlock_mddev) {
> +		old_reshape_position = mddev->reshape_position;
> +		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> +			is_interrupted = true;
>  		mddev_unlock(mddev);
> +	}
>  	/* resync has finished, collect result */
>  	md_unregister_thread(&mddev->sync_thread);
> -	if (reconfig_mutex_held)
> +	if (unlock_mddev) {
>  		mddev_lock_nointr(mddev);
> +		/* restore the previous flag and position */
> +		mddev->reshape_position = old_reshape_position;
> +		if (is_interrupted)
> +			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> +	}

Maybe instead of the ugly boolean argument we could pull
md_unregister_thread() into all the callers and explicitly unlock in the
single call site that needs it?

Logan
Donald Buczek June 3, 2022, 6:24 a.m. UTC | #2
On 6/2/22 3:45 PM, Guoqing Jiang wrote:
> The 07reshape5intr test is broke because of below path.
> 
>      md_reap_sync_thread
>              -> mddev_unlock
>              -> md_unregister_thread(&mddev->sync_thread)
> 
> And md_check_recovery is triggered by,
> 
> mddev_unlock -> md_wakeup_thread(mddev->thread)
> 
> then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
> since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
> feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
> reshape_position can't be updated accordingly.
> 
> Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
> with reconfig_mutex held") fixed is related with action_store path, other
> callers which reap sync_thread didn't need to be changed, let's
> 
> 1. only unlock mddev in md_reap_sync_thread if caller is action_store,
>     so the parameter is renamed to reflect the change.
> 2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
>     could be changed by other processes, then restore them after get lock
>     again.
> 
> Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
> ---
> I suppose the previous bug still can be fixed with the change, but it is
> better to verify it. Donald, could you help to test the new code?

Sure. But Probably not before next week and there is a holiday in Germany on Monday.

Best
   Donald

> 
> Thanks,
> Guoqing
> 
>   drivers/md/md.c | 24 ++++++++++++++++++------
>   drivers/md/md.h |  2 +-
>   2 files changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 5c8efef13881..3387260dd55b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
>   		flush_workqueue(md_misc_wq);
>   	if (mddev->sync_thread) {
>   		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -		md_reap_sync_thread(mddev, true);
> +		md_reap_sync_thread(mddev, false);
>   	}
>   
>   	del_timer_sync(&mddev->safemode_timer);
> @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
>   			 * ->spare_active and clear saved_raid_disk
>   			 */
>   			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -			md_reap_sync_thread(mddev, true);
> +			md_reap_sync_thread(mddev, false);
>   			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>   			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>   			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
> @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
>   			goto unlock;
>   		}
>   		if (mddev->sync_thread) {
> -			md_reap_sync_thread(mddev, true);
> +			md_reap_sync_thread(mddev, false);
>   			goto unlock;
>   		}
>   		/* Set RUNNING before clearing NEEDED to avoid
> @@ -9411,18 +9411,30 @@ void md_check_recovery(struct mddev *mddev)
>   }
>   EXPORT_SYMBOL(md_check_recovery);
>   
> -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
> +void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
>   {
>   	struct md_rdev *rdev;
>   	sector_t old_dev_sectors = mddev->dev_sectors;
> +	sector_t old_reshape_position;
>   	bool is_reshaped = false;
> +	bool is_interrupted = false;
>   
> -	if (reconfig_mutex_held)
> +	if (unlock_mddev) {
> +		old_reshape_position = mddev->reshape_position;
> +		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
> +			is_interrupted = true;
>   		mddev_unlock(mddev);
> +	}
>   	/* resync has finished, collect result */
>   	md_unregister_thread(&mddev->sync_thread);
> -	if (reconfig_mutex_held)
> +	if (unlock_mddev) {
>   		mddev_lock_nointr(mddev);
> +		/* restore the previous flag and position */
> +		mddev->reshape_position = old_reshape_position;
> +		if (is_interrupted)
> +			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> +	}
> +
>   	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
>   	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
>   	    mddev->degraded != mddev->raid_disks) {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 82465a4b1588..3e3ff3b20e81 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -719,7 +719,7 @@ extern struct md_thread *md_register_thread(
>   extern void md_unregister_thread(struct md_thread **threadp);
>   extern void md_wakeup_thread(struct md_thread *thread);
>   extern void md_check_recovery(struct mddev *mddev);
> -extern void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held);
> +extern void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev);
>   extern int mddev_init_writes_pending(struct mddev *mddev);
>   extern bool md_write_start(struct mddev *mddev, struct bio *bi);
>   extern void md_write_inc(struct mddev *mddev, struct bio *bi);
>
Guoqing Jiang June 6, 2022, 3:08 a.m. UTC | #3
On 6/3/22 2:03 AM, Logan Gunthorpe wrote:
>
>
> On 2022-06-02 07:45, Guoqing Jiang wrote:
>> The 07reshape5intr test is broke because of below path.
>>
>>      md_reap_sync_thread
>>              -> mddev_unlock
>>              -> md_unregister_thread(&mddev->sync_thread)
>>
>> And md_check_recovery is triggered by,
>>
>> mddev_unlock -> md_wakeup_thread(mddev->thread)
>>
>> then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
>> since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
>> feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
>> reshape_position can't be updated accordingly.
>>
>> Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
>> with reconfig_mutex held") fixed is related with action_store path, other
>> callers which reap sync_thread didn't need to be changed, let's
>>
>> 1. only unlock mddev in md_reap_sync_thread if caller is action_store,
>>     so the parameter is renamed to reflect the change.
>> 2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
>>     could be changed by other processes, then restore them after get lock
>>     again.
>>
>> Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
>> Reported-by: Logan Gunthorpe <logang@deltatee.com>
>> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
>> ---
>> I suppose the previous bug still can be fixed with the change, but it is
>> better to verify it. Donald, could you help to test the new code?
>>
>> Thanks,
>> Guoqing
>>
>>   drivers/md/md.c | 24 ++++++++++++++++++------
>>   drivers/md/md.h |  2 +-
>>   2 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 5c8efef13881..3387260dd55b 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
>>   		flush_workqueue(md_misc_wq);
>>   	if (mddev->sync_thread) {
>>   		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -		md_reap_sync_thread(mddev, true);
>> +		md_reap_sync_thread(mddev, false);
>>   	}
>>   
>>   	del_timer_sync(&mddev->safemode_timer);
>> @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
>>   			 * ->spare_active and clear saved_raid_disk
>>   			 */
>>   			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -			md_reap_sync_thread(mddev, true);
>> +			md_reap_sync_thread(mddev, false);
>>   			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>>   			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>   			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>> @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
>>   			goto unlock;
>>   		}
>>   		if (mddev->sync_thread) {
>> -			md_reap_sync_thread(mddev, true);
>> +			md_reap_sync_thread(mddev, false);
>>   			goto unlock;
>>   		}
>>   		/* Set RUNNING before clearing NEEDED to avoid
>> @@ -9411,18 +9411,30 @@ void md_check_recovery(struct mddev *mddev)
>>   }
>>   EXPORT_SYMBOL(md_check_recovery);
>>   
>> -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
>> +void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
>>   {
>>   	struct md_rdev *rdev;
>>   	sector_t old_dev_sectors = mddev->dev_sectors;
>> +	sector_t old_reshape_position;
>>   	bool is_reshaped = false;
>> +	bool is_interrupted = false;
>>   
>> -	if (reconfig_mutex_held)
>> +	if (unlock_mddev) {
>> +		old_reshape_position = mddev->reshape_position;
>> +		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>> +			is_interrupted = true;

Hmm, this checking is not needed as it is always true from action_store.

>>   		mddev_unlock(mddev);
>> +	}
>>   	/* resync has finished, collect result */
>>   	md_unregister_thread(&mddev->sync_thread);
>> -	if (reconfig_mutex_held)
>> +	if (unlock_mddev) {
>>   		mddev_lock_nointr(mddev);
>> +		/* restore the previous flag and position */
>> +		mddev->reshape_position = old_reshape_position;
>> +		if (is_interrupted)
>> +			set_bit(MD_RECOVERY_INTR, &mddev->recovery);

So is this one.

>> +	}
> Maybe instead of the ugly boolean argument we could pull
> md_unregister_thread() into all the callers and explicitly unlock in the
> single call site that needs it?

Sounds good, let me revert previous commit then pull it.

Thanks,
Guoqing
Guoqing Jiang June 6, 2022, 3:29 a.m. UTC | #4
On 6/3/22 2:03 AM, Logan Gunthorpe wrote:
>
>
> On 2022-06-02 07:45, Guoqing Jiang wrote:
>> The 07reshape5intr test is broke because of below path.
>>
>>      md_reap_sync_thread
>>              -> mddev_unlock
>>              -> md_unregister_thread(&mddev->sync_thread)
>>
>> And md_check_recovery is triggered by,
>>
>> mddev_unlock -> md_wakeup_thread(mddev->thread)
>>
>> then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
>> since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
>> feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
>> reshape_position can't be updated accordingly.
>>
>> Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
>> with reconfig_mutex held") fixed is related with action_store path, other
>> callers which reap sync_thread didn't need to be changed, let's
>>
>> 1. only unlock mddev in md_reap_sync_thread if caller is action_store,
>>     so the parameter is renamed to reflect the change.
>> 2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
>>     could be changed by other processes, then restore them after get lock
>>     again.
>>
>> Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
>> Reported-by: Logan Gunthorpe <logang@deltatee.com>
>> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
>> ---
>> I suppose the previous bug still can be fixed with the change, but it is
>> better to verify it. Donald, could you help to test the new code?
>>
>> Thanks,
>> Guoqing
>>
>>   drivers/md/md.c | 24 ++++++++++++++++++------
>>   drivers/md/md.h |  2 +-
>>   2 files changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 5c8efef13881..3387260dd55b 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
>>   		flush_workqueue(md_misc_wq);
>>   	if (mddev->sync_thread) {
>>   		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -		md_reap_sync_thread(mddev, true);
>> +		md_reap_sync_thread(mddev, false);
>>   	}
>>   
>>   	del_timer_sync(&mddev->safemode_timer);
>> @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
>>   			 * ->spare_active and clear saved_raid_disk
>>   			 */
>>   			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> -			md_reap_sync_thread(mddev, true);
>> +			md_reap_sync_thread(mddev, false);
>>   			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>>   			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>   			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>> @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
>>   			goto unlock;
>>   		}
>>   		if (mddev->sync_thread) {
>> -			md_reap_sync_thread(mddev, true);
>> +			md_reap_sync_thread(mddev, false);
>>   			goto unlock;
>>   		}
>>   		/* Set RUNNING before clearing NEEDED to avoid
>> @@ -9411,18 +9411,30 @@ void md_check_recovery(struct mddev *mddev)
>>   }
>>   EXPORT_SYMBOL(md_check_recovery);
>>   
>> -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
>> +void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
>>   {
>>   	struct md_rdev *rdev;
>>   	sector_t old_dev_sectors = mddev->dev_sectors;
>> +	sector_t old_reshape_position;
>>   	bool is_reshaped = false;
>> +	bool is_interrupted = false;
>>   
>> -	if (reconfig_mutex_held)
>> +	if (unlock_mddev) {
>> +		old_reshape_position = mddev->reshape_position;
>> +		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>> +			is_interrupted = true;
>>   		mddev_unlock(mddev);
>> +	}
>>   	/* resync has finished, collect result */
>>   	md_unregister_thread(&mddev->sync_thread);
>> -	if (reconfig_mutex_held)
>> +	if (unlock_mddev) {
>>   		mddev_lock_nointr(mddev);
>> +		/* restore the previous flag and position */
>> +		mddev->reshape_position = old_reshape_position;
>> +		if (is_interrupted)
>> +			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> +	}
> Maybe instead of the ugly boolean argument we could pull
> md_unregister_thread() into all the callers and explicitly unlock in the
> single call site that needs it?

After move "md_unregister_thread(&mddev->sync_thread)", then we need to
rename md_reap_sync_thread given it doesn't unregister sync_thread, any
suggestion about the new name? md_behind_reap_sync_thread?

Thanks,
Guoqing
Xiao Ni June 6, 2022, 8:39 a.m. UTC | #5
On Mon, Jun 6, 2022 at 11:08 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> >> +    }
> > Maybe instead of the ugly boolean argument we could pull
> > md_unregister_thread() into all the callers and explicitly unlock in the
> > single call site that needs it?
>
> Sounds good, let me revert previous commit then pull it.
>

Hi all

Now md_reap_sync_thread is called by __md_stop_writes, action_store
and md_check_recovery.
If we move md_unregister_thread out of md_reap_sync_thread and unlock
reconfig_mutex before
calling md_unregister_thread, it means we break the atomic action for
these three functions mentioned
above. Not sure if it can introduce other problems, especially for the
change of md_check_recovery.

How about suspend I/O when changing the sync action? It should not
hurt the performance, because
it only interrupts the sync action and flush the in memory stripes.
For this way, it needs to revert
7e6ba434cc60 (md: don't unregister sync_thread with reconfig_mutex
held) and changes like this:

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ce9d2845d3ac..af28cdeaf7e1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4827,12 +4827,14 @@ action_store(struct mddev *mddev, const char
*page, size_t len)
                        clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
                if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
                    mddev_lock(mddev) == 0) {
+                       mddev_suspend(mddev);
                        if (work_pending(&mddev->del_work))
                                flush_workqueue(md_misc_wq);
                        if (mddev->sync_thread) {
                                set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                                md_reap_sync_thread(mddev);
                        }
+                       mddev_resume(mddev);
                        mddev_unlock(mddev);
                }
        } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

mddev_suspend can stop the I/O and the superblock can be updated too. Because
MD_ALLOW_SB_UPDATE is set. So raid5d can go on handling stripes.

Best Regards
Xiao
Guoqing Jiang June 6, 2022, 9 a.m. UTC | #6
On 6/6/22 4:39 PM, Xiao Ni wrote:
> On Mon, Jun 6, 2022 at 11:08 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>>
>>
>>>> +    }
>>> Maybe instead of the ugly boolean argument we could pull
>>> md_unregister_thread() into all the callers and explicitly unlock in the
>>> single call site that needs it?
>> Sounds good, let me revert previous commit then pull it.
>>
> Hi all
>
> Now md_reap_sync_thread is called by __md_stop_writes, action_store
> and md_check_recovery.
> If we move md_unregister_thread out of md_reap_sync_thread and unlock
> reconfig_mutex before
> calling md_unregister_thread, it means we break the atomic action for
> these three functions mentioned
> above.

No, only action_store need to be changed, other callers keep the original
behavior.

>   Not sure if it can introduce other problems, especially for the
> change of md_check_recovery.
>
> How about suspend I/O when changing the sync action? It should not
> hurt the performance, because
> it only interrupts the sync action and flush the in memory stripes.
> For this way, it needs to revert
> 7e6ba434cc60 (md: don't unregister sync_thread with reconfig_mutex
> held) and changes like this:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ce9d2845d3ac..af28cdeaf7e1 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4827,12 +4827,14 @@ action_store(struct mddev *mddev, const char
> *page, size_t len)
>                          clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>                      mddev_lock(mddev) == 0) {
> +                       mddev_suspend(mddev);
>                          if (work_pending(&mddev->del_work))
>                                  flush_workqueue(md_misc_wq);
>                          if (mddev->sync_thread) {
>                                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                                  md_reap_sync_thread(mddev);
>                          }
> +                       mddev_resume(mddev);
>                          mddev_unlock(mddev);
>                  }
>          } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

The action actually means sync action which are for internal IO instead
of external IO, I suppose the semantic is different with above change.

And there are lots of sync/locking actions in suspend path, I am not
sure it will cause other locking issues, you may need to investigate it.

Thanks,
Guoqing
Xiao Ni June 6, 2022, 9:36 a.m. UTC | #7
On Mon, Jun 6, 2022 at 5:00 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 6/6/22 4:39 PM, Xiao Ni wrote:
> > On Mon, Jun 6, 2022 at 11:08 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
> >>
> >>
> >>>> +    }
> >>> Maybe instead of the ugly boolean argument we could pull
> >>> md_unregister_thread() into all the callers and explicitly unlock in the
> >>> single call site that needs it?
> >> Sounds good, let me revert previous commit then pull it.
> >>
> > Hi all
> >
> > Now md_reap_sync_thread is called by __md_stop_writes, action_store
> > and md_check_recovery.
> > If we move md_unregister_thread out of md_reap_sync_thread and unlock
> > reconfig_mutex before
> > calling md_unregister_thread, it means we break the atomic action for
> > these three functions mentioned
> > above.
>
> No, only action_store need to be changed, other callers keep the original
> behavior.

Hi Guoqing

Thanks for pointing out this.

>
> >   Not sure if it can introduce other problems, especially for the
> > change of md_check_recovery.
> >
> > How about suspend I/O when changing the sync action? It should not
> > hurt the performance, because
> > it only interrupts the sync action and flush the in memory stripes.
> > For this way, it needs to revert
> > 7e6ba434cc60 (md: don't unregister sync_thread with reconfig_mutex
> > held) and changes like this:
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index ce9d2845d3ac..af28cdeaf7e1 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -4827,12 +4827,14 @@ action_store(struct mddev *mddev, const char
> > *page, size_t len)
> >                          clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >                      mddev_lock(mddev) == 0) {
> > +                       mddev_suspend(mddev);
> >                          if (work_pending(&mddev->del_work))
> >                                  flush_workqueue(md_misc_wq);
> >                          if (mddev->sync_thread) {
> >                                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >                                  md_reap_sync_thread(mddev);
> >                          }
> > +                       mddev_resume(mddev);
> >                          mddev_unlock(mddev);
> >                  }
> >          } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>
> The action actually means sync action which are for internal IO instead
> of external IO, I suppose the semantic is different with above change.

The original problem should be i/o happen when interrupting the sync thread?
So the external i/o calls md_write_start and sets MD_SB_CHANGE_PENDING.
Then raid5d->md_check_recovery can't update superblock and handle internal
stripes. So the `echo idle` action is stuck.

>
> And there are lots of sync/locking actions in suspend path, I am not
> sure it will cause other locking issues, you may need to investigate it.

Yes, mddev_supsend needs those sync methods to keep the raid in a quiescent
state. First, it needs to get reconfig_mutex before calling mddev_supsend.
So it can avoid racing with all the actions that want to change the raid. Then
it waits mddev->active_io to 0 which means all submit bio processes stop
submitting io. Then it waits until all internal i/os finish by
pers->quiesce. Last it waits
until the superblock is updated.

From the logic, it looks safe. By the way, the patch 35bfc52187f
(md: allow metadata update while suspending) which introduces
MD_UPDATING_SB and MD_UPDATING_SB fixes the similar deadlock
problem. So in this problem, it looks like mddev_suspend is a good choice.

As you mentioned, it needs to consider and do tests more because of
the complex sync/locking actions.

Best Regards
Xiao
Guoqing Jiang June 6, 2022, 9:55 a.m. UTC | #8
On 6/6/22 5:36 PM, Xiao Ni wrote:
>>> @@ -4827,12 +4827,14 @@ action_store(struct mddev *mddev, const char
>>> *page, size_t len)
>>>                           clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>                   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>                       mddev_lock(mddev) == 0) {
>>> +                       mddev_suspend(mddev);
>>>                           if (work_pending(&mddev->del_work))
>>>                                   flush_workqueue(md_misc_wq);
>>>                           if (mddev->sync_thread) {
>>>                                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>                                   md_reap_sync_thread(mddev);
>>>                           }
>>> +                       mddev_resume(mddev);
>>>                           mddev_unlock(mddev);
>>>                   }
>>>           } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>> The action actually means sync action which are for internal IO instead
>> of external IO, I suppose the semantic is different with above change.
> The original problem should be i/o happen when interrupting the sync thread?
> So the external i/o calls md_write_start and sets MD_SB_CHANGE_PENDING.
> Then raid5d->md_check_recovery can't update superblock and handle internal
> stripes. So the `echo idle` action is stuck.

My point is action_store shouldn't disturb external IO, it should only 
deal with
sync IO and relevant stuffs as the function is for sync_action node, no?

Thanks,
Guoqing
Xiao Ni June 6, 2022, 11:54 a.m. UTC | #9
On Mon, Jun 6, 2022 at 5:55 PM Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>
>
> On 6/6/22 5:36 PM, Xiao Ni wrote:
> >>> @@ -4827,12 +4827,14 @@ action_store(struct mddev *mddev, const char
> >>> *page, size_t len)
> >>>                           clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >>>                   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> >>>                       mddev_lock(mddev) == 0) {
> >>> +                       mddev_suspend(mddev);
> >>>                           if (work_pending(&mddev->del_work))
> >>>                                   flush_workqueue(md_misc_wq);
> >>>                           if (mddev->sync_thread) {
> >>>                                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> >>>                                   md_reap_sync_thread(mddev);
> >>>                           }
> >>> +                       mddev_resume(mddev);
> >>>                           mddev_unlock(mddev);
> >>>                   }
> >>>           } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> >> The action actually means sync action which are for internal IO instead
> >> of external IO, I suppose the semantic is different with above change.
> > The original problem should be i/o happen when interrupting the sync thread?
> > So the external i/o calls md_write_start and sets MD_SB_CHANGE_PENDING.
> > Then raid5d->md_check_recovery can't update superblock and handle internal
> > stripes. So the `echo idle` action is stuck.
>
> My point is action_store shouldn't disturb external IO, it should only
> deal with
> sync IO and relevant stuffs as the function is for sync_action node, no?

Yes, it's a change that was mentioned above. It depends on which way
we want to go.
In some sysfs file store functions, it also calls mddev_suspend. For
some situations,
it needs to choose if we need to stop the i/o temporarily.

Anyway, it's only a possible method I want to discuss. It's very good
for me to dig into this problem.
Now I have a better understanding of those codes :)

Best Regards
Xiao
Logan Gunthorpe June 6, 2022, 3:48 p.m. UTC | #10
On 2022-06-05 21:29, Guoqing Jiang wrote:
> 
> 
> On 6/3/22 2:03 AM, Logan Gunthorpe wrote:
>>
>>
>> On 2022-06-02 07:45, Guoqing Jiang wrote:
>>> The 07reshape5intr test is broke because of below path.
>>>
>>>      md_reap_sync_thread
>>>              -> mddev_unlock
>>>              -> md_unregister_thread(&mddev->sync_thread)
>>>
>>> And md_check_recovery is triggered by,
>>>
>>> mddev_unlock -> md_wakeup_thread(mddev->thread)
>>>
>>> then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
>>> since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
>>> feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
>>> reshape_position can't be updated accordingly.
>>>
>>> Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
>>> with reconfig_mutex held") fixed is related with action_store path, other
>>> callers which reap sync_thread didn't need to be changed, let's
>>>
>>> 1. only unlock mddev in md_reap_sync_thread if caller is action_store,
>>>     so the parameter is renamed to reflect the change.
>>> 2. save some contexts (MD_RECOVERY_INTR and reshape_position) since they
>>>     could be changed by other processes, then restore them after get lock
>>>     again.
>>>
>>> Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held")
>>> Reported-by: Logan Gunthorpe <logang@deltatee.com>
>>> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
>>> ---
>>> I suppose the previous bug still can be fixed with the change, but it is
>>> better to verify it. Donald, could you help to test the new code?
>>>
>>> Thanks,
>>> Guoqing
>>>
>>>   drivers/md/md.c | 24 ++++++++++++++++++------
>>>   drivers/md/md.h |  2 +-
>>>   2 files changed, 19 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 5c8efef13881..3387260dd55b 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -6197,7 +6197,7 @@ static void __md_stop_writes(struct mddev *mddev)
>>>   		flush_workqueue(md_misc_wq);
>>>   	if (mddev->sync_thread) {
>>>   		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>> -		md_reap_sync_thread(mddev, true);
>>> +		md_reap_sync_thread(mddev, false);
>>>   	}
>>>   
>>>   	del_timer_sync(&mddev->safemode_timer);
>>> @@ -9303,7 +9303,7 @@ void md_check_recovery(struct mddev *mddev)
>>>   			 * ->spare_active and clear saved_raid_disk
>>>   			 */
>>>   			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>> -			md_reap_sync_thread(mddev, true);
>>> +			md_reap_sync_thread(mddev, false);
>>>   			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>>>   			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>>   			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>>> @@ -9338,7 +9338,7 @@ void md_check_recovery(struct mddev *mddev)
>>>   			goto unlock;
>>>   		}
>>>   		if (mddev->sync_thread) {
>>> -			md_reap_sync_thread(mddev, true);
>>> +			md_reap_sync_thread(mddev, false);
>>>   			goto unlock;
>>>   		}
>>>   		/* Set RUNNING before clearing NEEDED to avoid
>>> @@ -9411,18 +9411,30 @@ void md_check_recovery(struct mddev *mddev)
>>>   }
>>>   EXPORT_SYMBOL(md_check_recovery);
>>>   
>>> -void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
>>> +void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
>>>   {
>>>   	struct md_rdev *rdev;
>>>   	sector_t old_dev_sectors = mddev->dev_sectors;
>>> +	sector_t old_reshape_position;
>>>   	bool is_reshaped = false;
>>> +	bool is_interrupted = false;
>>>   
>>> -	if (reconfig_mutex_held)
>>> +	if (unlock_mddev) {
>>> +		old_reshape_position = mddev->reshape_position;
>>> +		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
>>> +			is_interrupted = true;
>>>   		mddev_unlock(mddev);
>>> +	}
>>>   	/* resync has finished, collect result */
>>>   	md_unregister_thread(&mddev->sync_thread);
>>> -	if (reconfig_mutex_held)
>>> +	if (unlock_mddev) {
>>>   		mddev_lock_nointr(mddev);
>>> +		/* restore the previous flag and position */
>>> +		mddev->reshape_position = old_reshape_position;
>>> +		if (is_interrupted)
>>> +			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>> +	}
>> Maybe instead of the ugly boolean argument we could pull
>> md_unregister_thread() into all the callers and explicitly unlock in the
>> single call site that needs it?
> 
> After move "md_unregister_thread(&mddev->sync_thread)", then we need to
> rename md_reap_sync_thread given it doesn't unregister sync_thread, any
> suggestion about the new name? md_behind_reap_sync_thread?

I don't like the "behind"... Which would be a name suggesting when the
function should be called, not what the function does.

I'd maybe go with something like md_cleanup_sync_thread()

Logan
Guoqing Jiang June 7, 2022, 1:20 a.m. UTC | #11
On 6/6/22 11:48 PM, Logan Gunthorpe wrote:
>> After move "md_unregister_thread(&mddev->sync_thread)", then we need to
>> rename md_reap_sync_thread given it doesn't unregister sync_thread, any
>> suggestion about the new name? md_behind_reap_sync_thread?
> I don't like the "behind"... Which would be a name suggesting when the
> function should be called, not what the function does.

Right, naming is hard.

> I'd maybe go with something like md_cleanup_sync_thread()

It is confusing since it doesn't related with sync_thread anymore after 
the pull.
I will keep the name unchanged for now.

Thanks,
Guoqing
diff mbox series

Patch

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5c8efef13881..3387260dd55b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6197,7 +6197,7 @@  static void __md_stop_writes(struct mddev *mddev)
 		flush_workqueue(md_misc_wq);
 	if (mddev->sync_thread) {
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-		md_reap_sync_thread(mddev, true);
+		md_reap_sync_thread(mddev, false);
 	}
 
 	del_timer_sync(&mddev->safemode_timer);
@@ -9303,7 +9303,7 @@  void md_check_recovery(struct mddev *mddev)
 			 * ->spare_active and clear saved_raid_disk
 			 */
 			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-			md_reap_sync_thread(mddev, true);
+			md_reap_sync_thread(mddev, false);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
@@ -9338,7 +9338,7 @@  void md_check_recovery(struct mddev *mddev)
 			goto unlock;
 		}
 		if (mddev->sync_thread) {
-			md_reap_sync_thread(mddev, true);
+			md_reap_sync_thread(mddev, false);
 			goto unlock;
 		}
 		/* Set RUNNING before clearing NEEDED to avoid
@@ -9411,18 +9411,30 @@  void md_check_recovery(struct mddev *mddev)
 }
 EXPORT_SYMBOL(md_check_recovery);
 
-void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
+void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev)
 {
 	struct md_rdev *rdev;
 	sector_t old_dev_sectors = mddev->dev_sectors;
+	sector_t old_reshape_position;
 	bool is_reshaped = false;
+	bool is_interrupted = false;
 
-	if (reconfig_mutex_held)
+	if (unlock_mddev) {
+		old_reshape_position = mddev->reshape_position;
+		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+			is_interrupted = true;
 		mddev_unlock(mddev);
+	}
 	/* resync has finished, collect result */
 	md_unregister_thread(&mddev->sync_thread);
-	if (reconfig_mutex_held)
+	if (unlock_mddev) {
 		mddev_lock_nointr(mddev);
+		/* restore the previous flag and position */
+		mddev->reshape_position = old_reshape_position;
+		if (is_interrupted)
+			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	}
+
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 82465a4b1588..3e3ff3b20e81 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -719,7 +719,7 @@  extern struct md_thread *md_register_thread(
 extern void md_unregister_thread(struct md_thread **threadp);
 extern void md_wakeup_thread(struct md_thread *thread);
 extern void md_check_recovery(struct mddev *mddev);
-extern void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held);
+extern void md_reap_sync_thread(struct mddev *mddev, bool unlock_mddev);
 extern int mddev_init_writes_pending(struct mddev *mddev);
 extern bool md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);