diff mbox series

[V3,2/2] Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"

Message ID 20231108182216.73611-2-junxiao.bi@oracle.com (mailing list archive)
State Accepted, archived
Delegated to: Song Liu
Headers show
Series [V3,1/2] md: bypass block throttle for superblock update | expand

Commit Message

Junxiao Bi Nov. 8, 2023, 6:22 p.m. UTC
This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.

That commit introduced the following race and can cause system hung.

 md_write_start:             raid5d:
 // mddev->in_sync == 1
 set "MD_SB_CHANGE_PENDING"
                            // running before md_write_start wakeup it
                             waiting "MD_SB_CHANGE_PENDING" cleared
                             >>>>>>>>> hung
 wakeup mddev->thread
 ...
 waiting "MD_SB_CHANGE_PENDING" cleared
 >>>> hung, raid5d should clear this flag
 but get hung by same flag.

The issue reverted commit fixing is fixed by last patch in a new way.

Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 drivers/md/raid5.c | 12 ------------
 1 file changed, 12 deletions(-)

Comments

Yu Kuai Nov. 9, 2023, 6:28 a.m. UTC | #1
在 2023/11/09 2:22, Junxiao Bi 写道:
> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
> 
> That commit introduced the following race and can cause system hung.
> 
>   md_write_start:             raid5d:
>   // mddev->in_sync == 1
>   set "MD_SB_CHANGE_PENDING"
>                              // running before md_write_start wakeup it
>                               waiting "MD_SB_CHANGE_PENDING" cleared
>                               >>>>>>>>> hung
>   wakeup mddev->thread
>   ...
>   waiting "MD_SB_CHANGE_PENDING" cleared
>   >>>> hung, raid5d should clear this flag
>   but get hung by same flag.
> 
> The issue reverted commit fixing is fixed by last patch in a new way.
> 
> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>

> ---
>   drivers/md/raid5.c | 12 ------------
>   1 file changed, 12 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dc031d42f53b..fcc8a44dd4fd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -36,7 +36,6 @@
>    */
>   
>   #include <linux/blkdev.h>
> -#include <linux/delay.h>
>   #include <linux/kthread.h>
>   #include <linux/raid/pq.h>
>   #include <linux/async_tx.h>
> @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread)
>   			spin_unlock_irq(&conf->device_lock);
>   			md_check_recovery(mddev);
>   			spin_lock_irq(&conf->device_lock);
> -
> -			/*
> -			 * Waiting on MD_SB_CHANGE_PENDING below may deadlock
> -			 * seeing md_check_recovery() is needed to clear
> -			 * the flag when using mdmon.
> -			 */
> -			continue;
>   		}
> -
> -		wait_event_lock_irq(mddev->sb_wait,
> -			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
> -			conf->device_lock);
>   	}
>   	pr_debug("%d stripes handled\n", handled);
>   
>
Junxiao Bi Nov. 15, 2023, 1 a.m. UTC | #2
Hi Song,

These two patches get reviewed-by from Kuai and Logan, I didn't see them 
in your tree yet, would you like review it and pick it up?

Thanks,

Junxiao.

On 11/8/23 10:28 PM, Yu Kuai wrote:
> 在 2023/11/09 2:22, Junxiao Bi 写道:
>> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
>>
>> That commit introduced the following race and can cause system hung.
>>
>>   md_write_start:             raid5d:
>>   // mddev->in_sync == 1
>>   set "MD_SB_CHANGE_PENDING"
>>                              // running before md_write_start wakeup it
>>                               waiting "MD_SB_CHANGE_PENDING" cleared
>> >>>>>>>>> hung
>>   wakeup mddev->thread
>>   ...
>>   waiting "MD_SB_CHANGE_PENDING" cleared
>>   >>>> hung, raid5d should clear this flag
>>   but get hung by same flag.
>>
>> The issue reverted commit fixing is fixed by last patch in a new way.
>>
>> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>> raid5d")
>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
>
> LGTM
> Reviewed-by: Yu Kuai <yukuai3@huawei.com>
>
>> ---
>>   drivers/md/raid5.c | 12 ------------
>>   1 file changed, 12 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index dc031d42f53b..fcc8a44dd4fd 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -36,7 +36,6 @@
>>    */
>>     #include <linux/blkdev.h>
>> -#include <linux/delay.h>
>>   #include <linux/kthread.h>
>>   #include <linux/raid/pq.h>
>>   #include <linux/async_tx.h>
>> @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread)
>>               spin_unlock_irq(&conf->device_lock);
>>               md_check_recovery(mddev);
>>               spin_lock_irq(&conf->device_lock);
>> -
>> -            /*
>> -             * Waiting on MD_SB_CHANGE_PENDING below may deadlock
>> -             * seeing md_check_recovery() is needed to clear
>> -             * the flag when using mdmon.
>> -             */
>> -            continue;
>>           }
>> -
>> -        wait_event_lock_irq(mddev->sb_wait,
>> -            !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
>> -            conf->device_lock);
>>       }
>>       pr_debug("%d stripes handled\n", handled);
>>
>
Song Liu Nov. 18, 2023, 1:14 a.m. UTC | #3
On Tue, Nov 14, 2023 at 5:00 PM <junxiao.bi@oracle.com> wrote:
>
> Hi Song,
>
> These two patches get reviewed-by from Kuai and Logan, I didn't see them
> in your tree yet, would you like review it and pick it up?

Sorry for the delay. I was traveling for LPC. I will look into these soon.

Thanks,
Song
Junxiao Bi Nov. 21, 2023, 8:11 p.m. UTC | #4
On 11/17/23 5:14 PM, Song Liu wrote:

> On Tue, Nov 14, 2023 at 5:00 PM <junxiao.bi@oracle.com> wrote:
>> Hi Song,
>>
>> These two patches get reviewed-by from Kuai and Logan, I didn't see them
>> in your tree yet, would you like review it and pick it up?
> Sorry for the delay. I was traveling for LPC. I will look into these soon.

No problem. Really appreciate your help.

Thanks,

Junxiao.

>
> Thanks,
> Song
Song Liu Nov. 24, 2023, 5:29 p.m. UTC | #5
On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>
> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
>
> That commit introduced the following race and can cause system hung.
>
>  md_write_start:             raid5d:
>  // mddev->in_sync == 1
>  set "MD_SB_CHANGE_PENDING"
>                             // running before md_write_start wakeup it
>                              waiting "MD_SB_CHANGE_PENDING" cleared
>                              >>>>>>>>> hung
>  wakeup mddev->thread
>  ...
>  waiting "MD_SB_CHANGE_PENDING" cleared
>  >>>> hung, raid5d should clear this flag
>  but get hung by same flag.
>
> The issue reverted commit fixing is fixed by last patch in a new way.
>
> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

The set looks good to me. Thanks!

Quick question: from the earlier thread, the issue was observed in
production. Have you reproduced the issue and thus verified the fix
works as expected?

Thanks,
Song
Junxiao Bi Nov. 24, 2023, 11:18 p.m. UTC | #6
On 11/24/23 9:29 AM, Song Liu wrote:

> On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote:
>> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
>>
>> That commit introduced the following race and can cause system hung.
>>
>>   md_write_start:             raid5d:
>>   // mddev->in_sync == 1
>>   set "MD_SB_CHANGE_PENDING"
>>                              // running before md_write_start wakeup it
>>                               waiting "MD_SB_CHANGE_PENDING" cleared
>>                               >>>>>>>>> hung
>>   wakeup mddev->thread
>>   ...
>>   waiting "MD_SB_CHANGE_PENDING" cleared
>>   >>>> hung, raid5d should clear this flag
>>   but get hung by same flag.
>>
>> The issue reverted commit fixing is fixed by last patch in a new way.
>>
>> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
>> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
> The set looks good to me. Thanks!
Thanks for the review.
>
> Quick question: from the earlier thread, the issue was observed in
> production. Have you reproduced the issue and thus verified the fix
> works as expected?

I didn't try reproducing this since the system hung on the code where 
the bad commit added, after revert it, this issue will not reproduce any 
more.

Thanks,

Junxiao.

>
> Thanks,
> Song
Song Liu Nov. 28, 2023, 3:50 a.m. UTC | #7
On Fri, Nov 24, 2023 at 3:18 PM <junxiao.bi@oracle.com> wrote:
>
> On 11/24/23 9:29 AM, Song Liu wrote:
>
> > On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote:
> >> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74.
> >>
> >> That commit introduced the following race and can cause system hung.
> >>
> >>   md_write_start:             raid5d:
> >>   // mddev->in_sync == 1
> >>   set "MD_SB_CHANGE_PENDING"
> >>                              // running before md_write_start wakeup it
> >>                               waiting "MD_SB_CHANGE_PENDING" cleared
> >>                               >>>>>>>>> hung
> >>   wakeup mddev->thread
> >>   ...
> >>   waiting "MD_SB_CHANGE_PENDING" cleared
> >>   >>>> hung, raid5d should clear this flag
> >>   but get hung by same flag.
> >>
> >> The issue reverted commit fixing is fixed by last patch in a new way.
> >>
> >> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d")
> >> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
> > The set looks good to me. Thanks!
> Thanks for the review.
> >
> > Quick question: from the earlier thread, the issue was observed in
> > production. Have you reproduced the issue and thus verified the fix
> > works as expected?
>
> I didn't try reproducing this since the system hung on the code where
> the bad commit added, after revert it, this issue will not reproduce any
> more.

Thanks for the information. I applied the set to md-next.

Song
diff mbox series

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dc031d42f53b..fcc8a44dd4fd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -36,7 +36,6 @@ 
  */
 
 #include <linux/blkdev.h>
-#include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/async_tx.h>
@@ -6820,18 +6819,7 @@  static void raid5d(struct md_thread *thread)
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
 			spin_lock_irq(&conf->device_lock);
-
-			/*
-			 * Waiting on MD_SB_CHANGE_PENDING below may deadlock
-			 * seeing md_check_recovery() is needed to clear
-			 * the flag when using mdmon.
-			 */
-			continue;
 		}
-
-		wait_event_lock_irq(mddev->sb_wait,
-			!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
-			conf->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);