Message ID | 20231108182216.73611-2-junxiao.bi@oracle.com (mailing list archive) |
---|---|
State | Accepted, archived |
Delegated to: | Song Liu |
Headers | show |
Series | [V3,1/2] md: bypass block throttle for superblock update | expand |
在 2023/11/09 2:22, Junxiao Bi 写道: > This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. > > That commit introduced the following race and can cause system hung. > > md_write_start: raid5d: > // mddev->in_sync == 1 > set "MD_SB_CHANGE_PENDING" > // running before md_write_start wakeup it > waiting "MD_SB_CHANGE_PENDING" cleared > >>>>>>>>> hung > wakeup mddev->thread > ... > waiting "MD_SB_CHANGE_PENDING" cleared > >>>> hung, raid5d should clear this flag > but get hung by same flag. > > The issue reverted commit fixing is fixed by last patch in a new way. > > Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") > Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> LGTM Reviewed-by: Yu Kuai <yukuai3@huawei.com> > --- > drivers/md/raid5.c | 12 ------------ > 1 file changed, 12 deletions(-) > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index dc031d42f53b..fcc8a44dd4fd 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -36,7 +36,6 @@ > */ > > #include <linux/blkdev.h> > -#include <linux/delay.h> > #include <linux/kthread.h> > #include <linux/raid/pq.h> > #include <linux/async_tx.h> > @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread) > spin_unlock_irq(&conf->device_lock); > md_check_recovery(mddev); > spin_lock_irq(&conf->device_lock); > - > - /* > - * Waiting on MD_SB_CHANGE_PENDING below may deadlock > - * seeing md_check_recovery() is needed to clear > - * the flag when using mdmon. > - */ > - continue; > } > - > - wait_event_lock_irq(mddev->sb_wait, > - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags), > - conf->device_lock); > } > pr_debug("%d stripes handled\n", handled); > >
Hi Song, These two patches get reviewed-by from Kuai and Logan, I didn't see them in your tree yet, would you like review it and pick it up? Thanks, Junxiao. On 11/8/23 10:28 PM, Yu Kuai wrote: > 在 2023/11/09 2:22, Junxiao Bi 写道: >> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. >> >> That commit introduced the following race and can cause system hung. >> >> md_write_start: raid5d: >> // mddev->in_sync == 1 >> set "MD_SB_CHANGE_PENDING" >> // running before md_write_start wakeup it >> waiting "MD_SB_CHANGE_PENDING" cleared >> >>>>>>>>> hung >> wakeup mddev->thread >> ... >> waiting "MD_SB_CHANGE_PENDING" cleared >> >>>> hung, raid5d should clear this flag >> but get hung by same flag. >> >> The issue reverted commit fixing is fixed by last patch in a new way. >> >> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in >> raid5d") >> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> > > LGTM > Reviewed-by: Yu Kuai <yukuai3@huawei.com> > >> --- >> drivers/md/raid5.c | 12 ------------ >> 1 file changed, 12 deletions(-) >> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >> index dc031d42f53b..fcc8a44dd4fd 100644 >> --- a/drivers/md/raid5.c >> +++ b/drivers/md/raid5.c >> @@ -36,7 +36,6 @@ >> */ >> #include <linux/blkdev.h> >> -#include <linux/delay.h> >> #include <linux/kthread.h> >> #include <linux/raid/pq.h> >> #include <linux/async_tx.h> >> @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread) >> spin_unlock_irq(&conf->device_lock); >> md_check_recovery(mddev); >> spin_lock_irq(&conf->device_lock); >> - >> - /* >> - * Waiting on MD_SB_CHANGE_PENDING below may deadlock >> - * seeing md_check_recovery() is needed to clear >> - * the flag when using mdmon. >> - */ >> - continue; >> } >> - >> - wait_event_lock_irq(mddev->sb_wait, >> - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags), >> - conf->device_lock); >> } >> pr_debug("%d stripes handled\n", handled); >> >
On Tue, Nov 14, 2023 at 5:00 PM <junxiao.bi@oracle.com> wrote: > > Hi Song, > > These two patches get reviewed-by from Kuai and Logan, I didn't see them > in your tree yet, would you like review it and pick it up? Sorry for the delay. I was traveling for LPC. I will look into these soon. Thanks, Song
On 11/17/23 5:14 PM, Song Liu wrote: > On Tue, Nov 14, 2023 at 5:00 PM <junxiao.bi@oracle.com> wrote: >> Hi Song, >> >> These two patches get reviewed-by from Kuai and Logan, I didn't see them >> in your tree yet, would you like review it and pick it up? > Sorry for the delay. I was traveling for LPC. I will look into these soon. No problem. Really appreciate your help. Thanks, Junxiao. > > Thanks, > Song
On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote: > > This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. > > That commit introduced the following race and can cause system hung. > > md_write_start: raid5d: > // mddev->in_sync == 1 > set "MD_SB_CHANGE_PENDING" > // running before md_write_start wakeup it > waiting "MD_SB_CHANGE_PENDING" cleared > >>>>>>>>> hung > wakeup mddev->thread > ... > waiting "MD_SB_CHANGE_PENDING" cleared > >>>> hung, raid5d should clear this flag > but get hung by same flag. > > The issue reverted commit fixing is fixed by last patch in a new way. > > Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") > Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> The set looks good to me. Thanks! Quick question: from the earlier thread, the issue was observed in production. Have you reproduced the issue and thus verified the fix works as expected? Thanks, Song
On 11/24/23 9:29 AM, Song Liu wrote: > On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote: >> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. >> >> That commit introduced the following race and can cause system hung. >> >> md_write_start: raid5d: >> // mddev->in_sync == 1 >> set "MD_SB_CHANGE_PENDING" >> // running before md_write_start wakeup it >> waiting "MD_SB_CHANGE_PENDING" cleared >> >>>>>>>>> hung >> wakeup mddev->thread >> ... >> waiting "MD_SB_CHANGE_PENDING" cleared >> >>>> hung, raid5d should clear this flag >> but get hung by same flag. >> >> The issue reverted commit fixing is fixed by last patch in a new way. >> >> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") >> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> > The set looks good to me. Thanks! Thanks for the review. > > Quick question: from the earlier thread, the issue was observed in > production. Have you reproduced the issue and thus verified the fix > works as expected? I didn't try reproducing this since the system hung on the code where the bad commit added, after revert it, this issue will not reproduce any more. Thanks, Junxiao. > > Thanks, > Song
On Fri, Nov 24, 2023 at 3:18 PM <junxiao.bi@oracle.com> wrote: > > On 11/24/23 9:29 AM, Song Liu wrote: > > > On Wed, Nov 8, 2023 at 10:22 AM Junxiao Bi <junxiao.bi@oracle.com> wrote: > >> This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. > >> > >> That commit introduced the following race and can cause system hung. > >> > >> md_write_start: raid5d: > >> // mddev->in_sync == 1 > >> set "MD_SB_CHANGE_PENDING" > >> // running before md_write_start wakeup it > >> waiting "MD_SB_CHANGE_PENDING" cleared > >> >>>>>>>>> hung > >> wakeup mddev->thread > >> ... > >> waiting "MD_SB_CHANGE_PENDING" cleared > >> >>>> hung, raid5d should clear this flag > >> but get hung by same flag. > >> > >> The issue reverted commit fixing is fixed by last patch in a new way. > >> > >> Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") > >> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> > > The set looks good to me. Thanks! > Thanks for the review. > > > > Quick question: from the earlier thread, the issue was observed in > > production. Have you reproduced the issue and thus verified the fix > > works as expected? > > I didn't try reproducing this since the system hung on the code where > the bad commit added, after revert it, this issue will not reproduce any > more. Thanks for the information. I applied the set to md-next. Song
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index dc031d42f53b..fcc8a44dd4fd 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -36,7 +36,6 @@ */ #include <linux/blkdev.h> -#include <linux/delay.h> #include <linux/kthread.h> #include <linux/raid/pq.h> #include <linux/async_tx.h> @@ -6820,18 +6819,7 @@ static void raid5d(struct md_thread *thread) spin_unlock_irq(&conf->device_lock); md_check_recovery(mddev); spin_lock_irq(&conf->device_lock); - - /* - * Waiting on MD_SB_CHANGE_PENDING below may deadlock - * seeing md_check_recovery() is needed to clear - * the flag when using mdmon. - */ - continue; } - - wait_event_lock_irq(mddev->sb_wait, - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags), - conf->device_lock); } pr_debug("%d stripes handled\n", handled);