
[v2,1/2] md/raid5: freeze reshape when encountering a bad read

Message ID 20250127090049.7952-1-dougvj@dougvj.net (mailing list archive)
State New
Series [v2,1/2] md/raid5: freeze reshape when encountering a bad read

Commit Message

Doug V Johnson Jan. 27, 2025, 9 a.m. UTC
While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and any I/O operations on the array hung,
effectively soft-locking the system. The kernel reported a hung task in
the mdXX_reshape thread, and I had to use magic SysRq to recover because
systemd hung as well.

I first suspected an issue with one of the underlying block devices, and
as a precaution I recovered the data in read-only mode to a new array.
The problem turned out to be in the RAID layer, however, as I was able
to recreate the issue from a superblock dump in sparse files.

After some investigation I discovered that I had somehow propagated the
bad block list to several devices in the array, leaving a few blocks
unreadable. The bad reads were reported correctly to userspace during
recovery, but at the time it wasn't obvious that they came from the bad
block list metadata, which instead reinforced my suspicion of hardware
issues.

I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a GitHub repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by introducing a
handle_failed_reshape function, modeled on handle_failed_sync. The
function aborts the current stripe by clearing STRIPE_EXPANDING and
STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN bit, reverts the head
of the reshape to the last safe position, and reports the situation in
dmesg.

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Comments

Yu Kuai Feb. 8, 2025, 2:49 a.m. UTC | #1
Hi,

On 2025/01/27 17:00, Doug V Johnson wrote:
> While adding an additional drive to a raid6 array, the reshape stalled
> at about 13% complete and any I/O operations on the array hung,
> effectively soft-locking the system. The kernel reported a hung task in
> the mdXX_reshape thread, and I had to use magic SysRq to recover because
> systemd hung as well.
> 
> I first suspected an issue with one of the underlying block devices, and
> as a precaution I recovered the data in read-only mode to a new array.
> The problem turned out to be in the RAID layer, however, as I was able
> to recreate the issue from a superblock dump in sparse files.
> 
> After some investigation I discovered that I had somehow propagated the
> bad block list to several devices in the array, leaving a few blocks
> unreadable. The bad reads were reported correctly to userspace during
> recovery, but at the time it wasn't obvious that they came from the bad
> block list metadata, which instead reinforced my suspicion of hardware
> issues.
> 
> I was able to reproduce the issue with a minimal test case using small
> loopback devices. I put a script for this in a GitHub repository:
> 
> https://github.com/dougvj/md_badblock_reshape_stall_test
> 
> This patch handles bad reads during a reshape by introducing a
> handle_failed_reshape function, modeled on handle_failed_sync. The
> function aborts the current stripe by clearing STRIPE_EXPANDING and
> STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN bit, reverts the head
> of the reshape to the last safe position, and reports the situation in
> dmesg.
> 
> Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
> ---
>   drivers/md/raid5.c | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5c79429acc64..bc0b0c2540f0 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
>   	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
>   }
>   
> +static void
> +handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
> +		      struct stripe_head_state *s)
> +{
> +	// Abort the current stripe
> +	clear_bit(STRIPE_EXPANDING, &sh->state);
> +	clear_bit(STRIPE_EXPAND_READY, &sh->state);
> +	pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu, cannot progress\n",
> +			    mdname(conf->mddev),
> +			    (unsigned long)sh->sector);
pr_err() is fine here.

> +	// Freeze the reshape
> +	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
> +	// Revert progress to safe position
> +	spin_lock_irq(&conf->device_lock);
> +	conf->reshape_progress = conf->reshape_safe;
> +	spin_unlock_irq(&conf->device_lock);
> +	// report failed md sync
> +	md_done_sync(conf->mddev, 0, 0);
> +	wake_up(&conf->wait_for_reshape);

Just wondering: this will leave the array in a state where the reshape
is still in progress and forbid any other sync action (see
md_choose_sync_action() for details). It's better to avoid that, though
I'm not quite sure how yet. :(

In theory, if the user provides a new disk, this weird state can be
resolved by doing a recovery concurrently with the reshape. However,
this is too complicated and doesn't seem worth it.
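
The stuck state described above can be pictured with a minimal sketch
(illustrative only, not the actual md_choose_sync_action() logic; the
helper name is made up and it assumes kernel context with
drivers/md/md.h available):

/*
 * Illustrative sketch only -- not actual md code.  After
 * handle_failed_reshape() the array still appears to be mid-reshape
 * (reshape_position has not reached MaxSector), so no other sync
 * action can be chosen, yet MD_RECOVERY_FROZEN keeps the reshape
 * itself from running.
 */
static bool reshape_stuck_after_bad_read(struct mddev *mddev)
{
	bool mid_reshape = mddev->reshape_position != MaxSector;
	bool frozen = test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);

	return mid_reshape && frozen;
}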

Perhaps checking badblocks before starting the reshape in the first
place can solve most problems in this case? We can keep this patch just
in case new badblocks are set while the reshape is in progress.

Thanks,
Kuai

> +}
> +
>   static int want_replace(struct stripe_head *sh, int disk_idx)
>   {
>   	struct md_rdev *rdev;
> @@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
>   			handle_failed_stripe(conf, sh, &s, disks);
>   		if (s.syncing + s.replacing)
>   			handle_failed_sync(conf, sh, &s);
> +		if (test_bit(STRIPE_EXPANDING, &sh->state))
> +			handle_failed_reshape(conf, sh, &s);
>   	}
>   
>   	/* Now we check to see if any write operations have recently
>
Doug V Johnson Feb. 18, 2025, 7:20 a.m. UTC | #2
> Perhaps checking badblocks before starting the reshape in the first
> place can solve most problems in this case? We can keep this patch just
> in case new badblocks are set while the reshape is in progress.
> 
> Thanks,
> Kuai
> 
This is a really good idea and should mitigate the risk of getting into 
these weird states. I have started work on a patch that seems to work; 
I will submit it after I have done some testing.

Thanks for your feedback,
Doug
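
For illustration, the kind of pre-reshape bad block check discussed
above might look roughly like the following minimal sketch
(hypothetical, not the actual follow-up patch; the function name is
made up, it assumes it would be called from the raid5 reshape setup
path under the usual mddev locking, and it uses the per-device
badblocks count as a coarse test rather than checking the exact range
the reshape will touch):

/*
 * Hypothetical sketch of a pre-reshape bad block check -- not the real
 * follow-up patch.  Refuse to start a reshape if any active member
 * device already has entries in its bad block list, since the reshape
 * would otherwise stall once it reaches an unreadable block.
 */
static int raid5_badblocks_block_reshape(struct mddev *mddev)
{
	struct md_rdev *rdev;

	rdev_for_each(rdev, mddev) {
		if (test_bit(Faulty, &rdev->flags))
			continue;
		if (rdev->badblocks.count) {
			pr_err("md/raid:%s: %pg has recorded bad blocks, refusing to start reshape\n",
			       mdname(mddev), rdev->bdev);
			return -EIO;
		}
	}

	return 0;
}

A real version would more likely consult the existing badblocks helpers
(e.g. is_badblock()) for the specific sectors the reshape will read,
but the coarse check above conveys the idea.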

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..bc0b0c2540f0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,27 @@  handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
 }
 
+static void
+handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s)
+{
+	// Abort the current stripe
+	clear_bit(STRIPE_EXPANDING, &sh->state);
+	clear_bit(STRIPE_EXPAND_READY, &sh->state);
+	pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu, cannot progress\n",
+			    mdname(conf->mddev),
+			    (unsigned long)sh->sector);
+	// Freeze the reshape
+	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
+	// Revert progress to safe position
+	spin_lock_irq(&conf->device_lock);
+	conf->reshape_progress = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	// report failed md sync
+	md_done_sync(conf->mddev, 0, 0);
+	wake_up(&conf->wait_for_reshape);
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
@@ -4987,6 +5008,8 @@  static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state))
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently