
[1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall

Message ID 20250125012702.18773-1-dougvj@dougvj.net (mailing list archive)
State New
Series [1/2] md/raid5: skip stripes with bad reads during reshape to avoid stall

Commit Message

Doug V Johnson Jan. 25, 2025, 1:26 a.m. UTC
While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and all I/O operations on the array hung,
creating an effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread, and I had to use magic sysrq to recover as systemd
hung as well.

I first suspected an issue with one of the underlying block devices, and
as a precaution I recovered the data in read-only mode to a new array, but
the problem turned out to be in the RAID layer, as I was able to recreate
the issue from a superblock dump in sparse files.

After poking around some more, I discovered that I had somehow propagated
the bad block list to several devices in the array such that a few blocks
were unreadable. The bad reads were reported correctly in userspace during
recovery, but it wasn't obvious at the time that they came from the bad
block list metadata, which instead confirmed my bias toward suspecting
hardware issues.

I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a GitHub repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

This patch handles bad reads during a reshape by clearing the
STRIPE_EXPANDING and STRIPE_EXPAND_READY bits, effectively skipping the
stripe, and then reports the issue in dmesg.
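
For illustration, the warning this adds would look something like the
following in dmesg (the array name and sector here are made up):

  md/raid:md127: read error during reshape at 1048576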

Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
---
 drivers/md/raid5.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Yu Kuai Jan. 26, 2025, 6:27 a.m. UTC | #1
Hi,

On 2025/01/25 9:26, Doug V Johnson wrote:
> While adding an additional drive to a raid6 array, the reshape stalled
> at about 13% complete and all I/O operations on the array hung,
> creating an effective soft lock. The kernel reported a hung task in the
> mdXX_reshape thread, and I had to use magic sysrq to recover as systemd
> hung as well.
> 
> I first suspected an issue with one of the underlying block devices, and
> as a precaution I recovered the data in read-only mode to a new array, but
> the problem turned out to be in the RAID layer, as I was able to recreate
> the issue from a superblock dump in sparse files.
> 
> After poking around some more, I discovered that I had somehow propagated
> the bad block list to several devices in the array such that a few blocks
> were unreadable. The bad reads were reported correctly in userspace during
> recovery, but it wasn't obvious at the time that they came from the bad
> block list metadata, which instead confirmed my bias toward suspecting
> hardware issues.
> 
> I was able to reproduce the issue with a minimal test case using small
> loopback devices. I put a script for this in a GitHub repository:
> 
> https://github.com/dougvj/md_badblock_reshape_stall_test
> 
> This patch handles bad reads during a reshape by clearing the
> STRIPE_EXPANDING and STRIPE_EXPAND_READY bits, effectively skipping the
> stripe, and then reports the issue in dmesg.
> 
> Signed-off-by: Doug V Johnson <dougvj@dougvj.net>
> ---
>   drivers/md/raid5.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5c79429acc64..0ae9ac695d8e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4987,6 +4987,14 @@ static void handle_stripe(struct stripe_head *sh)
>   			handle_failed_stripe(conf, sh, &s, disks);
>   		if (s.syncing + s.replacing)
>   			handle_failed_sync(conf, sh, &s);
> +		if (test_bit(STRIPE_EXPANDING, &sh->state)) {
> +			pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu",
> +					    mdname(conf->mddev),
> +					    (unsigned long)sh->sector);
> +			/* Abort the current stripe */
> +			clear_bit(STRIPE_EXPANDING, &sh->state);
> +			clear_bit(STRIPE_EXPAND_READY, &sh->state);
> +		}
>   	}

Thanks for the patch. However, consider this example:

Before the reshape, a three-disk raid5:
rdev0           rdev1           rdev2
chunk0          chunk1          P0
chunk2(BB)      P1(BB)          chunk3
P2              chunk4          chunk5
chunk6          chunk7          P3
chunk8          P4              chunk9
P5              chunk10         chunk11

After the reshape, a four-disk raid5:
rdev0           rdev1           rdev2           rdev3
chunk0          chunk1          chunk2(lost)    P0
chunk3(BB)      chunk4(BB)      P1              chunk5
chunk6          P2              chunk7          chunk8
P3              chunk9          chunk10         chunk11

In this case, before the reshape, the data in chunk2 is already lost
(both chunk2 and its parity P1 sit on bad blocks). After the reshape,
chunk2 is lost as well, so bad blocks need to be set on rdev2 at its new
location to prevent the user from reading wrong data. Meanwhile, chunk3
and chunk4 are also lost, because their new locations land on the
existing bad blocks on rdev0 and rdev1.

So perhaps just aborting the reshape would make more sense to me,
because otherwise the user will lose even more data.

Thanks,
Kuai

>   
>   	/* Now we check to see if any write operations have recently
>
Doug V Johnson Jan. 27, 2025, 8:51 a.m. UTC | #2
> So perhaps just aborting the reshape would make more sense to me,
> because otherwise the user will lose even more data.

Ah yes, I agree that does make more sense. I appreciate the prompt 
feedback and explanation. I will submit a v2 soon.
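
As a rough sketch of the direction I have in mind for v2 (same hook
point in handle_stripe(); the exact placement and bookkeeping may well
change), something along these lines:

		if (test_bit(STRIPE_EXPANDING, &sh->state)) {
			pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu, aborting reshape\n",
					    mdname(conf->mddev),
					    (unsigned long)sh->sector);
			/*
			 * Sketch only, not the submitted v2: interrupt the
			 * reshape instead of silently skipping the stripe, so
			 * no further data is relocated past an unreadable
			 * region. The new locations of any lost chunks would
			 * presumably also need bad blocks recorded (e.g. via
			 * rdev_set_badblocks()) so reads there fail instead
			 * of returning stale data.
			 */
			set_bit(MD_RECOVERY_INTR, &conf->mddev->recovery);
			md_wakeup_thread(conf->mddev->thread);
			clear_bit(STRIPE_EXPANDING, &sh->state);
			clear_bit(STRIPE_EXPAND_READY, &sh->state);
		}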

Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..0ae9ac695d8e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4987,6 +4987,14 @@  static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state)) {
+			pr_warn_ratelimited("md/raid:%s: read error during reshape at %lu",
+					    mdname(conf->mddev),
+					    (unsigned long)sh->sector);
+			/* Abort the current stripe */
+			clear_bit(STRIPE_EXPANDING, &sh->state);
+			clear_bit(STRIPE_EXPAND_READY, &sh->state);
+		}
 	}
 
 	/* Now we check to see if any write operations have recently