
[REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag

Message ID 20240706143038.7253-1-mat.jonczyk@o2.pl (mailing list archive)
State Superseded
Headers show
Series [REGRESSION] Cannot start degraded RAID1 array with device with write-mostly flag | expand

Checks

Context Check Description
mdraidci/vmtest-md-6_11-PR success PR summary
mdraidci/vmtest-md-6_11-VM_Test-0 success Logs for build-kernel

Commit Message

Mateusz Jończyk July 6, 2024, 2:30 p.m. UTC
Hello,

Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
device has the write-mostly flag set. Linux 6.8.0 works fine, as does
6.1.96.

#regzbot introduced: v6.8.0..v6.9.0

On my laptop, I used to have two RAID1 arrays on top of NVMe and SATA
SSD drives: /dev/md0 for /boot, /dev/md1 for the remaining data. For
performance, I had marked the RAID component devices on the SATA SSD
drive as write-mostly, which "means that the 'md' driver will avoid reading
from these devices if at all possible".

Recently, the NVMe drive started failing, so I removed it from the arrays:

    $ cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb5[1](W)
          471727104 blocks super 1.2 [2/1] [_U]
          bitmap: 4/4 pages [16KB], 65536KB chunk

    md0 : active raid1 sdb4[1](W)
          2094080 blocks super 1.2 [2/1] [_U]
         
    unused devices: <none>

and wiped it. Since then, Linux 6.9+ fails to assemble the arrays on startup
with the following stacktraces in dmesg:

    md/raid1:md0: active with 1 out of 2 mirrors
    md0: detected capacity change from 0 to 4188160
    ------------[ cut here ]------------
    kernel BUG at block/bio.c:1659!
    Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 174 Comm: mdadm Not tainted 6.10.0-rc6unif33 #493
    Hardware name: HP HP Laptop 17-by0xxx/84CA, BIOS F.72 05/31/2024
    RIP: 0010:bio_split+0x96/0xb0
    Code: df ff ff 41 f6 45 14 80 74 08 66 41 81 4c 24 14 80 00 5b 4c 89 e0 41 5c 41 5d 5d c3 cc cc cc cc 41 c7 45 28 00 00 00 00 eb d9 <0f> 0b 0f 0b 0f 0b 45 31 e4 eb dd 66 66 2e 0f 1f 84 00 00 00 00 00
    RSP: 0018:ffffa7588041b330 EFLAGS: 00010246
    RAX: 0000000000000008 RBX: 0000000000000001 RCX: ffff9f22cb08f938
    RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9f22c1199400
    RBP: ffffa7588041b420 R08: ffff9f22c3587b30 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000008 R12: ffff9f22cc9da700
    R13: ffff9f22cb08f800 R14: ffff9f22c6a35fa0 R15: ffff9f22c1846800
    FS:  00007f5f88404740(0000) GS:ffff9f2621e00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000056299cb95000 CR3: 000000010c82a002 CR4: 00000000003706f0
    Call Trace:
     <TASK>
     ? show_regs+0x67/0x70
     ? __die_body+0x20/0x70
     ? die+0x3e/0x60
     ? do_trap+0xd6/0xf0
     ? do_error_trap+0x71/0x90
     ? bio_split+0x96/0xb0
     ? exc_invalid_op+0x53/0x70
     ? bio_split+0x96/0xb0
     ? asm_exc_invalid_op+0x1b/0x20
     ? bio_split+0x96/0xb0
     ? raid1_read_request+0x890/0xd20
     ? __call_rcu_common.constprop.0+0x97/0x260
     raid1_make_request+0x81/0xce0
     ? __get_random_u32_below+0x17/0x70    // is not present in other stacktraces
     ? new_slab+0x2b3/0x580            // is not present in other stacktraces
     md_handle_request+0x77/0x210
     md_submit_bio+0x62/0xa0
     __submit_bio+0x17b/0x230
     submit_bio_noacct_nocheck+0x18e/0x3c0
     submit_bio_noacct+0x244/0x670
     submit_bio+0xac/0xe0
     submit_bh_wbc+0x168/0x190
     block_read_full_folio+0x203/0x420
     ? __mod_memcg_lruvec_state+0xcd/0x210
     ? __pfx_blkdev_get_block+0x10/0x10
     ? __lruvec_stat_mod_folio+0x63/0xb0
     ? __filemap_add_folio+0x24d/0x450
     ? __pfx_blkdev_read_folio+0x10/0x10
     blkdev_read_folio+0x18/0x20
     filemap_read_folio+0x45/0x290
     ? __pfx_workingset_update_node+0x10/0x10
     ? folio_add_lru+0x5a/0x80
     ? filemap_add_folio+0xba/0xe0
     ? __pfx_blkdev_read_folio+0x10/0x10
     do_read_cache_folio+0x10a/0x3c0
     read_cache_folio+0x12/0x20
     read_part_sector+0x36/0xc0
     read_lba+0x96/0x1b0
     find_valid_gpt+0xe8/0x770
     ? get_page_from_freelist+0x615/0x12e0
     ? __pfx_efi_partition+0x10/0x10
     efi_partition+0x80/0x4e0
     ? vsnprintf+0x297/0x4f0
     ? snprintf+0x49/0x70
     ? __pfx_efi_partition+0x10/0x10
     bdev_disk_changed+0x270/0x760
     blkdev_get_whole+0x8b/0xb0
     bdev_open+0x2bd/0x390
     ? __pfx_blkdev_open+0x10/0x10
     blkdev_open+0x8f/0xc0
     do_dentry_open+0x174/0x570
     vfs_open+0x2b/0x40
     path_openat+0xb20/0x1150
     do_filp_open+0xa8/0x120
     ? alloc_fd+0xc2/0x180
     do_sys_openat2+0x250/0x2a0
     do_sys_open+0x46/0x80
     __x64_sys_openat+0x20/0x30
     x64_sys_call+0xe55/0x20d0
     do_syscall_64+0x47/0x110
     entry_SYSCALL_64_after_hwframe+0x76/0x7e
    RIP: 0033:0x7f5f88514f5b
    Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
    RSP: 002b:00007ffd8839cbe0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
    RAX: ffffffffffffffda RBX: 00007ffd8839dbe0 RCX: 00007f5f88514f5b
    RDX: 0000000000004000 RSI: 00007ffd8839cc70 RDI: 00000000ffffff9c
    RBP: 00007ffd8839cc70 R08: 0000000000000000 R09: 00007ffd8839cae0
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000004000
    R13: 0000000000004000 R14: 00007ffd8839cc68 R15: 000055942d9dabe0
     </TASK>
    Modules linked in: crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 drm_buddy r8169 i2c_algo_bit psmouse i2c_i801 drm_display_helper i2c_mux video i2c_smbus xhci_pci realtek cec xhci_pci_renesas i2c_hid_acpi i2c_hid hid wmi aesni_intel crypto_simd cryptd
    ---[ end trace 0000000000000000 ]---

The above stacktrace was logged twice (once for each array).

The line
    kernel BUG at block/bio.c:1659!
corresponds to
    BUG_ON(sectors <= 0);
in bio_split().
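
For reference, the top of bio_split() looks roughly like this (an abridged
sketch from my reading of the 6.10-era block/bio.c; exact details and line
numbers vary between kernel versions):

    struct bio *bio_split(struct bio *bio, int sectors,
                          gfp_t gfp, struct bio_set *bs)
    {
            /* Splitting off zero or negative sectors is a caller bug ... */
            BUG_ON(sectors <= 0);
            /* ... and so is splitting off the whole bio or more. */
            BUG_ON(sectors >= bio_sectors(bio));

            /* ... allocation of the split bio and the actual split follow ... */
    }

So the caller visible right above it in the stack trace, raid1_read_request(),
must have asked for a split of zero (or negative) sectors.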

After some investigation, I have determined that the bug is most likely in
choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
before returning early. A test patch (below) seems to fix this issue (Linux
boots and appears to be working correctly with it, but I haven't done any more
advanced experiments yet).
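
To make that concrete, here is an abridged sketch of the early-return path,
based on the hunk below and my reading of the 6.10-era drivers/md/raid1.c
(elided parts are marked with comments; treat it as an illustration, not the
exact upstream code):

    static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
                                int *max_sectors)
    {
            sector_t this_sector = r1_bio->sector;
            int disk;

            for (disk = 0; disk < conf->raid_disks * 2; disk++) {
                    struct md_rdev *rdev = conf->mirrors[disk].rdev;
                    int len, read_len;

                    /* ... skip devices that are missing, faulty or not write-mostly ... */

                    len = r1_bio->sectors;
                    read_len = raid1_check_read_range(rdev, this_sector, &len);
                    if (read_len == r1_bio->sectors) {
                            /*
                             * The whole request is readable from this slow
                             * device, but the function returns here without
                             * ever storing anything through *max_sectors.
                             */
                            update_read_sectors(conf, disk, this_sector, read_len);
                            return disk;
                    }
                    /* ... otherwise remember the best partial candidate ... */
            }

            /* ... the partial-candidate fallback path does set *max_sectors ... */
            return -1;
    }

As far as I can tell, raid1_read_request() obtains max_sectors from
read_balance(), which falls back to choose_slow_rdev() when only write-mostly
devices remain, and then passes that value to bio_split(). With *max_sectors
never written, whatever happened to be on the caller's stack reaches the
BUG_ON() above.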

This points to
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
as the most likely culprit. However, I was running into other bugs in mdadm when
trying to test this commit directly.

Distribution: Ubuntu 20.04, hardware: an HP 17-by0001nw laptop.

Greetings,

Mateusz

---------------------------------------------------

From e19348bc62eea385459ca1df67bd7c7c2afd7538 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mateusz=20Jo=C5=84czyk?= <mat.jonczyk@o2.pl>
Date: Sat, 6 Jul 2024 11:21:03 +0200
Subject: [RFC PATCH] md/raid1: fill in max_sectors

Not yet fully tested or carefully investigated.

Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>

---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Mateusz Jończyk July 7, 2024, 7:50 p.m. UTC | #1
On 6.07.2024 at 16:30, Mateusz Jończyk wrote:
> Hello,
>
> Linux 6.9+ cannot start a degraded RAID1 array when the only remaining
> device has the write-mostly flag set. Linux 6.8.0 works fine, as does
> 6.1.96.
[snip]
> After some investigation, I have determined that the bug is most likely in
> choose_slow_rdev() in drivers/md/raid1.c, which doesn't set max_sectors
> before returning early. A test patch (below) seems to fix this issue (Linux
> boots and appears to be working correctly with it, but I haven't done any more
> advanced experiments yet).
>
> This points to
> commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
> as the most likely culprit. However, I was running into other bugs in mdadm when
> trying to test this commit directly.
>
> Distribution: Ubuntu 20.04, hardware: an HP 17-by0001nw laptop.

I have been testing this patch carefully:

1. I have been reliably getting deadlocks when adding / removing devices
on an array that contains a component with the write-mostly flag set,
while the array was loaded with fsstress. When the array was idle,
no such deadlocks happened. This also occurred on Linux 6.8.0,
but not on 6.1.97-rc1, so it is likely an independent regression.

2. When adding a device (/dev/sda1) to the array, I once got the following warnings in dmesg on a patched 6.10-rc6:

        [ 8253.337816] md: could not open device unknown-block(8,1).
        [ 8253.337832] md: md_import_device returned -16
        [ 8253.338152] md: could not open device unknown-block(8,1).
        [ 8253.338169] md: md_import_device returned -16
        [ 8253.674751] md: recovery of RAID array md2

(/dev/sda1 has device major/minor numbers = 8,1). This may be caused by some interaction with udev, though.
I have also seen this on Linux 6.8.

Additionally, on an unpatched 6.1.97-rc1 (which was handy for testing), I got a deadlock
when removing a bitmap from such an array while it was loaded with fsstress.

I'll file independent reports, but wanted to give a heads-up.

Greetings,

Mateusz
Yu Kuai July 8, 2024, 1:54 a.m. UTC | #2
Hi,

On 2024/07/06 22:30, Mateusz Jończyk wrote:
> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>
> Not yet fully tested or carefully investigated.
>
> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>
> ---
>   drivers/md/raid1.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7b8a71ca66dd..82f70a4ce6ed 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>   		len = r1_bio->sectors;
>   		read_len = raid1_check_read_range(rdev, this_sector, &len);
>   		if (read_len == r1_bio->sectors) {
> +			*max_sectors = read_len;
>   			update_read_sectors(conf, disk, this_sector, read_len);
>   			return disk;
>   		}

This looks correct, can you give it a test and cook a patch?

Thanks,
Kuai
Mateusz Jończyk July 8, 2024, 8:09 p.m. UTC | #3
On 8.07.2024 at 03:54, Yu Kuai wrote:
> Hi,
>
> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>
>> Not yet fully tested or carefully investigated.
>>
>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>
>> ---
>>   drivers/md/raid1.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>           len = r1_bio->sectors;
>>           read_len = raid1_check_read_range(rdev, this_sector, &len);
>>           if (read_len == r1_bio->sectors) {
>> +            *max_sectors = read_len;
>>               update_read_sectors(conf, disk, this_sector, read_len);
>>               return disk;
>>           }
>
> This looks correct, can you give it a test and cook a patch?
>
> Thanks,
> Kuai
Hello,

Yes, I'm working on it. Patch description is nearly done.
Kernel with this patch works well with normal usage and
fsstress, except when modifying the array, as I have written
in my previous email. Will test some more.

I'm feeling nervous working on such sensitive code as md, though.
I'm not an experienced kernel dev.

Greetings,

Mateusz
Yu Kuai July 9, 2024, 2:57 a.m. UTC | #4
Hi,

On 2024/07/09 4:09, Mateusz Jończyk wrote:
> On 8.07.2024 at 03:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/07/06 22:30, Mateusz Jończyk wrote:
>>> Subject: [RFC PATCH] md/raid1: fill in max_sectors
>>>
>>> Not yet fully tested or carefully investigated.
>>>
>>> Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
>>>
>>> ---
>>>    drivers/md/raid1.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 7b8a71ca66dd..82f70a4ce6ed 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
>>>            len = r1_bio->sectors;
>>>            read_len = raid1_check_read_range(rdev, this_sector, &len);
>>>            if (read_len == r1_bio->sectors) {
>>> +            *max_sectors = read_len;
>>>                update_read_sectors(conf, disk, this_sector, read_len);
>>>                return disk;
>>>            }
>>
>> This looks correct, can you give it a test and cook a patch?
>>
>> Thanks,
>> Kuai
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.

Please run mdadm tests at least. And we may need to add a new test.

https://kernel.googlesource.com/pub/scm/utils/mdadm/mdadm.git

./test --dev=loop

Thanks,
Kuai

> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz
> 
Mariusz Tkaczyk July 9, 2024, 6:49 a.m. UTC | #5
On Mon, 8 Jul 2024 22:09:51 +0200
Mateusz Jończyk <mat.jonczyk@o2.pl> wrote:
> > This looks correct, can you give it a test and cook a patch?
> >
> > Thanks,
> > Kuai  
> Hello,
> 
> Yes, I'm working on it. Patch description is nearly done.
> Kernel with this patch works well with normal usage and
> fsstress, except when modifying the array, as I have written
> in my previous email. Will test some more.
> 
> I'm feeling nervous working on such sensitive code as md, though.
> I'm not an experienced kernel dev.
> 
> Greetings,
> 
> Mateusz
> 
> 

Hi Mateusz,
If there is something I can help with, feel free to ask (even in Polish).
You can reach me at the address this mail was sent from, or at mariusz.tkaczyk@intel.com.

I cannot answer you directly (this is the first problem you have to solve):
The following message to <mat.jonczyk@o2.pl> was undeliverable.
The reason for the problem:
5.1.0 - Unknown address error 554-'sorry, refused mailfrom because return MX
does not exist'

Please consider using a different mail provider (as far as I know, Gmail works well).

Thanks,
Mariusz

Patch

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@  static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}