Message ID: e5e8afe2-e9a8-49a2-5ab0-958d4065c55e@redhat.com (mailing list archive)
Series: MD fixes for the LVM2 testsuite
Hi Mikulas,

On Wed, Jan 17, 2024 at 10:17 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Hi
>
> Here I'm sending MD patches that fix the LVM2 testsuite for the kernels
> 6.7 and 6.8. The testsuite was broken in the 6.6 -> 6.7 window, there are
> multiple tests that deadlock.
>
> I fixed some of the bugs. And I reverted some patches that claim to be
> fixing bugs but they break the testsuite.
>
> I'd like to ask you - please, next time when you are going to commit
> something into the MD subsystem, download the LVM2 package from
> git://sourceware.org/git/lvm2.git and run the testsuite ("./configure &&
> make && make check") to make sure that your bugfix doesn't introduce
> another bug. You can run a specific test with the "T" parameter - for
> example "make check T=shell/integrity-caching.sh"

Thanks for the report and fixes!

We will look into the fixes and process them ASAP.

We will also add the lvm2 testsuite to the test framework.

Song
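For reference, the workflow Mikulas describes amounts to the minimal sketch
below. The clone URL, make targets and example test name are taken straight
from his mail; running it as root on a disposable test VM is an assumption
about a typical setup, not part of his instructions:

    git clone git://sourceware.org/git/lvm2.git
    cd lvm2
    ./configure && make
    make check                                  # run the whole testsuite
    make check T=shell/integrity-caching.sh     # run a single test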
Hi,

On 2024/01/18 3:27, Song Liu wrote:
> Hi Mikulas,
>
> On Wed, Jan 17, 2024 at 10:17 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>> Hi
>>
>> Here I'm sending MD patches that fix the LVM2 testsuite for the kernels
>> 6.7 and 6.8. The testsuite was broken in the 6.6 -> 6.7 window, there are
>> multiple tests that deadlock.
>>
>> I fixed some of the bugs. And I reverted some patches that claim to be
>> fixing bugs but they break the testsuite.
>>
>> I'd like to ask you - please, next time when you are going to commit
>> something into the MD subsystem, download the LVM2 package from
>> git://sourceware.org/git/lvm2.git and run the testsuite ("./configure &&
>> make && make check") to make sure that your bugfix doesn't introduce
>> another bug. You can run a specific test with the "T" parameter - for
>> example "make check T=shell/integrity-caching.sh"
>
> Thanks for the report and fixes!
>
> We will look into the fixes and process them ASAP.
>
> We will also add the lvm2 testsuite to the test framework.

Yes, we absolutely must make sure that dm-raid is not broken... I'll try to
run the lvm2 testsuite and figure out the root cause.

Thanks,
Kuai

>
> Song
> .
>
Hi, Mikulas

On 2024/01/18 2:16, Mikulas Patocka wrote:
> Hi
>
> Here I'm sending MD patches that fix the LVM2 testsuite for the kernels
> 6.7 and 6.8. The testsuite was broken in the 6.6 -> 6.7 window, there are
> multiple tests that deadlock.
>
> I fixed some of the bugs. And I reverted some patches that claim to be
> fixing bugs but they break the testsuite.
>
> I'd like to ask you - please, next time when you are going to commit
> something into the MD subsystem, download the LVM2 package from
> git://sourceware.org/git/lvm2.git and run the testsuite ("./configure &&
> make && make check") to make sure that your bugfix doesn't introduce
> another bug. You can run a specific test with the "T" parameter - for
> example "make check T=shell/integrity-caching.sh"

I tried to find the broken tests by myself, but I have to ask now...

While verifying my fixes[1]: for the tests you mentioned in this patchset,

shell/integrity-caching.sh
shell/lvconvert-raid-reshape-linear_to_raid6-single-type.sh
shell/lvconvert-raid-reshape.sh

I verified in my VM that before my fixes they do hang/fail easily, and that
they are indeed related to the recent md/raid changes. Other than those,
however, I still hit some problems that I can't make progress on for now.

Are there other tests that you know are broken in the v6.6 -> v6.7 window?
And are there tests known to be broken in v6.6? I ask because I found some
tests that fail occasionally in v6.8-rc1 with my fixes, and they fail in
v6.6 as well. For example:

shell/lvchange-raid1-writemostly.sh
## ERROR: The test started dmeventd (2064) unexpectedly.

shell/select-report.sh
>>> NUMBER OF ITEMS EXPECTED: 6 vol1 vol2 abc abc orig xyz
#select-report.sh:67+ echo ' >>> NUMBER OF ITEMS FOUND: 7 ( vol1 vol2 abc abc orig snap xyz )'

And I also hit the following BUG during testing:

[12504.959682] BUG bio-296 (Not tainted): Object already free
[12504.960239] -----------------------------------------------------------------------------
[12504.960239]
[12504.961209] Allocated in mempool_alloc+0xe8/0x270 age=30 cpu=1 pid=203288
[12504.961905]  kmem_cache_alloc+0x36a/0x3b0
[12504.962324]  mempool_alloc+0xe8/0x270
[12504.962712]  bio_alloc_bioset+0x3b5/0x920
[12504.963129]  bio_alloc_clone+0x3e/0x160
[12504.963533]  alloc_io+0x3d/0x1f0
[12504.963876]  dm_submit_bio+0x12f/0xa30
[12504.964267]  __submit_bio+0x9c/0xe0
[12504.964639]  submit_bio_noacct_nocheck+0x25a/0x570
[12504.965136]  submit_bio_wait+0xc2/0x160
[12504.965535]  blkdev_issue_zeroout+0x19b/0x2e0
[12504.965991]  ext4_init_inode_table+0x246/0x560
[12504.966462]  ext4_lazyinit_thread+0x750/0xbe0
[12504.966922]  kthread+0x1b4/0x1f0

And a lockdep warning:

[ 1229.452306] ============================================
[ 1229.452838] WARNING: possible recursive locking detected
[ 1229.453344] 6.8.0-rc1+ #941 Not tainted
[ 1229.453711] --------------------------------------------
[ 1229.454242] lvm/18080 is trying to acquire lock:
[ 1229.454687] ffff888112abc1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_thin_find_block+0x9f/0x0
[ 1229.455543]
[ 1229.455543] but task is already holding lock:
[ 1229.456122] ffff8881058bf1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_pool_commit_metadata+0x0
[ 1229.456992]
[ 1229.456992] other info that might help us debug this:
[ 1229.457628]  Possible unsafe locking scenario:
[ 1229.457628]
[ 1229.458218]        CPU0
[ 1229.458469]        ----
[ 1229.458726]   lock(&pmd->root_lock);
[ 1229.459093]   lock(&pmd->root_lock);
[ 1229.459455]
[ 1229.459455]  *** DEADLOCK ***
[ 1229.459455]
[ 1229.460045]  May be due to missing lock nesting notation
[ 1229.460045]
[ 1229.460697] 3 locks held by lvm/18080:
[ 1229.461074]  #0: ffff888153306870 (&md->suspend_lock/1){+.+.}-{3:3}, at: dm_resume+0x24/0x0
[ 1229.461935]  #1: ffff8881058bf1d0 (&pmd->root_lock){++++}-{3:3}, at: dm_pool_commit_metada0
[ 1229.462857]  #2: ffff88810f3abf10 (&md->io_barrier){.+.+}-{0:0}, at: dm_get_live_table+0x50
[ 1229.463731]
[ 1229.463731]
[ 1229.463731] stack backtrace:
[ 1229.464165] CPU: 3 PID: 18080 Comm: lvm Not tainted 6.8.0-rc1+ #941
[ 1229.464780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/04
[ 1229.465618] Call Trace:
[ 1229.465884]  <TASK>
[ 1229.466110]  dump_stack_lvl+0x4a/0x80
[ 1229.466491]  __lock_acquire+0x1ad4/0x3540
[ 1229.467369]  lock_acquire+0x16a/0x400
[ 1229.470276]  down_read+0xa3/0x380
[ 1229.471877]  dm_thin_find_block+0x9f/0x1e0
[ 1229.474901]  thin_map+0x28b/0x5f0
[ 1229.476116]  __map_bio+0x237/0x260
[ 1229.476469]  dm_submit_bio+0x321/0xa30
[ 1229.478546]  __submit_bio+0x9c/0xe0
[ 1229.478913]  submit_bio_noacct_nocheck+0x25a/0x570
[ 1229.480807]  __flush_write_list+0x115/0x1a0
[ 1229.481725]  dm_bufio_write_dirty_buffers+0xb9/0x600
[ 1229.483642]  __commit_transaction+0x2f3/0x4e0
[ 1229.486185]  dm_pool_commit_metadata+0x3c/0x70
[ 1229.486636]  commit+0x8c/0x1b0
[ 1229.487757]  pool_preresume+0x235/0x550
[ 1229.489010]  dm_table_resume_targets+0xa6/0x1b0
[ 1229.489467]  dm_resume+0x120/0x210
[ 1229.489820]  dev_suspend+0x269/0x3e0
[ 1229.490187]  ctl_ioctl+0x447/0x740
[ 1229.492539]  dm_ctl_ioctl+0xe/0x20
[ 1229.492895]  __x64_sys_ioctl+0xc9/0x100
[ 1229.493284]  do_syscall_64+0x7d/0x1a0
[ 1229.493667]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 1229.494171] RIP: 0033:0x7f86667400ab
[ 1229.494537] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 8
[ 1229.496340] RSP: 002b:00007fff312bc1f8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[ 1229.497084] RAX: ffffffffffffffda RBX: 0000561323f78320 RCX: 00007f86667400ab
[ 1229.497796] RDX: 000056132506a8e0 RSI: 00000000c138fd06 RDI: 0000000000000004
[ 1229.498494] RBP: 000056132402614e R08: 0000000000000000 R09: 00000000000f4240
[ 1229.499195] R10: 0000000000000001 R11: 0000000000000206 R12: 000056132506a990
[ 1229.499901] R13: 0000000000000001 R14: 0000561325060460 R15: 000056132506a8e0
[ 1229.500615]  </TASK>
[ 1230.525934] device-mapper: thin: Data device (dm-8) discard unsupported: Disabling discard.

Currently, I'm not sure whether there are still new regressions related to
the md/raid changes. Do you have any suggestions? Please let me know what
you think; I really need some help here. :(

BTW, as Xiao Ni replied in the other thread[2], he also tested my fixes and
reported 72 failed tests in total, which is hard for me to believe. We must
dig deeper for the root cause...

[1] https://lore.kernel.org/all/20240127074754.2380890-1-yukuai1@huaweicloud.com/
[2] https://lore.kernel.org/all/CALTww2_f_orkTXPDtA4AJsbX-UmwhAb-AF_tujH4Gw3cX3ObWg@mail.gmail.com/

Thanks,
Kuai

>
> Mikulas
>
> .
>
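One way to tell the occasional failures Kuai mentions apart from new
regressions is to loop a single test on both a v6.6 and a v6.8-rc1 kernel
and compare failure counts. A rough sketch, assuming the "make check T="
invocation from Mikulas' mail; the run count of 20 and the log file names
are arbitrary choices for illustration:

    # Repeat one suspect test to estimate its failure rate on the running
    # kernel; run the same loop on a v6.6 VM and a v6.8-rc1 VM and compare.
    t=shell/lvchange-raid1-writemostly.sh
    fail=0
    for i in $(seq 1 20); do
        make check T="$t" >"run-$i.log" 2>&1 || fail=$((fail + 1))
    done
    echo "$t: $fail/20 runs failed on $(uname -r)"

Tests that fail at a similar rate on both kernels (such as the two examples
above) are more likely long-standing issues than fallout from the recent
md/raid changes.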