Message ID | eaeec3b6-75c2-4b65-8c50-2d37450ccdd9@kernel.dk (mailing list archive)
---|---
State | New, archived
Series | [GIT,PULL] Block updates for 6.9-rc1
The pull request you sent on Sun, 10 Mar 2024 14:30:57 -0600:
> git://git.kernel.dk/linux.git tags/for-6.9/block-20240310
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1ddeeb2a058d7b2a58ed9e820396b4ceb715d529
Thank you!
On Sun, Mar 10, 2024 at 02:30:57PM -0600, Jens Axboe wrote:
> Hi Linus,
>
> Here are the core block changes queued for the 6.9-rc1 kernel. This pull
> request contains:
>
> - MD pull requests via Song:
>       - Cleanup redundant checks, by Yu Kuai.
>       - Remove deprecated headers, by Marc Zyngier and Song Liu.
>       - Concurrency fixes, by Li Lingfeng.
>       - Memory leak fix, by Li Nan.
>       - Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
>       - Clean up and fix for md_ioctl, by Li Nan.
>       - Other small fixes, by Gui-Dong Han and Heming Zhao.
>       - MD atomic limits (Christoph)

My desktop fails to decrypt /home on boot with this:

[ 12.152489] WARNING: CPU: 0 PID: 626 at block/blk-settings.c:192 blk_validate_limits+0x1da/0x1f0
[ 12.152493] Modules linked in: amdgpu drm_ttm_helper ttm drm_exec drm_suballoc_helper amdxcp drm_buddy gpu_sched drm_display_helper btusb btintel
[ 12.152498] CPU: 0 PID: 626 Comm: systemd-cryptse Not tainted 6.8.0-00855-gd08c407f715f #25 c6b9e287c2730f07982c9e0e4ed9225e8333a29f
[ 12.152499] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 12.152500] RIP: 0010:blk_validate_limits+0x1da/0x1f0
[ 12.152502] Code: ff 0f 00 00 0f 87 2d ff ff ff 0f 0b eb 02 0f 0b ba ea ff ff ff e9 7a ff ff ff 0f 0b eb f2 0f 0b eb ee 0f 0b eb ea 0f 0b eb e6 <0f> 0b eb e2 0f 0b eb de 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
[ 12.152503] RSP: 0018:ffff9c41065b3b68 EFLAGS: 00010203
[ 12.152503] RAX: ffff9c41065b3bc0 RBX: ffff9c41065b3bc0 RCX: 00000000ffffffff
[ 12.152504] RDX: 0000000000000fff RSI: 0000000000000200 RDI: 0000000000000100
[ 12.152504] RBP: ffff8a11c0d28350 R08: 0000000000000100 R09: 0000000000000001
[ 12.152505] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9c41065b3bc0
[ 12.152505] R13: ffff8a11c0d285c8 R14: ffff9c41065b3bc0 R15: ffff8a122eedc138
[ 12.152505] FS: 00007faa969214c0(0000) GS:ffff8a18dde00000(0000) knlGS:0000000000000000
[ 12.152506] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.152506] CR2: 00007f11d8a2a910 CR3: 00000001059d0000 CR4: 0000000000350ef0
[ 12.152507] Call Trace:
[ 12.152508] <TASK>
[ 12.152508] ? __warn+0x6f/0xd0
[ 12.152511] ? blk_validate_limits+0x1da/0x1f0
[ 12.152512] ? report_bug+0x147/0x190
[ 12.152514] ? handle_bug+0x36/0x70
[ 12.152516] ? exc_invalid_op+0x17/0x60
[ 12.152516] ? asm_exc_invalid_op+0x1a/0x20
[ 12.152519] ? blk_validate_limits+0x1da/0x1f0
[ 12.152520] queue_limits_set+0x27/0x130
[ 12.152521] dm_table_set_restrictions+0x1bb/0x440
[ 12.152525] dm_setup_md_queue+0x9a/0x1e0
[ 12.152527] table_load+0x251/0x400
[ 12.152528] ? dev_suspend+0x2d0/0x2d0
[ 12.152529] ctl_ioctl+0x305/0x5e0
[ 12.152531] dm_ctl_ioctl+0x9/0x10
[ 12.152532] __x64_sys_ioctl+0x89/0xb0
[ 12.152534] do_syscall_64+0x7f/0x160
[ 12.152536] ? syscall_exit_to_user_mode+0x6b/0x1a0
[ 12.152537] ? do_syscall_64+0x8b/0x160
[ 12.152538] ? do_syscall_64+0x8b/0x160
[ 12.152538] ? do_syscall_64+0x8b/0x160
[ 12.152539] ? do_syscall_64+0x8b/0x160
[ 12.152540] ? irq_exit_rcu+0x4a/0xb0
[ 12.152541] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 12.152542] RIP: 0033:0x7faa9632319b
[ 12.152543] Code: 00 48 89 44 24 18 31 c0 c7 04 24 10 00 00 00 48 8d 44 24 60 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 12.152543] RSP: 002b:00007ffd8ac496d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 12.152544] RAX: ffffffffffffffda RBX: 0000564061a630c0 RCX: 00007faa9632319b
[ 12.152544] RDX: 0000564061a630c0 RSI: 00000000c138fd09 RDI: 0000000000000004
[ 12.152545] RBP: 00007ffd8ac498d0 R08: 0000000000000007 R09: 0000000000000006
[ 12.152545] R10: 0000000000000007 R11: 0000000000000246 R12: 00005640619fcbd0
[ 12.152545] R13: 0000000000000003 R14: 0000564061a63170 R15: 00007faa95ea4b2f
[ 12.152546] </TASK>
[ 12.152546] ---[ end trace 0000000000000000 ]---
[ 12.152547] device-mapper: ioctl: unable to set up device queue for new table.

Reverting 8e0ef4128694 ("dm: use queue_limits_set") makes it work.

Happy to provide more debugging info and/or test patches!
On 3/11/24 5:50 PM, Johannes Weiner wrote:
> On Sun, Mar 10, 2024 at 02:30:57PM -0600, Jens Axboe wrote:
>> Hi Linus,
>>
>> Here are the core block changes queued for the 6.9-rc1 kernel. This pull
>> request contains:
>>
>> [ ... ]
>
> My desktop fails to decrypt /home on boot with this:
>
> [ 12.152489] WARNING: CPU: 0 PID: 626 at block/blk-settings.c:192 blk_validate_limits+0x1da/0x1f0
>
> [ full trace quoted in the previous message snipped ]
>
> Reverting 8e0ef4128694 ("dm: use queue_limits_set") makes it work.

Gah! Sorry about that. Does:

https://lore.kernel.org/linux-block/20240309164140.719752-1-hch@lst.de/

help?
On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> My desktop fails to decrypt /home on boot with this:

Yup. Same here. I'm actually in the middle of a bisect, just got to the block pull, and was going to report that the pull was broken.

I don't have a full bisect done yet.

             Linus
On 3/11/24 5:58 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> My desktop fails to decrypt /home on boot with this:
>
> Yup. Same here. I'm actually in the middle of a bisect, just got to the
> block pull, and was going to report that the pull was broken.
>
> I don't have a full bisect done yet.

Just revert that commit for now. Christoph has a pending fix, but it wasn't reported against this pretty standard use case. Very odd that we haven't seen that yet.

Sorry about that!
On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>
> Just revert that commit for now.

Done.

I have to say, this is *not* some odd config here. It's literally a default Fedora setup with encrypted volumes.

So the fact that this got reported after I merged this shows a complete lack of testing.

It also makes me suspect that you do all your performance-testing on something that may show great performance, but isn't perhaps the best thing to actually use.

May I suggest you start looking at encrypted volumes, and do your performance work on those for a while?

Yes, it will suck to see the crypto overhead, but hey, it's also real life for real loads, so...

             Linus
On Mon, Mar 11 2024 at 8:02P -0400,
Jens Axboe <axboe@kernel.dk> wrote:

> On 3/11/24 5:58 PM, Linus Torvalds wrote:
> > On Mon, 11 Mar 2024 at 16:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>
> >> My desktop fails to decrypt /home on boot with this:
> >
> > Yup. Same here. I'm actually in the middle of a bisect, just got to the
> > block pull, and was going to report that the pull was broken.
> >
> > I don't have a full bisect done yet.
>
> Just revert that commit for now. Christoph has a pending fix, but it
> wasn't reported against this pretty standard use case.

That fix is specific to discards being larger than supported (and FYI, I did include it in the dm-6.9 pull request).

But Hannes' backtrace points to block/blk-settings.c:192, which is:

        if (WARN_ON_ONCE(lim->max_segment_size &&
                         lim->max_segment_size != UINT_MAX))
                return -EINVAL;
        lim->max_segment_size = UINT_MAX;

> Very odd that we haven't seen that yet.

It is odd. dm-6.9 is based on the block tree for 6.9, which included 8e0ef412869 ("dm: use queue_limits_set"). And I ran the full cryptsetup testsuite against my for-next branch to validate dm-crypt and dm-verity working with Tejun's BH workqueue changes.

I agree with reverting commit 8e0ef412869 -- but again, hch's fix was for something else and really can stand on its own, even independent of commit 8e0ef412869.

Mike
On Mon, Mar 11 2024 at 8:21P -0400,
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > Just revert that commit for now.
>
> Done.
>
> I have to say, this is *not* some odd config here. It's literally a
> default Fedora setup with encrypted volumes.
>
> So the fact that this got reported after I merged this shows a
> complete lack of testing.

Please see my other reply just now. This breakage is new. Obviously cryptsetup's testsuite is lacking too, because there wasn't any issue for weeks now.

> It also makes me suspect that you do all your performance-testing on
> something that may show great performance, but isn't perhaps the best
> thing to actually use.
>
> May I suggest you start looking at encrypted volumes, and do your
> performance work on those for a while?
>
> Yes, it will suck to see the crypto overhead, but hey, it's also real
> life for real loads, so...

All for Jens being made to suffer with dm-crypt, but I think we need a proper root cause of what is happening for you and Johannes ;)

Mike
On 3/11/24 6:21 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Just revert that commit for now.
>
> Done.

Thanks!

> I have to say, this is *not* some odd config here. It's literally a
> default Fedora setup with encrypted volumes.

Oh I realize that, which is why I'm so puzzled why it was broken. It's probably THE most common laptop setup.

> So the fact that this got reported after I merged this shows a
> complete lack of testing.

Mike reviewed AND tested the whole thing, so you are wrong. I see he's also responded with that. Why we had this fallout is as-of-yet not known, but we'll certainly figure it out.

> It also makes me suspect that you do all your performance-testing on
> something that may show great performance, but isn't perhaps the best
> thing to actually use.

I do that on things that I use, and what's being used in production. This includes obviously the block core and the bits that use it, and on the storage front mostly nvme these days. I tested dm scalability and performance with Mike some months ago, and md is a regular thing too. In fact some of the little tweaks in this current series benefit the distro configurations quite a bit, which is obviously what customers/users tend to run. It's all being worked up through the stack. crypt is fine and all for laptop usage, but I haven't otherwise seen it used.

> May I suggest you start looking at encrypted volumes, and do your
> performance work on those for a while?
>
> Yes, it will suck to see the crypto overhead, but hey, it's also real
> life for real loads, so...

Honestly, my knee-jerk reaction was "pfft, I don't think so", as it's not an interesting use case to me. I'd be very surprised if it wasn't all lower-hanging DM-related fruit here, and maybe it's things like a single decrypt/encrypt pipeline. Maybe that's out of necessity, maybe it's an implementation detail that could get fixed. That said, it certainly would be interesting to look at. But it's also probably something that would require rewriting it from scratch, probably as a dm-crypt-v2 or something. Maybe? Pure handwaving. What would make me do that is if I had to use it myself. Without that motivation, there's not a lot of time left over to throw at an area where I suspect performance is perhaps Good Enough that people don't complain about it, particularly because the use case is what it is.
On 3/11/24 6:28 PM, Mike Snitzer wrote:
> On Mon, Mar 11 2024 at 8:21P -0400,
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Mon, 11 Mar 2024 at 17:02, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> Just revert that commit for now.
>>
>> Done.
>>
>> I have to say, this is *not* some odd config here. It's literally a
>> default Fedora setup with encrypted volumes.
>>
>> So the fact that this got reported after I merged this shows a
>> complete lack of testing.
>
> Please see my other reply just now. This breakage is new. Obviously
> cryptsetup's testsuite is lacking too, because there wasn't any issue
> for weeks now.

Yep, agree on that; if it breaks for maybe the first few people booting it with dm-crypt, then the testing certainly should've caught it.

>> It also makes me suspect that you do all your performance-testing on
>> something that may show great performance, but isn't perhaps the best
>> thing to actually use.
>>
>> May I suggest you start looking at encrypted volumes, and do your
>> performance work on those for a while?
>>
>> Yes, it will suck to see the crypto overhead, but hey, it's also real
>> life for real loads, so...
>
> All for Jens being made to suffer with dm-crypt, but I think we need a
> proper root cause of what is happening for you and Johannes ;)

Hah, yes. Does current -git without that revert boot for you? I'm assuming you have a dm-crypt setup on your laptop :-)
On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
> All for Jens being made to suffer with dm-crypt, but I think we need a
> proper root cause of what is happening for you and Johannes ;)

I'm going to try to stay out of the cranking, but I think the reason is that the limits stacking inherits the max_segment_size, nvme has weird rules for them due to their odd PRPs, and dm-crypt sets its own max_segment_size to split out each page. The regression here is that we now actually verify that conflict.

So this happens only for dm-crypt on nvme. The fix is probably to not inherit low-level limits like max_segment_size, but I need to think about it a bit more and come up with an automated test case using, say, nvme-loop.

So for now the revert is the right thing.
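For readers following along, a minimal sketch of the conflict being described here, using the limit values that come up later in the thread. This is purely illustrative and assumes the 6.9-rc1 field and helper names; it is not a patch from the thread:

        /* Only the relevant fields are shown; everything else is left at defaults. */
        struct queue_limits dm_crypt_lim = {
                .max_segment_size   = PAGE_SIZE,                /* set by crypt_io_hints() */
        };
        struct queue_limits nvme_lim = {
                .virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1,  /* PRP constraint */
                .max_segment_size   = UINT_MAX,
        };

        /* Stacking keeps the smaller segment size and the non-zero boundary... */
        blk_stack_limits(&dm_crypt_lim, &nvme_lim, 0);

        /*
         * ...so the stacked limits now carry a virt_boundary_mask together with a
         * max_segment_size != UINT_MAX, which is exactly the combination that
         * blk_validate_limits() rejects with the WARN_ON_ONCE() seen in the trace.
         */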
On 3/11/24 7:09 PM, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
>> All for Jens being made to suffer with dm-crypt, but I think we need a
>> proper root cause of what is happening for you and Johannes ;)
>
> I'm going to try to stay out of the cranking, but I think the reason is
> that the limits stacking inherits the max_segment_size, nvme has weird
> rules for them due to their odd PRPs, and dm-crypt sets its own
> max_segment_size to split out each page. The regression here is
> that we now actually verify that conflict.
>
> So this happens only for dm-crypt on nvme. The fix is probably
> to not inherit low-level limits like max_segment_size, but I need
> to think about it a bit more and come up with an automated test case
> using, say, nvme-loop.

That does seem like the most plausible explanation, I'm just puzzled why nobody hit it before it landed in Linus's tree. I know linux-next isn't THAT well runtime tested, but still. That aside, obviously the usual test cases should've hit it. Unless that was all on non-nvme storage, which is of course possible.

> So for now the revert is the right thing.

Yup
On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
>
> That does seem like the most plausible explanation, I'm just puzzled why
> nobody hit it before it landed in Linus's tree.

Yeah, who _doesn't_ have nvme drives in their system today?

What odd hardware are people running?

             Linus
On 3/11/24 7:20 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> That does seem like the most plausible explanation, I'm just puzzled why
>> nobody hit it before it landed in Linus's tree.
>
> Yeah, who _doesn't_ have nvme drives in their system today?
>
> What odd hardware are people running?

Maybe older SATA based flash? But I haven't seen any of those in years. Or, god forbid, rotational storage?

Various NVMe devices do have different limits for things like max transfer size etc, so if it's related to that, then it is possible that nvme was used but just didn't trigger on that test case. Out of curiosity, on your box where it's broken, what does:

        grep . /sys/block/nvme0n1/queue/*

say?
On Mon, 11 Mar 2024 at 18:23, Jens Axboe <axboe@kernel.dk> wrote:
>
> > What odd hardware are people running?
>
> Maybe older SATA based flash? But I haven't seen any of those in years.
> Or, god forbid, rotational storage?

Christ. I haven't touched rotating rust in like twenty years by now.

I feel dirty just thinking about it.

> Out of curiosity, on your box where it's broken, what does:
>
>         grep . /sys/block/nvme0n1/queue/*
>
> say?

Appended.

FWIW, it's a 4TB Samsung 990 PRO (and not in a laptop, this is my Threadripper).

             Linus

---
/sys/block/nvme0n1/queue/add_random:0
/sys/block/nvme0n1/queue/chunk_sectors:0
/sys/block/nvme0n1/queue/dax:0
/sys/block/nvme0n1/queue/discard_granularity:512
/sys/block/nvme0n1/queue/discard_max_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_max_hw_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_zeroes_data:0
/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/fua:1
/sys/block/nvme0n1/queue/hw_sector_size:512
/sys/block/nvme0n1/queue/io_poll:0
/sys/block/nvme0n1/queue/io_poll_delay:-1
/sys/block/nvme0n1/queue/iostats:1
/sys/block/nvme0n1/queue/io_timeout:30000
/sys/block/nvme0n1/queue/logical_block_size:512
/sys/block/nvme0n1/queue/max_discard_segments:256
/sys/block/nvme0n1/queue/max_hw_sectors_kb:128
/sys/block/nvme0n1/queue/max_integrity_segments:1
/sys/block/nvme0n1/queue/max_sectors_kb:128
/sys/block/nvme0n1/queue/max_segments:33
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/minimum_io_size:512
/sys/block/nvme0n1/queue/nomerges:0
/sys/block/nvme0n1/queue/nr_requests:1023
/sys/block/nvme0n1/queue/nr_zones:0
/sys/block/nvme0n1/queue/optimal_io_size:0
/sys/block/nvme0n1/queue/physical_block_size:512
/sys/block/nvme0n1/queue/read_ahead_kb:128
/sys/block/nvme0n1/queue/rotational:0
/sys/block/nvme0n1/queue/rq_affinity:1
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq
/sys/block/nvme0n1/queue/stable_writes:0
/sys/block/nvme0n1/queue/virt_boundary_mask:4095
/sys/block/nvme0n1/queue/wbt_lat_usec:2000
/sys/block/nvme0n1/queue/write_cache:write back
/sys/block/nvme0n1/queue/write_same_max_bytes:0
/sys/block/nvme0n1/queue/write_zeroes_max_bytes:0
/sys/block/nvme0n1/queue/zone_append_max_bytes:0
/sys/block/nvme0n1/queue/zoned:none
/sys/block/nvme0n1/queue/zone_write_granularity:0
On 3/11/24 7:28 PM, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:23, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> What odd hardware are people running?
>>
>> Maybe older SATA based flash? But I haven't seen any of those in years.
>> Or, god forbid, rotational storage?
>
> Christ. I haven't touched rotating rust in like twenty years by now.
>
> I feel dirty just thinking about it.
>
>> Out of curiosity, on your box where it's broken, what does:
>>
>>         grep . /sys/block/nvme0n1/queue/*
>>
>> say?
>
> Appended.
>
> FWIW, it's a 4TB Samsung 990 PRO (and not in a laptop, this is my
> Threadripper).

Summary is that this is obviously a pretty normal drive, and has the 128K transfer limit that's common there. So doesn't really explain anything in that regard. The segment size is also a bit odd at 33. The only samsung I have here is a 980 pro, which has a normal 512K limit and 128 segments.

Oh well, we'll figure out what the hell went wrong, side channels are ongoing.
On Mon, Mar 11, 2024 at 06:20:01PM -0700, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 18:17, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > That does seem like the most plausible explanation, I'm just puzzled why
> > nobody hit it before it landed in Linus's tree.
>
> Yeah, who _doesn't_ have nvme drives in their system today?
>
> What odd hardware are people running?

Whatever shows up in the test VMs, or what is set up by the automated tests.
On Mon, Mar 11, 2024 at 07:23:41PM -0600, Jens Axboe wrote:
> Various NVMe devices do have different limits for things like max
> transfer size etc, so if it's related to that, then it is possible that
> nvme was used but just didn't trigger on that test case. Out of
> curiosity, on your box where it's broken, what does:

All nvme-pci setups with dm-crypt would trigger this.
On Mon, Mar 11 2024 at 9:09P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Mar 11, 2024 at 08:28:50PM -0400, Mike Snitzer wrote:
> > All for Jens being made to suffer with dm-crypt, but I think we need a
> > proper root cause of what is happening for you and Johannes ;)
>
> I'm going to try to stay out of the cranking, but I think the reason is
> that the limits stacking inherits the max_segment_size, nvme has weird
> rules for them due to their odd PRPs, and dm-crypt sets its own
> max_segment_size to split out each page. The regression here is
> that we now actually verify that conflict.
>
> So this happens only for dm-crypt on nvme. The fix is probably
> to not inherit low-level limits like max_segment_size, but I need
> to think about it a bit more and come up with an automated test case
> using, say, nvme-loop.

Yeah, I generally agree. I looked at the latest code to more fully understand why this failed.

1) dm-crypt.c:crypt_io_hints() sets:

        limits->max_segment_size = PAGE_SIZE;

2) drivers/nvme/host/core.c:nvme_set_ctrl_limits() sets:

        lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
        lim->max_segment_size = UINT_MAX;

3) blk_stack_limits(t=dm-crypt, b=nvme-pci) will combine limits:

        t->virt_boundary_mask = min_not_zero(t->virt_boundary_mask,
                                             b->virt_boundary_mask);
        t->max_segment_size = min_not_zero(t->max_segment_size,
                                           b->max_segment_size);

4) blk_validate_limits() will reject the limits that blk_stack_limits() created:

        /*
         * Devices that require a virtual boundary do not support scatter/gather
         * I/O natively, but instead require a descriptor list entry for each
         * page (which might not be identical to the Linux PAGE_SIZE). Because
         * of that they are not limited by our notion of "segment size".
         */
        if (lim->virt_boundary_mask) {
                if (WARN_ON_ONCE(lim->max_segment_size &&
                                 lim->max_segment_size != UINT_MAX))
                        return -EINVAL;
                lim->max_segment_size = UINT_MAX;
        } else {
                /*
                 * The maximum segment size has an odd historic 64k default that
                 * drivers probably should override. Just like the I/O size we
                 * require drivers to at least handle a full page per segment.
                 */
                if (!lim->max_segment_size)
                        lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
                if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
                        return -EINVAL;
        }

blk_validate_limits() is currently very pedantic. I discussed with Jens briefly and we're thinking it might make sense for blk_validate_limits() to be more forgiving by _not_ imposing hard -EINVAL failure. That in the interim, during this transition to more curated and atomic limits, a WARN_ON_ONCE() splat should serve as enough notice to developers (be it lower level nvme or higher-level virtual devices like DM).

BUT for this specific max_segment_size case, the constraints of dm-crypt are actually more conservative due to crypto requirements. Yet nvme's more general "don't care, but will care if non-nvme driver does" for this particular max_segment_size limit is being imposed when validating the combined limits that dm-crypt will impose at the top-level.

All said, the above "if (lim->virt_boundary_mask)" check in blk_validate_limits() looks bogus for stacked device limits.

Mike
On 3/12/24 5:53 AM, Christoph Hellwig wrote:
> On Mon, Mar 11, 2024 at 07:23:41PM -0600, Jens Axboe wrote:
>> Various NVMe devices do have different limits for things like max
>> transfer size etc, so if it's related to that, then it is possible that
>> nvme was used but just didn't trigger on that test case. Out of
>> curiosity, on your box where it's broken, what does:
>
> All nvme-pci setups with dm-crypt would trigger this.

This is most likely the key; basically all test suites are run in a VM these days, and not on raw nvme devices...
On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> 4) blk_validate_limits() will reject the limits that
> blk_stack_limits() created:
>
>         if (lim->virt_boundary_mask) {
>                 if (WARN_ON_ONCE(lim->max_segment_size &&
>                                  lim->max_segment_size != UINT_MAX))
>                         return -EINVAL;
>                 lim->max_segment_size = UINT_MAX;
>         } else {
>                 ...
>         }
>
> blk_validate_limits() is currently very pedantic. I discussed with Jens
> briefly and we're thinking it might make sense for blk_validate_limits()
> to be more forgiving by _not_ imposing hard -EINVAL failure. That in
> the interim, during this transition to more curated and atomic limits, a
> WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> lower level nvme or higher-level virtual devices like DM).
>
> BUT for this specific max_segment_size case, the constraints of dm-crypt
> are actually more conservative due to crypto requirements. Yet nvme's
> more general "don't care, but will care if non-nvme driver does" for
> this particular max_segment_size limit is being imposed when validating
> the combined limits that dm-crypt will impose at the top-level.
>
> All said, the above "if (lim->virt_boundary_mask)" check in
> blk_validate_limits() looks bogus for stacked device limits.

Yes, I think you're right. I can't tell why this check makes sense for any device, not just stacked ones. It could verify lim->max_segment_size is >= virt_boundary_mask, but to require it be UINT_MAX doesn't look necessary.
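A rough sketch of the looser check Keith is suggesting might look like the following; this is only an illustration of the idea, not a patch that was posted in this thread:

        if (lim->virt_boundary_mask) {
                /*
                 * Hypothetical relaxation: only require that a segment can hold
                 * at least one virt-boundary-aligned chunk, instead of insisting
                 * that max_segment_size be UINT_MAX.
                 */
                if (WARN_ON_ONCE(lim->max_segment_size &&
                                 lim->max_segment_size < lim->virt_boundary_mask + 1))
                        return -EINVAL;
                if (!lim->max_segment_size)
                        lim->max_segment_size = UINT_MAX;
        }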
On Mon, Mar 11, 2024 at 07:37:06PM -0600, Jens Axboe wrote:
> Summary is that this is obviously a pretty normal drive, and has the
> 128K transfer limit that's common there. So doesn't really explain
> anything in that regard. The segment size is also a bit odd at 33.

That's "max_segments" at 33, not segment size. Max payload is 128k, divide by 4k nvme page size = 32 nvme pages. +1 to allow a first segment offset, so 33 max segments for this device.
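Spelled out, that arithmetic (using the values from the sysfs dump above and the 4k NVMe controller page size Keith mentions; illustrative only):

        unsigned int max_payload_kb = 128;                              /* max_hw_sectors_kb */
        unsigned int nvme_page_kb   = 4;                                /* NVME_CTRL_PAGE_SIZE / 1024 */
        unsigned int nvme_pages     = max_payload_kb / nvme_page_kb;    /* 32 */
        unsigned int max_segments   = nvme_pages + 1;                   /* +1 for an offset first segment -> 33 */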
On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> blk_validate_limits() is currently very pedantic. I discussed with Jens
> briefly and we're thinking it might make sense for blk_validate_limits()
> to be more forgiving by _not_ imposing hard -EINVAL failure. That in
> the interim, during this transition to more curated and atomic limits, a
> WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> lower level nvme or higher-level virtual devices like DM).

I guess. And it more closely matches the status quo. That being said, I want to move to hard rejection eventually to catch all the issues.

> BUT for this specific max_segment_size case, the constraints of dm-crypt
> are actually more conservative due to crypto requirements.

Honestly, to me the dm-crypt requirement actually doesn't make much sense: max_segment_size is for hardware drivers that have requirements for SGLs or equivalent hardware interfaces. If dm-crypt never wants to see more than a single page per bio_vec it should just always iterate them using bio_for_each_segment.

> Yet nvme's
> more general "don't care, but will care if non-nvme driver does" for
> this particular max_segment_size limit is being imposed when validating
> the combined limits that dm-crypt will impose at the top-level.

The real problem is that we combine the limits while we shouldn't. Ever since we've supported immutable biovecs and do the splitting down in blk-mq there is no point to even inherit such limits in the upper drivers.
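For context, bio_for_each_segment() is the existing single-page bio iterator Christoph is referring to. A minimal, purely illustrative sketch of walking a bio one page at a time follows; process_page() is a made-up placeholder, not a real kernel function:

        #include <linux/bio.h>

        static void walk_bio_pages(struct bio *bio)
        {
                struct bio_vec bv;
                struct bvec_iter iter;

                /*
                 * bio_for_each_segment() never hands back a bio_vec that spans
                 * more than one page, regardless of how large the underlying
                 * multi-page bvecs are, so no max_segment_size limit is needed
                 * for this style of iteration.
                 */
                bio_for_each_segment(bv, bio, iter)
                        process_page(bv.bv_page, bv.bv_offset, bv.bv_len);
        }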
On Tue, Mar 12 2024 at 5:10P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> > blk_validate_limits() is currently very pedantic. I discussed with Jens
> > briefly and we're thinking it might make sense for blk_validate_limits()
> > to be more forgiving by _not_ imposing hard -EINVAL failure. That in
> > the interim, during this transition to more curated and atomic limits, a
> > WARN_ON_ONCE() splat should serve as enough notice to developers (be it
> > lower level nvme or higher-level virtual devices like DM).
>
> I guess. And it more closely matches the status quo. That being said,
> I want to move to hard rejection eventually to catch all the issues.
>
> > BUT for this specific max_segment_size case, the constraints of dm-crypt
> > are actually more conservative due to crypto requirements.
>
> Honestly, to me the dm-crypt requirement actually doesn't make much
> sense: max_segment_size is for hardware drivers that have requirements
> for SGLs or equivalent hardware interfaces. If dm-crypt never wants to
> see more than a single page per bio_vec it should just always iterate
> them using bio_for_each_segment.
>
> > Yet nvme's
> > more general "don't care, but will care if non-nvme driver does" for
> > this particular max_segment_size limit is being imposed when validating
> > the combined limits that dm-crypt will impose at the top-level.
>
> The real problem is that we combine the limits while we shouldn't.
> Ever since we've supported immutable biovecs and do the splitting
> down in blk-mq there is no point to even inherit such limits in the
> upper drivers.

immutable biovecs, late splitting and blk-mq aren't a factor.

dm-crypt has to contend with the crypto subsystem and HW crypto engines that have their own constraints.
On Tue, Mar 12, 2024 at 06:22:21PM -0400, Mike Snitzer wrote:
> > The real problem is that we combine the limits while we shouldn't.
> > Ever since we've supported immutable biovecs and do the splitting
> > down in blk-mq there is no point to even inherit such limits in the
> > upper drivers.
>
> immutable biovecs, late splitting and blk-mq aren't a factor.
>
> dm-crypt has to contend with the crypto subsystem and HW crypto
> engines that have their own constraints.

Yes, they are. The limit for the underlying device does not matter for an upper device as it will split later. And that's not just my opinion, you also clearly stated that in the commit adding the limits (586b286b110e94e). We should have stopped inheriting all these limits only relevant for splitting when we switched to immutable bvecs. I don't know why we didn't, but a big part of that might be that we never made clear which limits these are.
On Tue, Mar 12 2024 at 6:30P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 12, 2024 at 06:22:21PM -0400, Mike Snitzer wrote:
> > > The real problem is that we combine the limits while we shouldn't.
> > > Ever since we've supported immutable biovecs and do the splitting
> > > down in blk-mq there is no point to even inherit such limits in the
> > > upper drivers.
> >
> > immutable biovecs, late splitting and blk-mq aren't a factor.
> >
> > dm-crypt has to contend with the crypto subsystem and HW crypto
> > engines that have their own constraints.
>
> Yes, they are. The limit for the underlying device does not matter for
> an upper device as it will split later. And that's not just my
> opinion, you also clearly stated that in the commit adding the
> limits (586b286b110e94e). We should have stopped inheriting all
> these limits only relevant for splitting when we switched to
> immutable bvecs. I don't know why we didn't, but a big part of
> that might be that we never made clear which limits these are.

Wow, using my 8+ year old commit message against me ;)

I've honestly paged most of this out, but I'll revisit, likely with Mikulas, to pin this down better and then see what's possible.
On Tue, Mar 12, 2024 at 06:50:51PM -0400, Mike Snitzer wrote:
> Wow, using my 8+ year old commit message against me ;)

Or for you :)

> I've honestly paged most of this out, but I'll revisit, likely with
> Mikulas, to pin this down better and then see what's possible.

FYI, I don't think this is really a dm issue, but one of block infrastructure. But looping in the dm and md maintainers as well as Martin, who wrote the stacking code originally, is definitely a good idea.
On Tue, Mar 12, 2024 at 02:10:13PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 12, 2024 at 11:22:53AM -0400, Mike Snitzer wrote:
> > blk_validate_limits() is currently very pedantic. I discussed with Jens
> > briefly and we're thinking it might make sense for blk_validate_limits()
> > to be more forgiving by _not_ imposing hard -EINVAL failure.
[ ... ]
>
> The real problem is that we combine the limits while we shouldn't.
> Ever since we've supported immutable biovecs and do the splitting
> down in blk-mq there is no point to even inherit such limits in the
> upper drivers.

In theory, yes. DM doesn't even use the combined segment size & virt boundary, but MD uses them (maybe unnecessarily), and the two are often stacked. There may be corner cases, and removing the combination of the two limits could be a big change for DM/MD since things have worked this way for a long time.

The warning & failure in blk_validate_limits() can fail any MD/DM which is over scsi & nvme, so I'd suggest removing the 'warning & -EINVAL' first, otherwise more complaints may follow.

Thanks,
Ming
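As a rough illustration of what Ming is asking for, the hard failure could be downgraded to a warning while the stacking behaviour gets sorted out. This is only a sketch of the idea, not a patch from the thread:

        if (lim->virt_boundary_mask) {
                /*
                 * Hypothetical interim behaviour: warn about the conflicting
                 * stacked limit, then fix it up and keep going instead of
                 * failing table load with -EINVAL.
                 */
                WARN_ON_ONCE(lim->max_segment_size &&
                             lim->max_segment_size != UINT_MAX);
                lim->max_segment_size = UINT_MAX;
        }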
Hi,

I'd like to get extra review and testing for these changes given how DM's use of queue_limits_set broke Linus's dm-crypt on NVMe setup during the 6.9 merge window.

These changes have been staged in linux-next via linux-dm.git, and while they should apply cleanly on 6.9-rcX they have been applied on top of dm-6.10, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-6.10

Thanks,
Mike

Christoph Hellwig (1):
      dm: use queue_limits_set

Mike Snitzer (1):
      dm-crypt: stop constraining max_segment_size to PAGE_SIZE

 drivers/md/dm-crypt.c | 12 ++----------
 drivers/md/dm-table.c | 27 ++++++++++++---------------
 2 files changed, 14 insertions(+), 25 deletions(-)