Message ID | 20240510102906.51844-6-hare@kernel.org (mailing list archive)
---|---
State | New
Series | enable bs > ps for block devices
On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Don't set the capacity to zero for when logical block size > PAGE_SIZE
> as the block device with iomap aops support allocating block cache with
> a minimum folio order.

It feels like this should be something the block layer does rather than
something an individual block driver does.  Is there similar code to
rip out of sd.c?  Or other block drivers?
On 5/11/24 00:37, Matthew Wilcox wrote:
> On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> Don't set the capacity to zero for when logical block size > PAGE_SIZE
>> as the block device with iomap aops support allocating block cache with
>> a minimum folio order.
>
> It feels like this should be something the block layer does rather than
> something an individual block driver does.  Is there similar code to
> rip out of sd.c?  Or other block drivers?

Correct, this restriction should never have been there.
I'll be checking sd.c, too.

Cheers,

Hannes
On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Don't set the capacity to zero for when logical block size > PAGE_SIZE
> as the block device with iomap aops support allocating block cache with
> a minimum folio order.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/core.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 828c77fa13b7..5f1308daa74f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1963,11 +1963,10 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
>  	bool valid = true;
>
>  	/*
> -	 * The block layer can't support LBA sizes larger than the page size
> -	 * or smaller than a sector size yet, so catch this early and don't
> -	 * allow block I/O.
> +	 * The block layer can't support LBA sizes smaller than a sector size,
> +	 * so catch this early and don't allow block I/O.
>  	 */
> -	if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
> +	if (head->lba_shift < SECTOR_SHIFT) {

We can't just do this; we need to consider the actual NVMe cap (test it,
and if it crashes below what the page cache supports, then we have to
cap it lower) to make the enablement easier. So we could just move this
to a helper [0]. Then, when the bdev cache patch goes through, the
check for CONFIG_BUFFER_HEAD can be removed, if this goes first.

We crash if we go above 1 MiB today; we should be able to go up to 2
MiB, but that requires some review to see what stupid thing is getting
in the way.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53

  Luis
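The shift arithmetic behind this check can be sketched in plain userspace C. This is a minimal sketch for illustration only: the helper name is made up, and the constants are the usual 512-byte-sector / 4 KiB-page values, not taken from any driver.

```c
#include <assert.h>

#define SECTOR_SHIFT 9   /* 512-byte sectors */
#define PAGE_SHIFT   12  /* 4 KiB pages, typical on x86-64 */

/*
 * Hypothetical helper: minimum page-cache folio order needed so that a
 * single folio covers one logical block of size (1 << lba_shift).
 * Returns -1 for sub-sector LBA sizes, which remain rejected.
 */
static int min_folio_order(int lba_shift)
{
	if (lba_shift < SECTOR_SHIFT)
		return -1;                 /* LBA smaller than a sector */
	if (lba_shift <= PAGE_SHIFT)
		return 0;                  /* order-0 pages suffice */
	return lba_shift - PAGE_SHIFT; /* needs large folios */
}
```

With these assumptions, a 64 KiB LBA format (lba_shift = 16) needs order-4 folios, while anything up to 4 KiB still works with order-0 pages.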
On Fri, May 10, 2024 at 11:37:22PM +0100, Matthew Wilcox wrote:
> On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > Don't set the capacity to zero for when logical block size > PAGE_SIZE
> > as the block device with iomap aops support allocating block cache with
> > a minimum folio order.
>
> It feels like this should be something the block layer does rather than
> something an individual block driver does.  Is there similar code to
> rip out of sd.c?  Or other block drivers?

Each block driver must be tested and set a max if it's below what the
page cache supports.

  Luis
On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> We can't just do this; we need to consider the actual NVMe cap (test it,
> and if it crashes below what the page cache supports, then we have to
> cap it lower) to make the enablement easier. So we could just move this
> to a helper [0]. Then, when the bdev cache patch goes through, the
> check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
>
> We crash if we go above 1 MiB today; we should be able to go up to 2
> MiB, but that requires some review to see what stupid thing is getting
> in the way.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53

This is overengineered garbage.  What's the crash?
On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
> On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> > We can't just do this; we need to consider the actual NVMe cap (test it,
> > and if it crashes below what the page cache supports, then we have to
> > cap it lower) to make the enablement easier. So we could just move this
> > to a helper [0]. Then, when the bdev cache patch goes through, the
> > check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> >
> > We crash if we go above 1 MiB today; we should be able to go up to 2
> > MiB, but that requires some review to see what stupid thing is getting
> > in the way.
> >
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
>
> This is overengineered garbage.  What's the crash?

I had only tested it with iomap; I had not tested it with buffer-heads,
so it would require re-testing. It's Saturday 5pm, I should be doing
something else other than being on my computer.

  Luis
On Sat, May 11, 2024 at 04:51:32PM -0700, Luis Chamberlain wrote:
> On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
> > On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> > > We can't just do this; we need to consider the actual NVMe cap (test it,
> > > and if it crashes below what the page cache supports, then we have to
> > > cap it lower) to make the enablement easier. So we could just move this
> > > to a helper [0]. Then, when the bdev cache patch goes through, the
> > > check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> > >
> > > We crash if we go above 1 MiB today; we should be able to go up to 2
> > > MiB, but that requires some review to see what stupid thing is getting
> > > in the way.
> > >
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
> >
> > This is overengineered garbage.  What's the crash?
>
> I had only tested it with iomap; I had not tested it with buffer-heads,
> so it would require re-testing. It's Saturday 5pm, I should be doing
> something else other than being on my computer.

It crashes because we forgot two things in this series; the change below
lets us *at least* boot up to a 64k LBA format on NVMe.

One can reproduce this with kdevops with:

  make defconfig-lbs-xfs-bdev-nvme
  make bringup
  make linux

I've added another defconfig which bumps the LBA format up to 512 KiB to
see if bootup blows up:

  make lbs-xfs-bdev-large-nvme
  make bringup
  make linux

That at least booted. Note that the above defconfigs use this thread's
message ID, so they apply this series on top of the min order branch.
Just the patch below is needed.

I'll try next going above 512 KiB.
diff --git a/fs/buffer.c b/fs/buffer.c
index 4f73d23c2c46..fa88e300a946 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2360,8 +2360,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
 		limit = inode->i_sb->s_maxbytes;
 
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
 	head = folio_create_buffers(folio, inode, 0);
 	blocksize = head->b_size;
 
diff --git a/fs/mpage.c b/fs/mpage.c
index e3732686e65f..e124c924b2e7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -178,7 +178,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
 
 	/* MAX_BUF_PER_PAGE, for example */
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
 
 	if (args->is_readahead) {
 		opf |= REQ_RAHEAD;
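The two VM_BUG_ON_FOLIO asserts being removed appear to guard paths written when a folio was always a single page, e.g. on-stack arrays sized by MAX_BUF_PER_PAGE. The buffer-count arithmetic involved can be sketched in userspace; the MAX_BUF_PER_PAGE definition mirrors the kernel's, while the helper itself is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096
/* Mirrors the kernel's definition in include/linux/buffer_head.h */
#define MAX_BUF_PER_PAGE (PAGE_SIZE / 512)

/*
 * With large folios, the number of buffer_heads attached to a folio is
 * folio_size / blocksize, which no longer has MAX_BUF_PER_PAGE as an
 * upper bound (illustrative helper, not kernel code).
 */
static size_t bufs_per_folio(size_t folio_size, size_t blocksize)
{
	return folio_size / blocksize;
}
```

For an order-0 folio with 512-byte blocks this is exactly MAX_BUF_PER_PAGE (8); for a 64 KiB folio carrying a single 64 KiB block it is 1, which is why the order-0 asserts had to go.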
On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> I'll try next going above 512 KiB.
At a 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) in bio_split().
[ 13.401651] ------------[ cut here ]------------
[ 13.403298] kernel BUG at block/bio.c:1626!
[ 13.404708] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 13.406390] CPU: 5 PID: 87 Comm: kworker/u38:1 Not tainted 6.9.0-rc6+ #2
[ 13.408480] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 13.411304] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 13.412928] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1))
[ 13.414148] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
All code
========
0: 5b pop %rbx
1: 4c 89 e0 mov %r12,%rax
4: 5d pop %rbp
5: 41 5c pop %r12
7: 41 5d pop %r13
9: c3 ret
a: cc int3
b: cc int3
c: cc int3
d: cc int3
e: c7 43 28 00 00 00 00 movl $0x0,0x28(%rbx)
15: eb db jmp 0xfffffffffffffff2
17: 0f 0b ud2
19: 45 31 e4 xor %r12d,%r12d
1c: 5b pop %rbx
1d: 5d pop %rbp
1e: 4c 89 e0 mov %r12,%rax
21: 41 5c pop %r12
23: 41 5d pop %r13
25: c3 ret
26: cc int3
27: cc int3
28: cc int3
29: cc int3
2a:* 0f 0b ud2 <-- trapping instruction
2c: 0f 0b ud2
2e: 4c 89 e7 mov %r12,%rdi
31: e8 bf ee ff ff call 0xffffffffffffeef5
36: eb e1 jmp 0x19
38: 66 data16
39: 66 data16
3a: 2e cs
3b: 0f .byte 0xf
3c: 1f (bad)
3d: 84 00 test %al,(%rax)
...
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 0f 0b ud2
4: 4c 89 e7 mov %r12,%rdi
7: e8 bf ee ff ff call 0xffffffffffffeecb
c: eb e1 jmp 0xffffffffffffffef
e: 66 data16
f: 66 data16
10: 2e cs
11: 0f .byte 0xf
12: 1f (bad)
13: 84 00 test %al,(%rax)
...
[ 13.420639] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
[ 13.422140] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
[ 13.424128] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
[ 13.426123] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[ 13.428372] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
[ 13.430636] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 13.432510] FS: 0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
[ 13.434539] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.435987] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
[ 13.437664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 13.439354] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 13.440911] PKRU: 55555554
[ 13.441605] Call Trace:
[ 13.442250] <TASK>
[ 13.442836] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
[ 13.443585] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155)
[ 13.444356] ? bio_split (block/bio.c:1626 (discriminator 1))
[ 13.445127] ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176)
[ 13.445975] ? bio_split (block/bio.c:1626 (discriminator 1))
[ 13.446760] ? exc_invalid_op (arch/x86/kernel/traps.c:267)
[ 13.447635] ? bio_split (block/bio.c:1626 (discriminator 1))
[ 13.448325] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
[ 13.449137] ? bio_split (block/bio.c:1626 (discriminator 1))
[ 13.449819] __bio_split_to_limits (block/blk-merge.c:366)
[ 13.450675] blk_mq_submit_bio (block/blk-mq.c:2973)
[ 13.451490] ? kmem_cache_alloc (./include/linux/kmemleak.h:42 mm/slub.c:3802 mm/slub.c:3845 mm/slub.c:3852)
[ 13.452253] ? __mod_node_page_state (mm/vmstat.c:403 (discriminator 2))
[ 13.453050] submit_bio_noacct_nocheck (./include/linux/bio.h:639 block/blk-core.c:701 block/blk-core.c:729)
[ 13.453897] ? submit_bio_noacct (block/blk-core.c:758 (discriminator 1))
[ 13.454658] block_read_full_folio (fs/buffer.c:2429 (discriminator 1))
[ 13.455467] ? __pfx_blkdev_get_block (block/fops.c:409)
[ 13.456240] ? __pfx_blkdev_read_folio (block/fops.c:437)
[ 13.457014] ? __pfx_blkdev_read_folio (block/fops.c:437)
[ 13.457792] filemap_read_folio (mm/filemap.c:2335)
[ 13.458484] do_read_cache_folio (mm/filemap.c:3763)
[ 13.459220] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351)
[ 13.459967] read_part_sector (./include/linux/pagemap.h:970 block/partitions/core.c:715)
[ 13.460602] adfspart_check_ICS (block/partitions/acorn.c:361)
[ 13.461269] ? snprintf (lib/vsprintf.c:2963)
[ 13.461844] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351)
[ 13.462601] bdev_disk_changed (block/partitions/core.c:138 block/partitions/core.c:582 block/partitions/core.c:686 block/partitions/core.c:635)
[ 13.463268] blkdev_get_whole (block/bdev.c:684)
[ 13.463871] bdev_open (block/bdev.c:893)
[ 13.464386] bdev_file_open_by_dev (block/bdev.c:995 block/bdev.c:969)
[ 13.465006] disk_scan_partitions (block/genhd.c:369 (discriminator 1))
[ 13.465621] device_add_disk (block/genhd.c:512)
[ 13.466199] nvme_scan_ns (drivers/nvme/host/core.c:3807 (discriminator 1) drivers/nvme/host/core.c:3961 (discriminator 1)) nvme_core
[ 13.466885] nvme_scan_work (drivers/nvme/host/core.c:4015 drivers/nvme/host/core.c:4105) nvme_core
[ 13.467598] process_one_work (kernel/workqueue.c:3254)
[ 13.468186] worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2))
[ 13.468723] ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4))
[ 13.469343] ? __pfx_worker_thread (kernel/workqueue.c:3362)
[ 13.469933] kthread (kernel/kthread.c:388)
[ 13.470395] ? __pfx_kthread (kernel/kthread.c:341)
[ 13.470928] ret_from_fork (arch/x86/kernel/process.c:147)
[ 13.471457] ? __pfx_kthread (kernel/kthread.c:341)
[ 13.472008] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
[ 13.472540] </TASK>
[ 13.472865] Modules linked in: crc32c_intel psmouse nvme nvme_core t10_pi crc64_rocksoft crc64 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[ 13.474629] ---[ end trace 0000000000000000 ]---
[ 13.475229] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1))
[ 13.475742] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
[ 13.477749] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
[ 13.478405] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
[ 13.479230] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
[ 13.480026] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[ 13.480802] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
[ 13.481568] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 13.482337] FS: 0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
[ 13.483218] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.483853] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
[ 13.484584] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 13.485327] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 13.486070] PKRU: 55555554
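One plausible reading of the BUG_ON(sectors <= 0): when bio_split() has to cut a bio at a queue limit that is smaller than one logical block, aligning the split down to the logical block size yields zero sectors. The sketch below models only that rounding; it is an assumption about the mechanism, not the actual block-layer code, and the 1280-sector queue limit is made up for illustration.

```c
#include <assert.h>

/*
 * Down-align a split of at most max_sectors to the logical block size
 * (both in 512-byte sectors), since a split may not land inside a
 * logical block. Illustrative only.
 */
static int split_sectors(int max_sectors, int lbs_sectors)
{
	return (max_sectors / lbs_sectors) * lbs_sectors;
}
```

With a hypothetical 1280-sector (640 KiB) limit, a 512 KiB logical block (1024 sectors) still splits cleanly, but a 1 MiB logical block (2048 sectors) rounds down to 0 sectors, which is exactly what BUG_ON(sectors <= 0) in bio_split() would trip on.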
On 5/12/24 04:43, Luis Chamberlain wrote:
> On Sat, May 11, 2024 at 04:51:32PM -0700, Luis Chamberlain wrote:
>> On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
>>> This is overengineered garbage.  What's the crash?
>>
>> I had only tested it with iomap, I had not tested it with buffer-heads,
>> and so it would require re-testing.
>
> It crashes because we forgot two things in this series; the change below
> lets us *at least* boot up to a 64k LBA format on NVMe.
>
[...]
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 4f73d23c2c46..fa88e300a946 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2360,8 +2360,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>  	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>  		limit = inode->i_sb->s_maxbytes;
>  
> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>  	head = folio_create_buffers(folio, inode, 0);
>  	blocksize = head->b_size;
>
> diff --git a/fs/mpage.c b/fs/mpage.c
> index e3732686e65f..e124c924b2e7 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -178,7 +178,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
>  
>  	/* MAX_BUF_PER_PAGE, for example */
> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>  
>  	if (args->is_readahead) {
>  		opf |= REQ_RAHEAD;

Thanks. Will be including that in the next round.

Cheers,

Hannes
On Mon, May 13, 2024 at 03:43:38PM +0200, Hannes Reinecke wrote:
> On 5/12/24 04:43, Luis Chamberlain wrote:
> > It crashes because we forgot two things in this series; the change below
> > lets us *at least* boot up to a 64k LBA format on NVMe.
> >
[...]
> > I'll try next going above 512 KiB.
>
> Thanks. Will be including that in the next round.

I gave it a spin with the above changes, results:

https://github.com/linux-kdevops/kdevops/commit/f2641efe7d2037892e0477fdc8daf66af59c1f01
https://github.com/linux-kdevops/kdevops-results-archive/commit/c3ac880e909a30aa2f9b24231d0f7297dfdf4e8e

  Luis
On 5/12/24 11:16, Luis Chamberlain wrote:
> On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
>> I'll try next going above 512 KiB.
>
> At a 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) in bio_split().
>
> [ 13.401651] ------------[ cut here ]------------
> [ 13.403298] kernel BUG at block/bio.c:1626!
[...]

Ah. MAX_BUF_PER_PAGE getting in the way.

Can you test with the attached patch?

Cheers,

Hannes
On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
> On 5/12/24 11:16, Luis Chamberlain wrote:
> > On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> > > I'll try next going above 512 KiB.
> >
> > At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
> >
> > [ 13.401651] ------------[ cut here ]------------
> > [ 13.403298] kernel BUG at block/bio.c:1626!
> Ah. MAX_BUFS_PER_PAGE getting in the way.
>
> Can you test with the attached patch?

Nope, same crash:

I've enabled you to easily test this with NVMe on libvirt with kdevops,
please test.

  Luis

[ 14.972734] ------------[ cut here ]------------
[ 14.974731] kernel BUG at block/bio.c:1626!
[ 14.976906] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 14.978899] CPU: 3 PID: 59 Comm: kworker/u36:0 Not tainted 6.9.0-rc6+ #4
[ 14.981005] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 14.983782] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 14.985431] RIP: 0010:bio_split+0xd5/0xf0
[ 14.986627] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
[ 14.992063] RSP: 0018:ffffbecc002378d0 EFLAGS: 00010246
[ 14.993416] RAX: 0000000000000001 RBX: ffff9e2fe8583e40 RCX: ffff9e2fdcb73060
[ 14.995181] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9e2fe8583e40
[ 14.996960] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[ 14.998715] R10: ffff9e2fe8583e40 R11: ffff9e2fe8583eb8 R12: ffff9e2fe884b750
[ 15.000510] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 15.002128] FS:  0000000000000000(0000) GS:ffff9e303bcc0000(0000) knlGS:0000000000000000
[ 15.003956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.005294] CR2: 0000561b2b5ce478 CR3: 0000000102484002 CR4: 0000000000770ef0
[ 15.006921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.008509] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 15.010001] PKRU: 55555554
[ 15.010672] Call Trace:
[ 15.011297]  <TASK>
[ 15.011868]  ? die+0x32/0x80
[ 15.012572]  ? do_trap+0xd9/0x100
[ 15.013306]  ? bio_split+0xd5/0xf0
[ 15.014051]  ? do_error_trap+0x6a/0x90
[ 15.014854]  ? bio_split+0xd5/0xf0
[ 15.015597]  ? exc_invalid_op+0x4c/0x60
[ 15.016419]  ? bio_split+0xd5/0xf0
[ 15.017113]  ? asm_exc_invalid_op+0x16/0x20
[ 15.017932]  ? bio_split+0xd5/0xf0
[ 15.018624]  __bio_split_to_limits+0x90/0x2d0
[ 15.019474]  blk_mq_submit_bio+0x111/0x6a0
[ 15.020280]  ? kmem_cache_alloc+0x254/0x2e0
[ 15.021040]  submit_bio_noacct_nocheck+0x2f1/0x3d0
[ 15.021893]  ? submit_bio_noacct+0x42/0x5b0
[ 15.022658]  block_read_full_folio+0x2b7/0x350
[ 15.023457]  ? __pfx_blkdev_get_block+0x10/0x10
[ 15.024284]  ? __pfx_blkdev_read_folio+0x10/0x10
[ 15.025073]  ? __pfx_blkdev_read_folio+0x10/0x10
[ 15.025851]  filemap_read_folio+0x32/0xb0
[ 15.026540]  do_read_cache_folio+0x108/0x200
[ 15.027271]  ? __pfx_adfspart_check_ICS+0x10/0x10
[ 15.028066]  read_part_sector+0x32/0xe0
[ 15.028701]  adfspart_check_ICS+0x32/0x480
[ 15.029334]  ? snprintf+0x49/0x70
[ 15.029875]  ? __pfx_adfspart_check_ICS+0x10/0x10
[ 15.030592]  bdev_disk_changed+0x2a2/0x6e0
[ 15.031226]  blkdev_get_whole+0x5f/0xa0
[ 15.031827]  bdev_open+0x201/0x3c0
[ 15.032360]  bdev_file_open_by_dev+0xb5/0x110
[ 15.032990]  disk_scan_partitions+0x65/0xe0
[ 15.033598]  device_add_disk+0x3e0/0x3f0
[ 15.034172]  nvme_scan_ns+0x5f0/0xe50 [nvme_core]
[ 15.034862]  nvme_scan_work+0x26f/0x5a0 [nvme_core]
[ 15.035568]  process_one_work+0x189/0x3b0
[ 15.036168]  worker_thread+0x273/0x390
[ 15.036713]  ? __pfx_worker_thread+0x10/0x10
[ 15.037312]  kthread+0xda/0x110
[ 15.037779]  ? __pfx_kthread+0x10/0x10
[ 15.038316]  ret_from_fork+0x2d/0x50
[ 15.038829]  ? __pfx_kthread+0x10/0x10
[ 15.039364]  ret_from_fork_asm+0x1a/0x30
[ 15.039924]  </TASK>
On 5/13/24 23:05, Luis Chamberlain wrote:
> On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
>> On 5/12/24 11:16, Luis Chamberlain wrote:
>>> On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
>>>> I'll try next going above 512 KiB.
>>>
>>> At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
>>>
>>> [ 13.401651] ------------[ cut here ]------------
>>> [ 13.403298] kernel BUG at block/bio.c:1626!
>> Ah. MAX_BUFS_PER_PAGE getting in the way.
>>
>> Can you test with the attached patch?
>
> Nope same crash:
>
> I've enabled you to easily test with NVMe on libvirt with kdevops,
> please test.
>
>   Luis
>
> [ 14.972734] ------------[ cut here ]------------
> [ 14.974731] kernel BUG at block/bio.c:1626!
> [ 14.976906] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [...]
> [ 15.010672] Call Trace:
> [ 15.011297]  <TASK>
> [...]
> [ 15.018624]  __bio_split_to_limits+0x90/0x2d0
> [ 15.019474]  blk_mq_submit_bio+0x111/0x6a0
> [...]
> [ 15.025851]  filemap_read_folio+0x32/0xb0
> [...]
> [ 15.039924]  </TASK>
>
Ah. So this should fix it:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4e3483a16b75..4fac11edd0c8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -289,7 +289,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
-		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+		    bv.bv_offset + bv.bv_len <= lim->max_segment_size) {
 			nsegs++;
 			bytes += bv.bv_len;
 		} else {

Cheers,

Hannes
On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
> On 5/12/24 11:16, Luis Chamberlain wrote:
> > On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> > > I'll try next going above 512 KiB.
> >
> > At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
>
> Ah. MAX_BUFS_PER_PAGE getting in the way.

That doesn't make sense.  We will need something like this patch for
filesystems which use large folios with buffer_heads.  But for a 1MB
block size device, we should have one buffer_head per folio.
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 828c77fa13b7..5f1308daa74f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1963,11 +1963,10 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 	bool valid = true;
 
 	/*
-	 * The block layer can't support LBA sizes larger than the page size
-	 * or smaller than a sector size yet, so catch this early and don't
-	 * allow block I/O.
+	 * The block layer can't support LBA sizes smaller than a sector size,
+	 * so catch this early and don't allow block I/O.
 	 */
-	if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
+	if (head->lba_shift < SECTOR_SHIFT) {
 		bs = (1 << 9);
 		valid = false;
 	}