
[5/5] nvme: enable logical block size > PAGE_SIZE

Message ID 20240510102906.51844-6-hare@kernel.org (mailing list archive)
State New
Series enable bs > ps for block devices

Commit Message

Hannes Reinecke May 10, 2024, 10:29 a.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

Don't set the capacity to zero when the logical block size > PAGE_SIZE,
as block devices with iomap aops support allocating the block cache
with a minimum folio order.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/core.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Matthew Wilcox May 10, 2024, 10:37 p.m. UTC | #1
On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Don't set the capacity to zero for when logical block size > PAGE_SIZE
> as the block device with iomap aops support allocating block cache with
> a minimum folio order.

It feels like this should be something the block layer does rather than
something an individual block driver does.  Is there similar code to
rip out of sd.c?  Or other block drivers?
Hannes Reinecke May 11, 2024, 9:18 a.m. UTC | #2
On 5/11/24 00:37, Matthew Wilcox wrote:
> On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> Don't set the capacity to zero for when logical block size > PAGE_SIZE
>> as the block device with iomap aops support allocating block cache with
>> a minimum folio order.
> 
> It feels like this should be something the block layer does rather than
> something an individual block driver does.  Is there similar code to
> rip out of sd.c?  Or other block drivers?

Correct, this restriction should never have been there. I'll be checking 
sd.c, too.

Cheers,

Hannes
Luis Chamberlain May 11, 2024, 11:09 p.m. UTC | #3
On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Don't set the capacity to zero for when logical block size > PAGE_SIZE
> as the block device with iomap aops support allocating block cache with
> a minimum folio order.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/core.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 828c77fa13b7..5f1308daa74f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1963,11 +1963,10 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
>  	bool valid = true;
>  
>  	/*
> -	 * The block layer can't support LBA sizes larger than the page size
> -	 * or smaller than a sector size yet, so catch this early and don't
> -	 * allow block I/O.
> +	 * The block layer can't support LBA sizes smaller than a sector size,
> +	 * so catch this early and don't allow block I/O.
>  	 */
> -	if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
> +	if (head->lba_shift < SECTOR_SHIFT) {

We can't just do this; we need to consider the actual NVMe cap (test
it, and if it crashes below what the page cache supports, then we have
to cap it lower), which also makes the enablement easier. We could just
move this check to a helper [0]. Then, when the bdev cache patch goes
through, the check for CONFIG_BUFFER_HEAD can be removed, if this goes
in first.

We crash if we go above 1 MiB today; we should be able to go up to 2
MiB, but that requires some review to see what stupid thing is getting
in the way.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53

  Luis
Luis Chamberlain May 11, 2024, 11:11 p.m. UTC | #4
On Fri, May 10, 2024 at 11:37:22PM +0100, Matthew Wilcox wrote:
> On Fri, May 10, 2024 at 12:29:06PM +0200, hare@kernel.org wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Don't set the capacity to zero for when logical block size > PAGE_SIZE
> > as the block device with iomap aops support allocating block cache with
> > a minimum folio order.
> 
> It feels like this should be something the block layer does rather than
> something an individual block driver does.  Is there similar code to
> rip out of sd.c?  Or other block drivers?

Each block driver must be tested, and must set a max if its limit is
below what the page cache supports.

  Luis
Matthew Wilcox May 11, 2024, 11:30 p.m. UTC | #5
On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> We can't just do this, we need to consider the actual nvme cap (test it,
> and if it crashes and below what the page cache supports, then we have
> to go below) and so to make the enablment easier. So we could just move
> this to helper [0]. Then when the bdev cache patch goes through the
> check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> 
> We crash if we go above 1 MiB today, we should be able to go up to 2
> MiB but that requires some review to see what stupid thing is getting
> in the way.
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53

This is overengineered garbage.  What's the crash?
Luis Chamberlain May 11, 2024, 11:51 p.m. UTC | #6
On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
> On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> > We can't just do this, we need to consider the actual nvme cap (test it,
> > and if it crashes and below what the page cache supports, then we have
> > to go below) and so to make the enablment easier. So we could just move
> > this to helper [0]. Then when the bdev cache patch goes through the
> > check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> > 
> > We crash if we go above 1 MiB today, we should be able to go up to 2
> > MiB but that requires some review to see what stupid thing is getting
> > in the way.
> > 
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
> 
> This is overengineered garbage.  What's the crash?

I had only tested it with iomap, not with buffer-heads, so it would
require re-testing. It's Saturday 5pm; I should be doing something
other than being on my computer.

  Luis
Luis Chamberlain May 12, 2024, 2:43 a.m. UTC | #7
On Sat, May 11, 2024 at 04:51:32PM -0700, Luis Chamberlain wrote:
> On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
> > On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> > > We can't just do this, we need to consider the actual nvme cap (test it,
> > > and if it crashes and below what the page cache supports, then we have
> > > to go below) and so to make the enablment easier. So we could just move
> > > this to helper [0]. Then when the bdev cache patch goes through the
> > > check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> > > 
> > > We crash if we go above 1 MiB today, we should be able to go up to 2
> > > MiB but that requires some review to see what stupid thing is getting
> > > in the way.
> > > 
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
> > 
> > This is overengineered garbage.  What's the crash?
> 
> I had only tested it with iomap, I had not tested it with buffer-heads,
> and so it would require re-testing. It's Saturday 5pm, I should be doing
> something else other than being on my computer.

It crashes because we forgot two things in this series; the change
below lets us *at least* boot with a 64k LBA format on NVMe.

One can reproduce this with kdevops with:

make defconfig-lbs-xfs-bdev-nvme
make bringup
make linux

I've added another defconfig which bumps the LBA format up to 512 KiB,
to see if bootup blows up:
make lbs-xfs-bdev-large-nvme
make bringup
make linux

That at least booted. Note that the above defconfigs use this thread's
message ID, so they apply this series on top of the min order branch.
The patch below is still needed.

I'll try next going above 512 KiB.
 
diff --git a/fs/buffer.c b/fs/buffer.c 
index 4f73d23c2c46..fa88e300a946 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2360,8 +2360,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
 		limit = inode->i_sb->s_maxbytes;
 
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
 	head = folio_create_buffers(folio, inode, 0);
 	blocksize = head->b_size;
 
diff --git a/fs/mpage.c b/fs/mpage.c
index e3732686e65f..e124c924b2e7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -178,7 +178,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
 
 	/* MAX_BUF_PER_PAGE, for example */
-	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
 
 	if (args->is_readahead) {
 		opf |= REQ_RAHEAD;
Luis Chamberlain May 12, 2024, 9:16 a.m. UTC | #8
On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> I'll try next going above 512 KiB.

At a 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) in bio_split().

[   13.401651] ------------[ cut here ]------------
[   13.403298] kernel BUG at block/bio.c:1626!
[   13.404708] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   13.406390] CPU: 5 PID: 87 Comm: kworker/u38:1 Not tainted 6.9.0-rc6+ #2
[   13.408480] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   13.411304] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[   13.412928] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1)) 
[ 13.414148] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
All code
========
   0:	5b                   	pop    %rbx
   1:	4c 89 e0             	mov    %r12,%rax
   4:	5d                   	pop    %rbp
   5:	41 5c                	pop    %r12
   7:	41 5d                	pop    %r13
   9:	c3                   	ret
   a:	cc                   	int3
   b:	cc                   	int3
   c:	cc                   	int3
   d:	cc                   	int3
   e:	c7 43 28 00 00 00 00 	movl   $0x0,0x28(%rbx)
  15:	eb db                	jmp    0xfffffffffffffff2
  17:	0f 0b                	ud2
  19:	45 31 e4             	xor    %r12d,%r12d
  1c:	5b                   	pop    %rbx
  1d:	5d                   	pop    %rbp
  1e:	4c 89 e0             	mov    %r12,%rax
  21:	41 5c                	pop    %r12
  23:	41 5d                	pop    %r13
  25:	c3                   	ret
  26:	cc                   	int3
  27:	cc                   	int3
  28:	cc                   	int3
  29:	cc                   	int3
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	0f 0b                	ud2
  2e:	4c 89 e7             	mov    %r12,%rdi
  31:	e8 bf ee ff ff       	call   0xffffffffffffeef5
  36:	eb e1                	jmp    0x19
  38:	66                   	data16
  39:	66                   	data16
  3a:	2e                   	cs
  3b:	0f                   	.byte 0xf
  3c:	1f                   	(bad)
  3d:	84 00                	test   %al,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	0f 0b                	ud2
   4:	4c 89 e7             	mov    %r12,%rdi
   7:	e8 bf ee ff ff       	call   0xffffffffffffeecb
   c:	eb e1                	jmp    0xffffffffffffffef
   e:	66                   	data16
   f:	66                   	data16
  10:	2e                   	cs
  11:	0f                   	.byte 0xf
  12:	1f                   	(bad)
  13:	84 00                	test   %al,(%rax)
	...
[   13.420639] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
[   13.422140] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
[   13.424128] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
[   13.426123] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[   13.428372] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
[   13.430636] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[   13.432510] FS:  0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
[   13.434539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.435987] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
[   13.437664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   13.439354] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   13.440911] PKRU: 55555554
[   13.441605] Call Trace:
[   13.442250]  <TASK>
[   13.442836] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) 
[   13.443585] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155) 
[   13.444356] ? bio_split (block/bio.c:1626 (discriminator 1)) 
[   13.445127] ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176) 
[   13.445975] ? bio_split (block/bio.c:1626 (discriminator 1)) 
[   13.446760] ? exc_invalid_op (arch/x86/kernel/traps.c:267) 
[   13.447635] ? bio_split (block/bio.c:1626 (discriminator 1)) 
[   13.448325] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) 
[   13.449137] ? bio_split (block/bio.c:1626 (discriminator 1)) 
[   13.449819] __bio_split_to_limits (block/blk-merge.c:366) 
[   13.450675] blk_mq_submit_bio (block/blk-mq.c:2973) 
[   13.451490] ? kmem_cache_alloc (./include/linux/kmemleak.h:42 mm/slub.c:3802 mm/slub.c:3845 mm/slub.c:3852) 
[   13.452253] ? __mod_node_page_state (mm/vmstat.c:403 (discriminator 2)) 
[   13.453050] submit_bio_noacct_nocheck (./include/linux/bio.h:639 block/blk-core.c:701 block/blk-core.c:729) 
[   13.453897] ? submit_bio_noacct (block/blk-core.c:758 (discriminator 1)) 
[   13.454658] block_read_full_folio (fs/buffer.c:2429 (discriminator 1)) 
[   13.455467] ? __pfx_blkdev_get_block (block/fops.c:409) 
[   13.456240] ? __pfx_blkdev_read_folio (block/fops.c:437) 
[   13.457014] ? __pfx_blkdev_read_folio (block/fops.c:437) 
[   13.457792] filemap_read_folio (mm/filemap.c:2335) 
[   13.458484] do_read_cache_folio (mm/filemap.c:3763) 
[   13.459220] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351) 
[   13.459967] read_part_sector (./include/linux/pagemap.h:970 block/partitions/core.c:715) 
[   13.460602] adfspart_check_ICS (block/partitions/acorn.c:361) 
[   13.461269] ? snprintf (lib/vsprintf.c:2963) 
[   13.461844] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351) 
[   13.462601] bdev_disk_changed (block/partitions/core.c:138 block/partitions/core.c:582 block/partitions/core.c:686 block/partitions/core.c:635) 
[   13.463268] blkdev_get_whole (block/bdev.c:684) 
[   13.463871] bdev_open (block/bdev.c:893) 
[   13.464386] bdev_file_open_by_dev (block/bdev.c:995 block/bdev.c:969) 
[   13.465006] disk_scan_partitions (block/genhd.c:369 (discriminator 1)) 
[   13.465621] device_add_disk (block/genhd.c:512) 
[   13.466199] nvme_scan_ns (drivers/nvme/host/core.c:3807 (discriminator 1) drivers/nvme/host/core.c:3961 (discriminator 1)) nvme_core
[   13.466885] nvme_scan_work (drivers/nvme/host/core.c:4015 drivers/nvme/host/core.c:4105) nvme_core
[   13.467598] process_one_work (kernel/workqueue.c:3254) 
[   13.468186] worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2)) 
[   13.468723] ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4)) 
[   13.469343] ? __pfx_worker_thread (kernel/workqueue.c:3362) 
[   13.469933] kthread (kernel/kthread.c:388) 
[   13.470395] ? __pfx_kthread (kernel/kthread.c:341) 
[   13.470928] ret_from_fork (arch/x86/kernel/process.c:147) 
[   13.471457] ? __pfx_kthread (kernel/kthread.c:341) 
[   13.472008] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   13.472540]  </TASK>
[   13.472865] Modules linked in: crc32c_intel psmouse nvme nvme_core t10_pi crc64_rocksoft crc64 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[   13.474629] ---[ end trace 0000000000000000 ]---
[   13.475229] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1)) 
[ 13.475742] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
All code
========
   0:	5b                   	pop    %rbx
   1:	4c 89 e0             	mov    %r12,%rax
   4:	5d                   	pop    %rbp
   5:	41 5c                	pop    %r12
   7:	41 5d                	pop    %r13
   9:	c3                   	ret
   a:	cc                   	int3
   b:	cc                   	int3
   c:	cc                   	int3
   d:	cc                   	int3
   e:	c7 43 28 00 00 00 00 	movl   $0x0,0x28(%rbx)
  15:	eb db                	jmp    0xfffffffffffffff2
  17:	0f 0b                	ud2
  19:	45 31 e4             	xor    %r12d,%r12d
  1c:	5b                   	pop    %rbx
  1d:	5d                   	pop    %rbp
  1e:	4c 89 e0             	mov    %r12,%rax
  21:	41 5c                	pop    %r12
  23:	41 5d                	pop    %r13
  25:	c3                   	ret
  26:	cc                   	int3
  27:	cc                   	int3
  28:	cc                   	int3
  29:	cc                   	int3
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	0f 0b                	ud2
  2e:	4c 89 e7             	mov    %r12,%rdi
  31:	e8 bf ee ff ff       	call   0xffffffffffffeef5
  36:	eb e1                	jmp    0x19
  38:	66                   	data16
  39:	66                   	data16
  3a:	2e                   	cs
  3b:	0f                   	.byte 0xf
  3c:	1f                   	(bad)
  3d:	84 00                	test   %al,(%rax)
	...

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	0f 0b                	ud2
   4:	4c 89 e7             	mov    %r12,%rdi
   7:	e8 bf ee ff ff       	call   0xffffffffffffeecb
   c:	eb e1                	jmp    0xffffffffffffffef
   e:	66                   	data16
   f:	66                   	data16
  10:	2e                   	cs
  11:	0f                   	.byte 0xf
  12:	1f                   	(bad)
  13:	84 00                	test   %al,(%rax)
	...
[   13.477749] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
[   13.478405] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
[   13.479230] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
[   13.480026] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[   13.480802] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
[   13.481568] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[   13.482337] FS:  0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
[   13.483218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.483853] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
[   13.484584] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   13.485327] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   13.486070] PKRU: 55555554
Hannes Reinecke May 13, 2024, 1:43 p.m. UTC | #9
On 5/12/24 04:43, Luis Chamberlain wrote:
> On Sat, May 11, 2024 at 04:51:32PM -0700, Luis Chamberlain wrote:
>> On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
>>> On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
>>>> We can't just do this, we need to consider the actual nvme cap (test it,
>>>> and if it crashes and below what the page cache supports, then we have
>>>> to go below) and so to make the enablment easier. So we could just move
>>>> this to helper [0]. Then when the bdev cache patch goes through the
>>>> check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
>>>>
>>>> We crash if we go above 1 MiB today, we should be able to go up to 2
>>>> MiB but that requires some review to see what stupid thing is getting
>>>> in the way.
>>>>
>>>> [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
>>>
>>> This is overengineered garbage.  What's the crash?
>>
>> I had only tested it with iomap, I had not tested it with buffer-heads,
>> and so it would require re-testing. It's Saturday 5pm, I should be doing
>> something else other than being on my computer.
> 
> It crashes because we forgot two things in this series below, so the
> change below its us to enable to *at least* boot up to 64k LBA format on
> NVMe.
> 
> One can reproduce this with kdevops with:
> 
> make defconfig-lbs-xfs-bdev-nvme
> make bringup
> make linux
> 
> I've added another defconfig which bumps the LBA format up to 512 KiB to
> see if bootup blows up, that has another defconfig:
> 
> make lbs-xfs-bdev-large-nvme
> make bringup
> make linux
> 
> That at least booted. Note that the above defconfigs use this thread's
> message ID, so it applies this series on top of the min order branch.
> The patch below is just needed.
> 
> I'll try next going above 512 KiB.
>   
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 4f73d23c2c46..fa88e300a946 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2360,8 +2360,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
>   	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>   		limit = inode->i_sb->s_maxbytes;
>   
> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>   	head = folio_create_buffers(folio, inode, 0);
>   	blocksize = head->b_size;
>   
> diff --git a/fs/mpage.c b/fs/mpage.c
> index e3732686e65f..e124c924b2e7 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -178,7 +178,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>   	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
>   
>   	/* MAX_BUF_PER_PAGE, for example */
> -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>   
>   	if (args->is_readahead) {
>   		opf |= REQ_RAHEAD;
> 
Thanks. Will be including that in the next round.

Cheers,

Hannes
Luis Chamberlain May 13, 2024, 2:32 p.m. UTC | #10
On Mon, May 13, 2024 at 03:43:38PM +0200, Hannes Reinecke wrote:
> On 5/12/24 04:43, Luis Chamberlain wrote:
> > On Sat, May 11, 2024 at 04:51:32PM -0700, Luis Chamberlain wrote:
> > > On Sun, May 12, 2024 at 12:30:40AM +0100, Matthew Wilcox wrote:
> > > > On Sat, May 11, 2024 at 04:09:48PM -0700, Luis Chamberlain wrote:
> > > > > We can't just do this, we need to consider the actual nvme cap (test it,
> > > > > and if it crashes and below what the page cache supports, then we have
> > > > > to go below) and so to make the enablment easier. So we could just move
> > > > > this to helper [0]. Then when the bdev cache patch goes through the
> > > > > check for CONFIG_BUFFER_HEAD can be removed, if this goes first.
> > > > > 
> > > > > We crash if we go above 1 MiB today, we should be able to go up to 2
> > > > > MiB but that requires some review to see what stupid thing is getting
> > > > > in the way.
> > > > > 
> > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20240408-lbs-scsi-kludge&id=1f7f4dce548cc11872e977939a872b107c68ad53
> > > > 
> > > > This is overengineered garbage.  What's the crash?
> > > 
> > > I had only tested it with iomap, I had not tested it with buffer-heads,
> > > and so it would require re-testing. It's Saturday 5pm, I should be doing
> > > something else other than being on my computer.
> > 
> > It crashes because we forgot two things in this series below, so the
> > change below its us to enable to *at least* boot up to 64k LBA format on
> > NVMe.
> > 
> > One can reproduce this with kdevops with:
> > 
> > make defconfig-lbs-xfs-bdev-nvme
> > make bringup
> > make linux
> > 
> > I've added another defconfig which bumps the LBA format up to 512 KiB to
> > see if bootup blows up, that has another defconfig:
> > 
> > make lbs-xfs-bdev-large-nvme
> > make bringup
> > make linux
> > 
> > That at least booted. Note that the above defconfigs use this thread's
> > message ID, so it applies this series on top of the min order branch.
> > The patch below is just needed.
> > 
> > I'll try next going above 512 KiB.
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 4f73d23c2c46..fa88e300a946 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2360,8 +2360,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> >   	if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
> >   		limit = inode->i_sb->s_maxbytes;
> > -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> > -
> >   	head = folio_create_buffers(folio, inode, 0);
> >   	blocksize = head->b_size;
> > diff --git a/fs/mpage.c b/fs/mpage.c
> > index e3732686e65f..e124c924b2e7 100644
> > --- a/fs/mpage.c
> > +++ b/fs/mpage.c
> > @@ -178,7 +178,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> >   	gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
> >   	/* MAX_BUF_PER_PAGE, for example */
> > -	VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >   	if (args->is_readahead) {
> >   		opf |= REQ_RAHEAD;
> > 
> Thanks. Will be including that in the next round.

I gave it a spin with the above changes, results:

https://github.com/linux-kdevops/kdevops/commit/f2641efe7d2037892e0477fdc8daf66af59c1f01
https://github.com/linux-kdevops/kdevops-results-archive/commit/c3ac880e909a30aa2f9b24231d0f7297dfdf4e8e

  Luis
Hannes Reinecke May 13, 2024, 4:07 p.m. UTC | #11
On 5/12/24 11:16, Luis Chamberlain wrote:
> On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
>> I'll try next going above 512 KiB.
> 
> At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
> 
> [   13.401651] ------------[ cut here ]------------
> [   13.403298] kernel BUG at block/bio.c:1626!
> [   13.404708] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [   13.406390] CPU: 5 PID: 87 Comm: kworker/u38:1 Not tainted 6.9.0-rc6+ #2
> [   13.408480] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   13.411304] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [   13.412928] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1))
> [ 13.414148] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
> All code
> ========
>     0:	5b                   	pop    %rbx
>     1:	4c 89 e0             	mov    %r12,%rax
>     4:	5d                   	pop    %rbp
>     5:	41 5c                	pop    %r12
>     7:	41 5d                	pop    %r13
>     9:	c3                   	ret
>     a:	cc                   	int3
>     b:	cc                   	int3
>     c:	cc                   	int3
>     d:	cc                   	int3
>     e:	c7 43 28 00 00 00 00 	movl   $0x0,0x28(%rbx)
>    15:	eb db                	jmp    0xfffffffffffffff2
>    17:	0f 0b                	ud2
>    19:	45 31 e4             	xor    %r12d,%r12d
>    1c:	5b                   	pop    %rbx
>    1d:	5d                   	pop    %rbp
>    1e:	4c 89 e0             	mov    %r12,%rax
>    21:	41 5c                	pop    %r12
>    23:	41 5d                	pop    %r13
>    25:	c3                   	ret
>    26:	cc                   	int3
>    27:	cc                   	int3
>    28:	cc                   	int3
>    29:	cc                   	int3
>    2a:*	0f 0b                	ud2		<-- trapping instruction
>    2c:	0f 0b                	ud2
>    2e:	4c 89 e7             	mov    %r12,%rdi
>    31:	e8 bf ee ff ff       	call   0xffffffffffffeef5
>    36:	eb e1                	jmp    0x19
>    38:	66                   	data16
>    39:	66                   	data16
>    3a:	2e                   	cs
>    3b:	0f                   	.byte 0xf
>    3c:	1f                   	(bad)
>    3d:	84 00                	test   %al,(%rax)
> 	...
> 
> Code starting with the faulting instruction
> ===========================================
>     0:	0f 0b                	ud2
>     2:	0f 0b                	ud2
>     4:	4c 89 e7             	mov    %r12,%rdi
>     7:	e8 bf ee ff ff       	call   0xffffffffffffeecb
>     c:	eb e1                	jmp    0xffffffffffffffef
>     e:	66                   	data16
>     f:	66                   	data16
>    10:	2e                   	cs
>    11:	0f                   	.byte 0xf
>    12:	1f                   	(bad)
>    13:	84 00                	test   %al,(%rax)
> 	...
> [   13.420639] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
> [   13.422140] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
> [   13.424128] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
> [   13.426123] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
> [   13.428372] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
> [   13.430636] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [   13.432510] FS:  0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
> [   13.434539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.435987] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
> [   13.437664] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   13.439354] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [   13.440911] PKRU: 55555554
> [   13.441605] Call Trace:
> [   13.442250]  <TASK>
> [   13.442836] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447)
> [   13.443585] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155)
> [   13.444356] ? bio_split (block/bio.c:1626 (discriminator 1))
> [   13.445127] ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176)
> [   13.445975] ? bio_split (block/bio.c:1626 (discriminator 1))
> [   13.446760] ? exc_invalid_op (arch/x86/kernel/traps.c:267)
> [   13.447635] ? bio_split (block/bio.c:1626 (discriminator 1))
> [   13.448325] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
> [   13.449137] ? bio_split (block/bio.c:1626 (discriminator 1))
> [   13.449819] __bio_split_to_limits (block/blk-merge.c:366)
> [   13.450675] blk_mq_submit_bio (block/blk-mq.c:2973)
> [   13.451490] ? kmem_cache_alloc (./include/linux/kmemleak.h:42 mm/slub.c:3802 mm/slub.c:3845 mm/slub.c:3852)
> [   13.452253] ? __mod_node_page_state (mm/vmstat.c:403 (discriminator 2))
> [   13.453050] submit_bio_noacct_nocheck (./include/linux/bio.h:639 block/blk-core.c:701 block/blk-core.c:729)
> [   13.453897] ? submit_bio_noacct (block/blk-core.c:758 (discriminator 1))
> [   13.454658] block_read_full_folio (fs/buffer.c:2429 (discriminator 1))
> [   13.455467] ? __pfx_blkdev_get_block (block/fops.c:409)
> [   13.456240] ? __pfx_blkdev_read_folio (block/fops.c:437)
> [   13.457014] ? __pfx_blkdev_read_folio (block/fops.c:437)
> [   13.457792] filemap_read_folio (mm/filemap.c:2335)
> [   13.458484] do_read_cache_folio (mm/filemap.c:3763)
> [   13.459220] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351)
> [   13.459967] read_part_sector (./include/linux/pagemap.h:970 block/partitions/core.c:715)
> [   13.460602] adfspart_check_ICS (block/partitions/acorn.c:361)
> [   13.461269] ? snprintf (lib/vsprintf.c:2963)
> [   13.461844] ? __pfx_adfspart_check_ICS (block/partitions/acorn.c:351)
> [   13.462601] bdev_disk_changed (block/partitions/core.c:138 block/partitions/core.c:582 block/partitions/core.c:686 block/partitions/core.c:635)
> [   13.463268] blkdev_get_whole (block/bdev.c:684)
> [   13.463871] bdev_open (block/bdev.c:893)
> [   13.464386] bdev_file_open_by_dev (block/bdev.c:995 block/bdev.c:969)
> [   13.465006] disk_scan_partitions (block/genhd.c:369 (discriminator 1))
> [   13.465621] device_add_disk (block/genhd.c:512)
> [   13.466199] nvme_scan_ns (drivers/nvme/host/core.c:3807 (discriminator 1) drivers/nvme/host/core.c:3961 (discriminator 1)) nvme_core
> [   13.466885] nvme_scan_work (drivers/nvme/host/core.c:4015 drivers/nvme/host/core.c:4105) nvme_core
> [   13.467598] process_one_work (kernel/workqueue.c:3254)
> [   13.468186] worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2))
> [   13.468723] ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4))
> [   13.469343] ? __pfx_worker_thread (kernel/workqueue.c:3362)
> [   13.469933] kthread (kernel/kthread.c:388)
> [   13.470395] ? __pfx_kthread (kernel/kthread.c:341)
> [   13.470928] ret_from_fork (arch/x86/kernel/process.c:147)
> [   13.471457] ? __pfx_kthread (kernel/kthread.c:341)
> [   13.472008] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
> [   13.472540]  </TASK>
> [   13.472865] Modules linked in: crc32c_intel psmouse nvme nvme_core t10_pi crc64_rocksoft crc64 virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
> [   13.474629] ---[ end trace 0000000000000000 ]---
> [   13.475229] RIP: 0010:bio_split (block/bio.c:1626 (discriminator 1))
> [ 13.475742] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
> All code
> ========
>     0:	5b                   	pop    %rbx
>     1:	4c 89 e0             	mov    %r12,%rax
>     4:	5d                   	pop    %rbp
>     5:	41 5c                	pop    %r12
>     7:	41 5d                	pop    %r13
>     9:	c3                   	ret
>     a:	cc                   	int3
>     b:	cc                   	int3
>     c:	cc                   	int3
>     d:	cc                   	int3
>     e:	c7 43 28 00 00 00 00 	movl   $0x0,0x28(%rbx)
>    15:	eb db                	jmp    0xfffffffffffffff2
>    17:	0f 0b                	ud2
>    19:	45 31 e4             	xor    %r12d,%r12d
>    1c:	5b                   	pop    %rbx
>    1d:	5d                   	pop    %rbp
>    1e:	4c 89 e0             	mov    %r12,%rax
>    21:	41 5c                	pop    %r12
>    23:	41 5d                	pop    %r13
>    25:	c3                   	ret
>    26:	cc                   	int3
>    27:	cc                   	int3
>    28:	cc                   	int3
>    29:	cc                   	int3
>    2a:*	0f 0b                	ud2		<-- trapping instruction
>    2c:	0f 0b                	ud2
>    2e:	4c 89 e7             	mov    %r12,%rdi
>    31:	e8 bf ee ff ff       	call   0xffffffffffffeef5
>    36:	eb e1                	jmp    0x19
>    38:	66                   	data16
>    39:	66                   	data16
>    3a:	2e                   	cs
>    3b:	0f                   	.byte 0xf
>    3c:	1f                   	(bad)
>    3d:	84 00                	test   %al,(%rax)
> 	...
> 
> Code starting with the faulting instruction
> ===========================================
>     0:	0f 0b                	ud2
>     2:	0f 0b                	ud2
>     4:	4c 89 e7             	mov    %r12,%rdi
>     7:	e8 bf ee ff ff       	call   0xffffffffffffeecb
>     c:	eb e1                	jmp    0xffffffffffffffef
>     e:	66                   	data16
>     f:	66                   	data16
>    10:	2e                   	cs
>    11:	0f                   	.byte 0xf
>    12:	1f                   	(bad)
>    13:	84 00                	test   %al,(%rax)
> 	...
> [   13.477749] RSP: 0018:ffffc056407bb8e0 EFLAGS: 00010246
> [   13.478405] RAX: 0000000000000001 RBX: ffff9d97c165fa80 RCX: ffff9d97e65ce860
> [   13.479230] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9d97c165fa80
> [   13.480026] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
> [   13.480802] R10: ffff9d97c165fa80 R11: ffff9d97c165faf8 R12: ffff9d97df3f7250
> [   13.481568] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [   13.482337] FS:  0000000000000000(0000) GS:ffff9d983bd40000(0000) knlGS:0000000000000000
> [   13.483218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.483853] CR2: 000055b34c8d6188 CR3: 000000011dfb8002 CR4: 0000000000770ef0
> [   13.484584] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   13.485327] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [   13.486070] PKRU: 55555554
> 

Ah. MAX_BUF_PER_PAGE getting in the way.

Can you test with the attached patch?

Cheers,

Hannes
Luis Chamberlain May 13, 2024, 9:05 p.m. UTC | #12
On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
> On 5/12/24 11:16, Luis Chamberlain wrote:
> > On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> > > I'll try next going above 512 KiB.
> > 
> > At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
> > 
> > [   13.401651] ------------[ cut here ]------------
> > [   13.403298] kernel BUG at block/bio.c:1626!
> Ah. MAX_BUF_PER_PAGE getting in the way.
> 
> Can you test with the attached patch?

Nope, same crash:

I've enabled you to easily test this with NVMe on libvirt with kdevops,
please test.

 Luis

[   14.972734] ------------[ cut here ]------------
[   14.974731] kernel BUG at block/bio.c:1626!
[   14.976906] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   14.978899] CPU: 3 PID: 59 Comm: kworker/u36:0 Not tainted 6.9.0-rc6+ #4
[   14.981005] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   14.983782] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[   14.985431] RIP: 0010:bio_split+0xd5/0xf0
[   14.986627] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
[   14.992063] RSP: 0018:ffffbecc002378d0 EFLAGS: 00010246
[   14.993416] RAX: 0000000000000001 RBX: ffff9e2fe8583e40 RCX: ffff9e2fdcb73060
[   14.995181] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9e2fe8583e40
[   14.996960] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
[   14.998715] R10: ffff9e2fe8583e40 R11: ffff9e2fe8583eb8 R12: ffff9e2fe884b750
[   15.000510] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[   15.002128] FS:  0000000000000000(0000) GS:ffff9e303bcc0000(0000) knlGS:0000000000000000
[   15.003956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.005294] CR2: 0000561b2b5ce478 CR3: 0000000102484002 CR4: 0000000000770ef0
[   15.006921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   15.008509] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   15.010001] PKRU: 55555554
[   15.010672] Call Trace:
[   15.011297]  <TASK>
[   15.011868]  ? die+0x32/0x80
[   15.012572]  ? do_trap+0xd9/0x100
[   15.013306]  ? bio_split+0xd5/0xf0
[   15.014051]  ? do_error_trap+0x6a/0x90
[   15.014854]  ? bio_split+0xd5/0xf0
[   15.015597]  ? exc_invalid_op+0x4c/0x60
[   15.016419]  ? bio_split+0xd5/0xf0
[   15.017113]  ? asm_exc_invalid_op+0x16/0x20
[   15.017932]  ? bio_split+0xd5/0xf0
[   15.018624]  __bio_split_to_limits+0x90/0x2d0
[   15.019474]  blk_mq_submit_bio+0x111/0x6a0
[   15.020280]  ? kmem_cache_alloc+0x254/0x2e0
[   15.021040]  submit_bio_noacct_nocheck+0x2f1/0x3d0
[   15.021893]  ? submit_bio_noacct+0x42/0x5b0
[   15.022658]  block_read_full_folio+0x2b7/0x350
[   15.023457]  ? __pfx_blkdev_get_block+0x10/0x10
[   15.024284]  ? __pfx_blkdev_read_folio+0x10/0x10
[   15.025073]  ? __pfx_blkdev_read_folio+0x10/0x10
[   15.025851]  filemap_read_folio+0x32/0xb0
[   15.026540]  do_read_cache_folio+0x108/0x200
[   15.027271]  ? __pfx_adfspart_check_ICS+0x10/0x10
[   15.028066]  read_part_sector+0x32/0xe0
[   15.028701]  adfspart_check_ICS+0x32/0x480
[   15.029334]  ? snprintf+0x49/0x70
[   15.029875]  ? __pfx_adfspart_check_ICS+0x10/0x10
[   15.030592]  bdev_disk_changed+0x2a2/0x6e0
[   15.031226]  blkdev_get_whole+0x5f/0xa0
[   15.031827]  bdev_open+0x201/0x3c0
[   15.032360]  bdev_file_open_by_dev+0xb5/0x110
[   15.032990]  disk_scan_partitions+0x65/0xe0
[   15.033598]  device_add_disk+0x3e0/0x3f0
[   15.034172]  nvme_scan_ns+0x5f0/0xe50 [nvme_core]
[   15.034862]  nvme_scan_work+0x26f/0x5a0 [nvme_core]
[   15.035568]  process_one_work+0x189/0x3b0
[   15.036168]  worker_thread+0x273/0x390
[   15.036713]  ? __pfx_worker_thread+0x10/0x10
[   15.037312]  kthread+0xda/0x110
[   15.037779]  ? __pfx_kthread+0x10/0x10
[   15.038316]  ret_from_fork+0x2d/0x50
[   15.038829]  ? __pfx_kthread+0x10/0x10
[   15.039364]  ret_from_fork_asm+0x1a/0x30
[   15.039924]  </TASK>
Hannes Reinecke May 13, 2024, 11:07 p.m. UTC | #13
On 5/13/24 23:05, Luis Chamberlain wrote:
> On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
>> On 5/12/24 11:16, Luis Chamberlain wrote:
>>> On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
>>>> I'll try next going above 512 KiB.
>>>
>>> At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
>>>
>>> [   13.401651] ------------[ cut here ]------------
>>> [   13.403298] kernel BUG at block/bio.c:1626!
>> Ah. MAX_BUF_PER_PAGE getting in the way.
>>
>> Can you test with the attached patch?
> 
> Nope, same crash:
> 
> I've enabled you to easily test this with NVMe on libvirt with kdevops,
> please test.
> 
>   Luis
> 
> [   14.972734] ------------[ cut here ]------------
> [   14.974731] kernel BUG at block/bio.c:1626!
> [   14.976906] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [   14.978899] CPU: 3 PID: 59 Comm: kworker/u36:0 Not tainted 6.9.0-rc6+ #4
> [   14.981005] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   14.983782] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [   14.985431] RIP: 0010:bio_split+0xd5/0xf0
> [   14.986627] Code: 5b 4c 89 e0 5d 41 5c 41 5d c3 cc cc cc cc c7 43 28 00 00 00 00 eb db 0f 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 4c 89 e7 e8 bf ee ff ff eb e1 66 66 2e 0f 1f 84 00 00
> [   14.992063] RSP: 0018:ffffbecc002378d0 EFLAGS: 00010246
> [   14.993416] RAX: 0000000000000001 RBX: ffff9e2fe8583e40 RCX: ffff9e2fdcb73060
> [   14.995181] RDX: 0000000000000c00 RSI: 0000000000000000 RDI: ffff9e2fe8583e40
> [   14.996960] RBP: 0000000000000000 R08: 0000000000000080 R09: 0000000000000000
> [   14.998715] R10: ffff9e2fe8583e40 R11: ffff9e2fe8583eb8 R12: ffff9e2fe884b750
> [   15.000510] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
> [   15.002128] FS:  0000000000000000(0000) GS:ffff9e303bcc0000(0000) knlGS:0000000000000000
> [   15.003956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.005294] CR2: 0000561b2b5ce478 CR3: 0000000102484002 CR4: 0000000000770ef0
> [   15.006921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   15.008509] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [   15.010001] PKRU: 55555554
> [   15.010672] Call Trace:
> [   15.011297]  <TASK>
> [   15.011868]  ? die+0x32/0x80
> [   15.012572]  ? do_trap+0xd9/0x100
> [   15.013306]  ? bio_split+0xd5/0xf0
> [   15.014051]  ? do_error_trap+0x6a/0x90
> [   15.014854]  ? bio_split+0xd5/0xf0
> [   15.015597]  ? exc_invalid_op+0x4c/0x60
> [   15.016419]  ? bio_split+0xd5/0xf0
> [   15.017113]  ? asm_exc_invalid_op+0x16/0x20
> [   15.017932]  ? bio_split+0xd5/0xf0
> [   15.018624]  __bio_split_to_limits+0x90/0x2d0
> [   15.019474]  blk_mq_submit_bio+0x111/0x6a0
> [   15.020280]  ? kmem_cache_alloc+0x254/0x2e0
> [   15.021040]  submit_bio_noacct_nocheck+0x2f1/0x3d0
> [   15.021893]  ? submit_bio_noacct+0x42/0x5b0
> [   15.022658]  block_read_full_folio+0x2b7/0x350
> [   15.023457]  ? __pfx_blkdev_get_block+0x10/0x10
> [   15.024284]  ? __pfx_blkdev_read_folio+0x10/0x10
> [   15.025073]  ? __pfx_blkdev_read_folio+0x10/0x10
> [   15.025851]  filemap_read_folio+0x32/0xb0
> [   15.026540]  do_read_cache_folio+0x108/0x200
> [   15.027271]  ? __pfx_adfspart_check_ICS+0x10/0x10
> [   15.028066]  read_part_sector+0x32/0xe0
> [   15.028701]  adfspart_check_ICS+0x32/0x480
> [   15.029334]  ? snprintf+0x49/0x70
> [   15.029875]  ? __pfx_adfspart_check_ICS+0x10/0x10
> [   15.030592]  bdev_disk_changed+0x2a2/0x6e0
> [   15.031226]  blkdev_get_whole+0x5f/0xa0
> [   15.031827]  bdev_open+0x201/0x3c0
> [   15.032360]  bdev_file_open_by_dev+0xb5/0x110
> [   15.032990]  disk_scan_partitions+0x65/0xe0
> [   15.033598]  device_add_disk+0x3e0/0x3f0
> [   15.034172]  nvme_scan_ns+0x5f0/0xe50 [nvme_core]
> [   15.034862]  nvme_scan_work+0x26f/0x5a0 [nvme_core]
> [   15.035568]  process_one_work+0x189/0x3b0
> [   15.036168]  worker_thread+0x273/0x390
> [   15.036713]  ? __pfx_worker_thread+0x10/0x10
> [   15.037312]  kthread+0xda/0x110
> [   15.037779]  ? __pfx_kthread+0x10/0x10
> [   15.038316]  ret_from_fork+0x2d/0x50
> [   15.038829]  ? __pfx_kthread+0x10/0x10
> [   15.039364]  ret_from_fork_asm+0x1a/0x30
> [   15.039924]  </TASK>
> 

Ah. So this should fix it:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4e3483a16b75..4fac11edd0c8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -289,7 +289,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,

                 if (nsegs < lim->max_segments &&
                     bytes + bv.bv_len <= max_bytes &&
-                   bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+                   bv.bv_offset + bv.bv_len <= lim->max_segment_size) {
                         nsegs++;
                         bytes += bv.bv_len;
                 } else {

Cheers,

Hannes
Matthew Wilcox May 14, 2024, 5:08 a.m. UTC | #14
On Mon, May 13, 2024 at 06:07:55PM +0200, Hannes Reinecke wrote:
> On 5/12/24 11:16, Luis Chamberlain wrote:
> > On Sat, May 11, 2024 at 07:43:26PM -0700, Luis Chamberlain wrote:
> > > I'll try next going above 512 KiB.
> > 
> > At 1 MiB NVMe LBA format we crash with the BUG_ON(sectors <= 0) on bio_split().
> 
> Ah. MAX_BUFS_PER_PAGE getting in the way.

That doesn't make sense.  We will need something like this patch for
filesystems which use large folios with buffer_heads.  But for a 1MB
block size device, we should have one buffer_head per folio.

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 828c77fa13b7..5f1308daa74f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1963,11 +1963,10 @@  static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 	bool valid = true;
 
 	/*
-	 * The block layer can't support LBA sizes larger than the page size
-	 * or smaller than a sector size yet, so catch this early and don't
-	 * allow block I/O.
+	 * The block layer can't support LBA sizes smaller than a sector size,
+	 * so catch this early and don't allow block I/O.
 	 */
-	if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
+	if (head->lba_shift < SECTOR_SHIFT) {
 		bs = (1 << 9);
 		valid = false;
 	}