Message ID | 20241218022626.3668119-1-mcgrof@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | fs/buffer: strack reduction on async read | expand |
On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote: > This splits up a minor enhancement from the bs > ps device support > series into its own series for better review / focus / testing. > This series just addresses the reducing the array size used and cleaning > up the async read to be easier to read and maintain. How about this approach instead -- get rid of the batch entirely? diff --git a/fs/buffer.c b/fs/buffer.c index cc8452f60251..f50ebbc1f518 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head; size_t blocksize; - int nr, i; + int i, submitted = 0; int fully_mapped = 1; bool page_error = false; loff_t limit = i_size_read(inode); @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) iblock = div_u64(folio_pos(folio), blocksize); lblock = div_u64(limit + blocksize - 1, blocksize); bh = head; - nr = 0; i = 0; do { @@ -2411,40 +2410,30 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; + + lock_buffer(bh); + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + continue; + } + + mark_buffer_async_read(bh); + submit_bh(REQ_OP_READ, bh); + submitted++; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) folio_set_mappedtodisk(folio); - if (!nr) { - /* - * All buffers are uptodate or get_block() returned an - * error when trying to map them - we can finish the read. - */ - folio_end_read(folio, !page_error); - return 0; - } - - /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - lock_buffer(bh); - mark_buffer_async_read(bh); - } - /* - * Stage 3: start the IO. Check for uptodateness - * inside the buffer lock in case another process reading - * the underlying blockdev brought it uptodate (the sct fix). + * All buffers are uptodate or get_block() returned an error + * when trying to map them - we must finish the read because + * end_buffer_async_read() will never be called on any buffer + * in this folio. */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - if (buffer_uptodate(bh)) - end_buffer_async_read(bh, 1); - else - submit_bh(REQ_OP_READ, bh); - } + if (!submitted) + folio_end_read(folio, !page_error); + return 0; } EXPORT_SYMBOL(block_read_full_folio);
On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote: > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote: > > This splits up a minor enhancement from the bs > ps device support > > series into its own series for better review / focus / testing. > > This series just addresses the reducing the array size used and cleaning > > up the async read to be easier to read and maintain. > > How about this approach instead -- get rid of the batch entirely? Less is more! I wish it worked, but we end up with a null pointer on ext4/032 (and indeed this is the test that helped me find most bugs in what I was working on): [ 105.942462] loop0: detected capacity change from 0 to 1342177280 [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 106.036903] #PF: supervisor read access in kernel mode [ 106.038366] #PF: error_code(0x0000) - not-present page [ 106.039819] PGD 0 P4D 0 [ 106.040574] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 106.041967] CPU: 2 UID: 0 PID: 29 Comm: ksoftirqd/2 Not tainted 6.13.0-rc3+ #42 [ 106.044018] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2024.08-1 09/18/2024 [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90 [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09 [ 106.053016] RSP: 0018:ffffa85880137dd0 EFLAGS: 00010246 [ 106.054499] RAX: 0000000000000000 RBX: ffff95e38f22e5b0 RCX: ffff95e39c8753e0 [ 106.056507] RDX: ffff95e3809f8000 RSI: 0000000000000001 RDI: ffff95e38f22e5b0 [ 106.058530] RBP: 0000000000000400 R08: ffff95e3a326b040 R09: 0000000000000001 [ 106.060546] R10: ffffffffbb6070c0 R11: 00000000002dc6c0 R12: 0000000000000000 [ 106.062426] R13: ffff95e3960ea800 R14: ffff95e39586ae40 R15: 0000000000000400 [ 106.064223] FS: 0000000000000000(0000) GS:ffff95e3fbc80000(0000) knlGS:0000000000000000 [ 106.066155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.067473] CR2: 0000000000000000 CR3: 00000001226e2006 CR4: 0000000000772ef0 [ 106.069085] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.070571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.072050] PKRU: 55555554 [ 106.072632] Call Trace: [ 106.073176] <TASK> [ 106.073611] ? __die_body.cold+0x19/0x26 [ 106.074383] ? page_fault_oops+0xa2/0x230 [ 106.075155] ? __smp_call_single_queue+0xa7/0x110 [ 106.076077] ? do_user_addr_fault+0x63/0x640 [ 106.076916] ? exc_page_fault+0x7a/0x190 [ 106.077639] ? asm_exc_page_fault+0x22/0x30 [ 106.078394] ? end_buffer_async_read_io+0x11/0x90 [ 106.079245] end_bio_bh_io_sync+0x23/0x40 [ 106.079973] blk_update_request+0x178/0x420 [ 106.080727] ? finish_task_switch.isra.0+0x94/0x290 [ 106.081600] blk_mq_end_request+0x18/0x30 [ 106.082281] blk_complete_reqs+0x3d/0x50 [ 106.082954] handle_softirqs+0xf9/0x2c0 [ 106.083607] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 106.084393] run_ksoftirqd+0x37/0x50 [ 106.085012] smpboot_thread_fn+0x184/0x220 [ 106.085688] kthread+0xda/0x110 [ 106.086208] ? __pfx_kthread+0x10/0x10 [ 106.086824] ret_from_fork+0x2d/0x50 [ 106.087409] ? __pfx_kthread+0x10/0x10 [ 106.088013] ret_from_fork_asm+0x1a/0x30 [ 106.088658] </TASK> [ 106.089045] Modules linked in: loop sunrpc 9p nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel gf128mul crypto_simd cryptd virtio_balloon 9pnet_virtio virtio_console joydev button evdev serio_raw nvme_fabrics nvme_core dm_mod drm nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover virtio_blk failover crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring [ 106.097895] CR2: 0000000000000000 [ 106.098326] ---[ end trace 0000000000000000 ]--- Luis
On Wed, Dec 18, 2024 at 06:27:36PM -0800, Luis Chamberlain wrote: > On Wed, Dec 18, 2024 at 08:05:29PM +0000, Matthew Wilcox wrote: > > On Tue, Dec 17, 2024 at 06:26:21PM -0800, Luis Chamberlain wrote: > > > This splits up a minor enhancement from the bs > ps device support > > > series into its own series for better review / focus / testing. > > > This series just addresses the reducing the array size used and cleaning > > > up the async read to be easier to read and maintain. > > > > How about this approach instead -- get rid of the batch entirely? > > Less is more! I wish it worked, but we end up with a null pointer on > ext4/032 (and indeed this is the test that helped me find most bugs in > what I was working on): Yeah, I did no testing; just wanted to give people a different approach to consider. > [ 106.034851] BUG: kernel NULL pointer dereference, address: 0000000000000000 > [ 106.046300] RIP: 0010:end_buffer_async_read_io+0x11/0x90 > [ 106.047819] Code: f2 ff 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 53 48 8b 47 10 48 89 fb 48 8b 40 18 <48> 8b 00 f6 40 0d 40 74 0d 0f b7 00 66 25 00 f0 66 3d 00 80 74 09 That decodes as: 5: 53 push %rbx 6: 48 8b 47 10 mov 0x10(%rdi),%rax a: 48 89 fb mov %rdi,%rbx d: 48 8b 40 18 mov 0x18(%rax),%rax 11:* 48 8b 00 mov (%rax),%rax <-- trapping instruction 14: f6 40 0d 40 testb $0x40,0xd(%rax) 6: bh->b_folio d: b_folio->mapping 11: mapping->host So folio->mapping is NULL. Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read test to decide if all buffers on the page are uptodate or not. So both having no batch (ie this patch) and having a batch which is smaller than the number of buffers in the folio can lead to folio_end_read() being called prematurely (ie we'll unlock the folio before finishing reading every buffer in the folio). Once the folio is unlocked, it can be truncated. That's a second-order problem, but it's the one your test happened to hit. This should fix the problem; we always have at least one BH held in the submission path with the async_read flag set, so end_buffer_async_read() will not end it prematurely. By the way, do you have CONFIG_VM_DEBUG enabled in your testing? VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); in folio_end_read() should have tripped before hitting the race with truncate. diff --git a/fs/buffer.c b/fs/buffer.c index cc8452f60251..fd2633e4a5d2 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2361,9 +2361,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) { struct inode *inode = folio->mapping->host; sector_t iblock, lblock; - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; + struct buffer_head *bh, *head, *prev = NULL; size_t blocksize; - int nr, i; + int i; int fully_mapped = 1; bool page_error = false; loff_t limit = i_size_read(inode); @@ -2380,7 +2380,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) iblock = div_u64(folio_pos(folio), blocksize); lblock = div_u64(limit + blocksize - 1, blocksize); bh = head; - nr = 0; i = 0; do { @@ -2411,40 +2410,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block) if (buffer_uptodate(bh)) continue; } - arr[nr++] = bh; + + lock_buffer(bh); + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + continue; + } + + mark_buffer_async_read(bh); + if (prev) + submit_bh(REQ_OP_READ, prev); + prev = bh; } while (i++, iblock++, (bh = bh->b_this_page) != head); if (fully_mapped) folio_set_mappedtodisk(folio); - if (!nr) { - /* - * All buffers are uptodate or get_block() returned an - * error when trying to map them - we can finish the read. - */ - folio_end_read(folio, !page_error); - return 0; - } - - /* Stage two: lock the buffers */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - lock_buffer(bh); - mark_buffer_async_read(bh); - } - /* - * Stage 3: start the IO. Check for uptodateness - * inside the buffer lock in case another process reading - * the underlying blockdev brought it uptodate (the sct fix). + * All buffers are uptodate or get_block() returned an error + * when trying to map them - we must finish the read because + * end_buffer_async_read() will never be called on any buffer + * in this folio. */ - for (i = 0; i < nr; i++) { - bh = arr[i]; - if (buffer_uptodate(bh)) - end_buffer_async_read(bh, 1); - else - submit_bh(REQ_OP_READ, bh); - } + if (prev) + submit_bh(REQ_OP_READ, prev); + else + folio_end_read(folio, !page_error); + return 0; } EXPORT_SYMBOL(block_read_full_folio);
What is this strack that gets reduced here?
On Thu, Dec 19, 2024 at 07:28:27AM +0100, Christoph Hellwig wrote:
> What is this strack that gets reduced here?
s/strack/stack
I seriously need a spell checker as part of my pipeline.
Luis
On Thu, Dec 19, 2024 at 03:51:34AM +0000, Matthew Wilcox wrote: > So folio->mapping is NULL. > > Ah, I see the problem. end_buffer_async_read() uses the buffer_async_read > test to decide if all buffers on the page are uptodate or not. So both > having no batch (ie this patch) and having a batch which is smaller than > the number of buffers in the folio can lead to folio_end_read() being > called prematurely (ie we'll unlock the folio before finishing reading > every buffer in the folio). > > Once the folio is unlocked, it can be truncated. That's a second-order > problem, but it's the one your test happened to hit. > > This should fix the problem; we always have at least one BH held in > the submission path with the async_read flag set, so > end_buffer_async_read() will not end it prematurely. Oh neat, yes. > By the way, do you have CONFIG_VM_DEBUG enabled in your testing? You mean DEBUG_VM ? Yes: grep DEBUG_VM .config CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y CONFIG_DEBUG_VM_IRQSOFF=y CONFIG_DEBUG_VM=y # CONFIG_DEBUG_VM_MAPLE_TREE is not set # CONFIG_DEBUG_VM_RB is not set CONFIG_DEBUG_VM_PGFLAGS=y # CONFIG_DEBUG_VM_PGTABLE is not set > VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); > in folio_end_read() should have tripped before hitting the race with > truncate. Odd that it did not, I had run into that folio_test_locked() splat but in my attempts to simplify this without your trick to only run into the similar truncate race, your resolution to this is nice. > diff --git a/fs/buffer.c b/fs/buffer.c This is a nice resolution and simplification, thanks, I've tested it and passes without regressions on ext4. I'll take this into this series as an alternative. Luis