Message ID | 20200320144014.3276-1-willy@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | iomap: Submit the BIO at the end of each extent | expand |
On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > By definition, an extent covers a range of consecutive blocks, so > it would be quite rare to be able to just add pages to the BIO from > a previous range. The only case we can think of is a mapped extent > followed by a hole extent, followed by another mapped extent which has > been allocated immediately after the first extent. We believe this to > be an unlikely layout for a filesystem to choose and, since the queue > is plugged, those two BIOs would be merged by the block layer. > > The reason we care is that ext2/ext4 choose to lay out blocks 0-11 > consecutively, followed by the indirect block, and we want to merge those > two BIOs. If we don't submit the data BIO before asking the filesystem > for the next extent, then the indirect BIO will be submitted first, > and waited for, leading to inefficient I/O patterns. Buffer heads solve > this with the BH_boundary flag, but iomap doesn't need that as long as > we submit the bio here. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de>
On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > By definition, an extent covers a range of consecutive blocks, so > it would be quite rare to be able to just add pages to the BIO from > a previous range. The only case we can think of is a mapped extent > followed by a hole extent, followed by another mapped extent which has > been allocated immediately after the first extent. We believe this to > be an unlikely layout for a filesystem to choose and, since the queue > is plugged, those two BIOs would be merged by the block layer. > > The reason we care is that ext2/ext4 choose to lay out blocks 0-11 > consecutively, followed by the indirect block, and we want to merge those > two BIOs. If we don't submit the data BIO before asking the filesystem > for the next extent, then the indirect BIO will be submitted first, > and waited for, leading to inefficient I/O patterns. Buffer heads solve > this with the BH_boundary flag, but iomap doesn't need that as long as > we submit the bio here. > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Looks ok, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > fs/iomap/buffered-io.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index f080f542911b..417115bfaf6b 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length, > ctx, iomap, srcmap); > } > > + /* > + * Submitting the bio here leads to better I/O patterns for > + * filesystems which need to do metadata reads to find the > + * next extent. > + */ > + if (ctx->bio) { > + submit_bio(ctx->bio); > + ctx->bio = NULL; > + } > + > return done; > } > > @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages, > } > ret = 0; > done: > - if (ctx.bio) > - submit_bio(ctx.bio); > if (ctx.cur_page) { > if (!ctx.cur_page_in_bio) > unlock_page(ctx.cur_page); > -- > 2.25.1 >
On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > By definition, an extent covers a range of consecutive blocks, so > it would be quite rare to be able to just add pages to the BIO from > a previous range. The only case we can think of is a mapped extent > followed by a hole extent, followed by another mapped extent which has > been allocated immediately after the first extent. We believe this to > be an unlikely layout for a filesystem to choose and, since the queue > is plugged, those two BIOs would be merged by the block layer. > > The reason we care is that ext2/ext4 choose to lay out blocks 0-11 > consecutively, followed by the indirect block, and we want to merge those > two BIOs. If we don't submit the data BIO before asking the filesystem > for the next extent, then the indirect BIO will be submitted first, > and waited for, leading to inefficient I/O patterns. Buffer heads solve > this with the BH_boundary flag, but iomap doesn't need that as long as > we submit the bio here. Hmm, I just received the following stack trace while running generic/418 on a v5 filesystem with 1k blocks: FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020 MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and empty MOUNT_OPTIONS.) [ 32.931667] XFS (sdd): Mounting V5 Filesystem [ 32.940193] XFS (sdd): Ending clean mount [ 32.941350] XFS (sdd): Quotacheck needed: Please wait. [ 32.970724] XFS (sdd): Quotacheck: Done. [ 32.972550] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff) [ 32.985191] XFS (sdd): Unmounting Filesystem [ 33.162426] XFS (sde): EXPERIMENTAL online scrub feature in use. Use at your own risk! [ 33.273185] XFS (sde): Unmounting Filesystem [ 33.517957] XFS (sde): Mounting V5 Filesystem [ 33.526176] XFS (sde): Ending clean mount [ 33.527439] XFS (sde): Quotacheck needed: Please wait. [ 33.566117] XFS (sde): Quotacheck: Done. [ 33.569427] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff) [ 33.656942] run fstests generic/418 at 2020-03-20 14:42:29 [ 36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 36.334254] #PF: supervisor read access in kernel mode [ 36.334849] #PF: error_code(0x0000) - not-present page [ 36.335461] PGD 0 P4D 0 [ 36.335779] Oops: 0000 [#1] PREEMPT SMP [ 36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4 [ 36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 [ 36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 [ 36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 [ 36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a [ 36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 [ 36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c [ 36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 [ 36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 [ 36.345246] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 [ 36.346087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 [ 36.347445] Call Trace: [ 36.347734] iomap_readpages_actor+0x1e3/0x250 [ 36.348699] iomap_apply+0x12c/0x4e3 [ 36.349097] ? iomap_readpage_actor+0x3c0/0x3c0 [ 36.349593] ? prep_new_page+0x3f/0x100 [ 36.350022] ? iomap_readpage_actor+0x3c0/0x3c0 [ 36.350519] iomap_readpages+0xc7/0x2b0 [ 36.350938] ? iomap_readpage_actor+0x3c0/0x3c0 [ 36.351438] read_pages+0x6e/0x1a0 [ 36.351824] __do_page_cache_readahead+0x1c3/0x1e0 [ 36.352343] ondemand_readahead+0x210/0x4b0 [ 36.352797] generic_file_read_iter+0x871/0xcd0 [ 36.353365] ? xfs_file_buffered_aio_read+0x54/0x170 [xfs] [ 36.353982] xfs_file_buffered_aio_read+0x5f/0x170 [xfs] [ 36.354591] xfs_file_read_iter+0xea/0x2a0 [xfs] [ 36.355139] ? xfs_file_write_iter+0xf2/0x1d0 [xfs] [ 36.355668] new_sync_read+0x12d/0x1d0 [ 36.356085] vfs_read+0xa6/0x180 [ 36.356454] ksys_pread64+0x64/0xa0 [ 36.356841] do_syscall_64+0x50/0x1a0 [ 36.357252] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 36.357792] RIP: 0033:0x7f8928524f64 [ 36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49 [ 36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 [ 36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64 [ 36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003 [ 36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001 [ 36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000 [ 36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400 [ 36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet [ 36.366966] Dumping ftrace buffer: [ 36.367351] (ftrace buffer empty) [ 36.367742] CR2: 0000000000000060 [ 36.369050] ---[ end trace d599586d1259866c ]--- [ 36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 [ 36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 [ 36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 [ 36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a [ 36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 [ 36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c [ 36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 [ 36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 [ 36.377322] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 [ 36.378190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 I'll email back if I find anything else but since this is the second Friday in a row of getting STOP SHIP bugs dropped in my lap at 2pm, I doubt I'm going to manage much. --D > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > --- > fs/iomap/buffered-io.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index f080f542911b..417115bfaf6b 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length, > ctx, iomap, srcmap); > } > > + /* > + * Submitting the bio here leads to better I/O patterns for > + * filesystems which need to do metadata reads to find the > + * next extent. > + */ > + if (ctx->bio) { > + submit_bio(ctx->bio); > + ctx->bio = NULL; > + } > + > return done; > } > > @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages, > } > ret = 0; > done: > - if (ctx.bio) > - submit_bio(ctx.bio); > if (ctx.cur_page) { > if (!ctx.cur_page_in_bio) > unlock_page(ctx.cur_page); > -- > 2.25.1 >
On Fri, Mar 20, 2020 at 02:46:54PM -0700, Darrick J. Wong wrote: > Hmm, I just received the following stack trace while running generic/418 > on a v5 filesystem with 1k blocks: I can't persuade this to reproduce on my setup. > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020 > MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd > MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt > > (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and > empty MOUNT_OPTIONS.) FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 bobo-kvm 5.6.0-rc4-00001-g907dfd1bfc6d #2 SMP Fri Mar 20 18:30:53 EDT 2020 MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024 /dev/sdc MOUNT_OPTIONS -- /dev/sdc /mnt/scratch > [ 33.656942] run fstests generic/418 at 2020-03-20 14:42:29 > [ 36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060 > [ 36.334254] #PF: supervisor read access in kernel mode > [ 36.334849] #PF: error_code(0x0000) - not-present page > [ 36.335461] PGD 0 P4D 0 > [ 36.335779] Oops: 0000 [#1] PREEMPT SMP > [ 36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4 > [ 36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 > [ 36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > [ 36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > [ 36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > [ 36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > [ 36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > [ 36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > [ 36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > [ 36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > [ 36.345246] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > [ 36.346087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 > [ 36.347445] Call Trace: > [ 36.347734] iomap_readpages_actor+0x1e3/0x250 > [ 36.348699] iomap_apply+0x12c/0x4e3 > [ 36.349097] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.349593] ? prep_new_page+0x3f/0x100 > [ 36.350022] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.350519] iomap_readpages+0xc7/0x2b0 > [ 36.350938] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.351438] read_pages+0x6e/0x1a0 > [ 36.351824] __do_page_cache_readahead+0x1c3/0x1e0 > [ 36.352343] ondemand_readahead+0x210/0x4b0 > [ 36.352797] generic_file_read_iter+0x871/0xcd0 > [ 36.353365] ? xfs_file_buffered_aio_read+0x54/0x170 [xfs] > [ 36.353982] xfs_file_buffered_aio_read+0x5f/0x170 [xfs] > [ 36.354591] xfs_file_read_iter+0xea/0x2a0 [xfs] > [ 36.355139] ? xfs_file_write_iter+0xf2/0x1d0 [xfs] > [ 36.355668] new_sync_read+0x12d/0x1d0 > [ 36.356085] vfs_read+0xa6/0x180 > [ 36.356454] ksys_pread64+0x64/0xa0 > [ 36.356841] do_syscall_64+0x50/0x1a0 > [ 36.357252] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [ 36.357792] RIP: 0033:0x7f8928524f64 > [ 36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49 > [ 36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 > [ 36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64 > [ 36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003 > [ 36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001 > [ 36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000 > [ 36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400 > [ 36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet > [ 36.366966] Dumping ftrace buffer: > [ 36.367351] (ftrace buffer empty) > [ 36.367742] CR2: 0000000000000060 > [ 36.369050] ---[ end trace d599586d1259866c ]--- > [ 36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > [ 36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > [ 36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > [ 36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > [ 36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > [ 36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > [ 36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > [ 36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > [ 36.377322] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > [ 36.378190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 This doesn't entirely match your backtrace, but this looks wrong: if (ctx->is_readahead) /* same as readahead_gfp_mask */ gfp |= __GFP_NORETRY | __GFP_NOWARN; ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs)); ctx->bio->bi_opf = REQ_OP_READ; If we're specifying NORETRY, then we have to handle bio_alloc() failure here, right? I say it doesn't entirely match because you've got a read from an offset of 0x60, and every access here is a write.
On Sat, Mar 21, 2020 at 07:20:34AM -0700, Matthew Wilcox wrote: > On Fri, Mar 20, 2020 at 02:46:54PM -0700, Darrick J. Wong wrote: > > Hmm, I just received the following stack trace while running generic/418 > > on a v5 filesystem with 1k blocks: > > I can't persuade this to reproduce on my setup. > > > FSTYP -- xfs (debug) > > PLATFORM -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020 > > MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd > > MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt > > > > (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and > > empty MOUNT_OPTIONS.) > > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 bobo-kvm 5.6.0-rc4-00001-g907dfd1bfc6d #2 SMP Fri Mar 20 18:30:53 EDT 2020 > MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024 /dev/sdc > MOUNT_OPTIONS -- /dev/sdc /mnt/scratch > > > [ 33.656942] run fstests generic/418 at 2020-03-20 14:42:29 > > [ 36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060 > > [ 36.334254] #PF: supervisor read access in kernel mode > > [ 36.334849] #PF: error_code(0x0000) - not-present page > > [ 36.335461] PGD 0 P4D 0 > > [ 36.335779] Oops: 0000 [#1] PREEMPT SMP > > [ 36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4 > > [ 36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 > > [ 36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > > [ 36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > > [ 36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > > [ 36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > > [ 36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > > [ 36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > > [ 36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > > [ 36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > > [ 36.345246] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > > [ 36.346087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 > > [ 36.347445] Call Trace: > > [ 36.347734] iomap_readpages_actor+0x1e3/0x250 > > [ 36.348699] iomap_apply+0x12c/0x4e3 > > [ 36.349097] ? iomap_readpage_actor+0x3c0/0x3c0 > > [ 36.349593] ? prep_new_page+0x3f/0x100 > > [ 36.350022] ? iomap_readpage_actor+0x3c0/0x3c0 > > [ 36.350519] iomap_readpages+0xc7/0x2b0 > > [ 36.350938] ? iomap_readpage_actor+0x3c0/0x3c0 > > [ 36.351438] read_pages+0x6e/0x1a0 > > [ 36.351824] __do_page_cache_readahead+0x1c3/0x1e0 > > [ 36.352343] ondemand_readahead+0x210/0x4b0 > > [ 36.352797] generic_file_read_iter+0x871/0xcd0 > > [ 36.353365] ? xfs_file_buffered_aio_read+0x54/0x170 [xfs] > > [ 36.353982] xfs_file_buffered_aio_read+0x5f/0x170 [xfs] > > [ 36.354591] xfs_file_read_iter+0xea/0x2a0 [xfs] > > [ 36.355139] ? xfs_file_write_iter+0xf2/0x1d0 [xfs] > > [ 36.355668] new_sync_read+0x12d/0x1d0 > > [ 36.356085] vfs_read+0xa6/0x180 > > [ 36.356454] ksys_pread64+0x64/0xa0 > > [ 36.356841] do_syscall_64+0x50/0x1a0 > > [ 36.357252] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > [ 36.357792] RIP: 0033:0x7f8928524f64 > > [ 36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49 > > [ 36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 > > [ 36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64 > > [ 36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003 > > [ 36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001 > > [ 36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000 > > [ 36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400 > > [ 36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet > > [ 36.366966] Dumping ftrace buffer: > > [ 36.367351] (ftrace buffer empty) > > [ 36.367742] CR2: 0000000000000060 > > [ 36.369050] ---[ end trace d599586d1259866c ]--- > > [ 36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > > [ 36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > > [ 36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > > [ 36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > > [ 36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > > [ 36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > > [ 36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > > [ 36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > > [ 36.377322] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > > [ 36.378190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 > > This doesn't entirely match your backtrace, but this looks wrong: > > if (ctx->is_readahead) /* same as readahead_gfp_mask */ > gfp |= __GFP_NORETRY | __GFP_NOWARN; > ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs)); > ctx->bio->bi_opf = REQ_OP_READ; > > If we're specifying NORETRY, then we have to handle bio_alloc() failure > here, right? I say it doesn't entirely match because you've got a read > from an offset of 0x60, and every access here is a write. Yes. Interesting that this would only show up with your not exactly related patch. FYI, I'm tempted to just drop the gfp mask adustment vs trying to handle the error, as exiting early from the loop looks pretty nasty to implement..
On Fri, Mar 20, 2020 at 02:46:54PM -0700, Darrick J. Wong wrote: > On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote: > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org> > > > > By definition, an extent covers a range of consecutive blocks, so > > it would be quite rare to be able to just add pages to the BIO from > > a previous range. The only case we can think of is a mapped extent > > followed by a hole extent, followed by another mapped extent which has > > been allocated immediately after the first extent. We believe this to > > be an unlikely layout for a filesystem to choose and, since the queue > > is plugged, those two BIOs would be merged by the block layer. > > > > The reason we care is that ext2/ext4 choose to lay out blocks 0-11 > > consecutively, followed by the indirect block, and we want to merge those > > two BIOs. If we don't submit the data BIO before asking the filesystem > > for the next extent, then the indirect BIO will be submitted first, > > and waited for, leading to inefficient I/O patterns. Buffer heads solve > > this with the BH_boundary flag, but iomap doesn't need that as long as > > we submit the bio here. > > Hmm, I just received the following stack trace while running generic/418 > on a v5 filesystem with 1k blocks: I hit this crash again, though I've figured out how to trigger it reliably. Boot a x64 VM with mem=512M to make memory constrained, and then kick off fstests with 1k block size: FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 alder-mtr00 5.7.0-rc4-djw #rc4 SMP PREEMPT Mon May 4 09:24:50 PDT 2020 MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt and then you get the following: xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff) run fstests generic/418 at 2020-05-04 17:27:51 rm (3338) used greatest stack depth: 11728 bytes left BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 RIP: 0010:iomap_set_range_uptodate+0x5d/0x170 Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87 RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000 RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440 RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00 R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20 R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610 FS: 00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0 Call Trace: ? __raw_spin_lock_init+0x39/0x60 iomap_readpage_actor+0x113/0x3f0 iomap_readpages_actor+0x1dc/0x240 iomap_apply+0x12d/0x4e9 ? iomap_readpage_actor+0x3f0/0x3f0 ? mark_held_locks+0x45/0x70 iomap_readpages+0xc2/0x290 ? iomap_readpage_actor+0x3f0/0x3f0 ? xa_clear_mark+0x30/0x30 read_pages+0x75/0x1b0 __do_page_cache_readahead+0x1bb/0x1d0 ondemand_readahead+0x21a/0x540 ? pagecache_get_page+0x26/0x320 generic_file_read_iter+0x91a/0xd10 ? xfs_file_buffered_aio_read+0x88/0x170 [xfs] xfs_file_buffered_aio_read+0x65/0x170 [xfs] xfs_file_read_iter+0xe9/0x2a0 [xfs] new_sync_read+0x12d/0x1d0 vfs_read+0xc7/0x180 ksys_pread64+0x64/0xa0 do_syscall_64+0x50/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7f209179cbca Digging into this with gcc, the RIP value is: 0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147). 142 143 static void 144 iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len) 145 { 146 struct iomap_page *iop = to_iomap_page(page); 147 struct inode *inode = page->mapping->host; 148 unsigned first = off >> inode->i_blkbits; 149 unsigned last = (off + len - 1) >> inode->i_blkbits; 150 bool uptodate = true; 151 unsigned long flags; So now this makes me wonder, is it possible to be performing readahead into a page that doesn't have page->mapping set yet? I reran this a few times, got crashes in different places, but the common factor is that page->mapping is NULL, and we're doing readhead. I also tried this with the patch *not* applied and had the same problems, so it's not actually this patch. But there's something going wrong in the iomap code... --D > FSTYP -- xfs (debug) > PLATFORM -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020 > MKFS_OPTIONS -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd > MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt > > (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and > empty MOUNT_OPTIONS.) > > [ 32.931667] XFS (sdd): Mounting V5 Filesystem > [ 32.940193] XFS (sdd): Ending clean mount > [ 32.941350] XFS (sdd): Quotacheck needed: Please wait. > [ 32.970724] XFS (sdd): Quotacheck: Done. > [ 32.972550] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff) > [ 32.985191] XFS (sdd): Unmounting Filesystem > [ 33.162426] XFS (sde): EXPERIMENTAL online scrub feature in use. Use at your own risk! > [ 33.273185] XFS (sde): Unmounting Filesystem > [ 33.517957] XFS (sde): Mounting V5 Filesystem > [ 33.526176] XFS (sde): Ending clean mount > [ 33.527439] XFS (sde): Quotacheck needed: Please wait. > [ 33.566117] XFS (sde): Quotacheck: Done. > [ 33.569427] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff) > [ 33.656942] run fstests generic/418 at 2020-03-20 14:42:29 > [ 36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060 > [ 36.334254] #PF: supervisor read access in kernel mode > [ 36.334849] #PF: error_code(0x0000) - not-present page > [ 36.335461] PGD 0 P4D 0 > [ 36.335779] Oops: 0000 [#1] PREEMPT SMP > [ 36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4 > [ 36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 > [ 36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > [ 36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > [ 36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > [ 36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > [ 36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > [ 36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > [ 36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > [ 36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > [ 36.345246] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > [ 36.346087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 > [ 36.347445] Call Trace: > [ 36.347734] iomap_readpages_actor+0x1e3/0x250 > [ 36.348699] iomap_apply+0x12c/0x4e3 > [ 36.349097] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.349593] ? prep_new_page+0x3f/0x100 > [ 36.350022] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.350519] iomap_readpages+0xc7/0x2b0 > [ 36.350938] ? iomap_readpage_actor+0x3c0/0x3c0 > [ 36.351438] read_pages+0x6e/0x1a0 > [ 36.351824] __do_page_cache_readahead+0x1c3/0x1e0 > [ 36.352343] ondemand_readahead+0x210/0x4b0 > [ 36.352797] generic_file_read_iter+0x871/0xcd0 > [ 36.353365] ? xfs_file_buffered_aio_read+0x54/0x170 [xfs] > [ 36.353982] xfs_file_buffered_aio_read+0x5f/0x170 [xfs] > [ 36.354591] xfs_file_read_iter+0xea/0x2a0 [xfs] > [ 36.355139] ? xfs_file_write_iter+0xf2/0x1d0 [xfs] > [ 36.355668] new_sync_read+0x12d/0x1d0 > [ 36.356085] vfs_read+0xa6/0x180 > [ 36.356454] ksys_pread64+0x64/0xa0 > [ 36.356841] do_syscall_64+0x50/0x1a0 > [ 36.357252] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [ 36.357792] RIP: 0033:0x7f8928524f64 > [ 36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49 > [ 36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 > [ 36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64 > [ 36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003 > [ 36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001 > [ 36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000 > [ 36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400 > [ 36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet > [ 36.366966] Dumping ftrace buffer: > [ 36.367351] (ftrace buffer empty) > [ 36.367742] CR2: 0000000000000060 > [ 36.369050] ---[ end trace d599586d1259866c ]--- > [ 36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0 > [ 36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12 > [ 36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246 > [ 36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a > [ 36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000 > [ 36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c > [ 36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50 > [ 36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00 > [ 36.377322] FS: 00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 > [ 36.378190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0 > > I'll email back if I find anything else but since this is the second > Friday in a row of getting STOP SHIP bugs dropped in my lap at 2pm, I > doubt I'm going to manage much. > > --D > > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> > > --- > > fs/iomap/buffered-io.c | 12 ++++++++++-- > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index f080f542911b..417115bfaf6b 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length, > > ctx, iomap, srcmap); > > } > > > > + /* > > + * Submitting the bio here leads to better I/O patterns for > > + * filesystems which need to do metadata reads to find the > > + * next extent. > > + */ > > + if (ctx->bio) { > > + submit_bio(ctx->bio); > > + ctx->bio = NULL; > > + } > > + > > return done; > > } > > > > @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages, > > } > > ret = 0; > > done: > > - if (ctx.bio) > > - submit_bio(ctx.bio); > > if (ctx.cur_page) { > > if (!ctx.cur_page_in_bio) > > unlock_page(ctx.cur_page); > > -- > > 2.25.1 > >
On Mon, May 04, 2020 at 05:37:10PM -0700, Darrick J. Wong wrote: > run fstests generic/418 at 2020-05-04 17:27:51 > rm (3338) used greatest stack depth: 11728 bytes left > BUG: kernel NULL pointer dereference, address: 0000000000000000 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP > CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 > RIP: 0010:iomap_set_range_uptodate+0x5d/0x170 > Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87 > RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206 > RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000 > RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440 > RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00 > R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20 > R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610 > FS: 00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0 > Call Trace: > ? __raw_spin_lock_init+0x39/0x60 > iomap_readpage_actor+0x113/0x3f0 > iomap_readpages_actor+0x1dc/0x240 > iomap_apply+0x12d/0x4e9 > ? iomap_readpage_actor+0x3f0/0x3f0 > ? mark_held_locks+0x45/0x70 > iomap_readpages+0xc2/0x290 > ? iomap_readpage_actor+0x3f0/0x3f0 > ? xa_clear_mark+0x30/0x30 > read_pages+0x75/0x1b0 > __do_page_cache_readahead+0x1bb/0x1d0 > ondemand_readahead+0x21a/0x540 > ? pagecache_get_page+0x26/0x320 > generic_file_read_iter+0x91a/0xd10 > ? xfs_file_buffered_aio_read+0x88/0x170 [xfs] > xfs_file_buffered_aio_read+0x65/0x170 [xfs] > xfs_file_read_iter+0xe9/0x2a0 [xfs] > new_sync_read+0x12d/0x1d0 > vfs_read+0xc7/0x180 > ksys_pread64+0x64/0xa0 > do_syscall_64+0x50/0x1a0 > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > RIP: 0033:0x7f209179cbca > > Digging into this with gcc, the RIP value is: > > 0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147). > 142 > 143 static void > 144 iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len) > 145 { > 146 struct iomap_page *iop = to_iomap_page(page); > 147 struct inode *inode = page->mapping->host; > 148 unsigned first = off >> inode->i_blkbits; > 149 unsigned last = (off + len - 1) >> inode->i_blkbits; > 150 bool uptodate = true; > 151 unsigned long flags; > > So now this makes me wonder, is it possible to be performing readahead > into a page that doesn't have page->mapping set yet? I reran this a few > times, got crashes in different places, but the common factor is that > page->mapping is NULL, and we're doing readhead. > > I also tried this with the patch *not* applied and had the same > problems, so it's not actually this patch. But there's something going > wrong in the iomap code... Thanks for tracking that down! I don't see a way for that to happen. The page is originally allocated in __do_page_cache_readahead() and (in 5.7) does not have page->mapping set. Instead, it gets put on the page_pool list head which gets passed into iomap_readpages(). iomap_next_page() calls add_to_page_cache_lru() which either sets page->mapping or returns an error. So I don't see how iomap_next_page() can give us a page which doesn't have ->mapping set. Is it possible that it's the second dereference, not the first that's NULL? ie mapping->host is NULL?
On Mon, May 04, 2020 at 07:24:15PM -0700, Matthew Wilcox wrote: > On Mon, May 04, 2020 at 05:37:10PM -0700, Darrick J. Wong wrote: > > run fstests generic/418 at 2020-05-04 17:27:51 > > rm (3338) used greatest stack depth: 11728 bytes left > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] PREEMPT SMP > > CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 > > RIP: 0010:iomap_set_range_uptodate+0x5d/0x170 > > Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87 > > RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206 > > RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000 > > RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440 > > RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00 > > R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20 > > R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610 > > FS: 00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0 > > Call Trace: > > ? __raw_spin_lock_init+0x39/0x60 > > iomap_readpage_actor+0x113/0x3f0 > > iomap_readpages_actor+0x1dc/0x240 > > iomap_apply+0x12d/0x4e9 > > ? iomap_readpage_actor+0x3f0/0x3f0 > > ? mark_held_locks+0x45/0x70 > > iomap_readpages+0xc2/0x290 > > ? iomap_readpage_actor+0x3f0/0x3f0 > > ? xa_clear_mark+0x30/0x30 > > read_pages+0x75/0x1b0 > > __do_page_cache_readahead+0x1bb/0x1d0 > > ondemand_readahead+0x21a/0x540 > > ? pagecache_get_page+0x26/0x320 > > generic_file_read_iter+0x91a/0xd10 > > ? xfs_file_buffered_aio_read+0x88/0x170 [xfs] > > xfs_file_buffered_aio_read+0x65/0x170 [xfs] > > xfs_file_read_iter+0xe9/0x2a0 [xfs] > > new_sync_read+0x12d/0x1d0 > > vfs_read+0xc7/0x180 > > ksys_pread64+0x64/0xa0 > > do_syscall_64+0x50/0x1a0 > > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > > RIP: 0033:0x7f209179cbca > > > > Digging into this with gcc, the RIP value is: > > > > 0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147). > > 142 > > 143 static void > > 144 iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len) > > 145 { > > 146 struct iomap_page *iop = to_iomap_page(page); > > 147 struct inode *inode = page->mapping->host; > > 148 unsigned first = off >> inode->i_blkbits; > > 149 unsigned last = (off + len - 1) >> inode->i_blkbits; > > 150 bool uptodate = true; > > 151 unsigned long flags; > > > > So now this makes me wonder, is it possible to be performing readahead > > into a page that doesn't have page->mapping set yet? I reran this a few > > times, got crashes in different places, but the common factor is that > > page->mapping is NULL, and we're doing readhead. > > > > I also tried this with the patch *not* applied and had the same > > problems, so it's not actually this patch. But there's something going > > wrong in the iomap code... > > Thanks for tracking that down! I don't see a way for that to happen. > The page is originally allocated in __do_page_cache_readahead() and > (in 5.7) does not have page->mapping set. Instead, it gets put on > the page_pool list head which gets passed into iomap_readpages(). > iomap_next_page() calls add_to_page_cache_lru() which either sets > page->mapping or returns an error. So I don't see how iomap_next_page() > can give us a page which doesn't have ->mapping set. > > Is it possible that it's the second dereference, not the first that's > NULL? ie mapping->host is NULL? Hmm, that's possible too. I haven't gotten around (it's 19:58 here) to digging further into the disassembly to figure out which pointer deference it really is. --D
On Mon, May 04, 2020 at 07:59:02PM -0700, Darrick J. Wong wrote: > On Mon, May 04, 2020 at 07:24:15PM -0700, Matthew Wilcox wrote: > > On Mon, May 04, 2020 at 05:37:10PM -0700, Darrick J. Wong wrote: > > > run fstests generic/418 at 2020-05-04 17:27:51 > > > rm (3338) used greatest stack depth: 11728 bytes left > > > BUG: kernel NULL pointer dereference, address: 0000000000000000 > > > #PF: supervisor read access in kernel mode > > > #PF: error_code(0x0000) - not-present page > > > PGD 0 P4D 0 > > > Oops: 0000 [#1] PREEMPT SMP > > > CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4 > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 > > > RIP: 0010:iomap_set_range_uptodate+0x5d/0x170 > > > Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87 > > > RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206 > > > RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000 > > > RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440 > > > RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00 > > > R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20 > > > R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610 > > > FS: 00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0 > > > Call Trace: > > > ? __raw_spin_lock_init+0x39/0x60 > > > iomap_readpage_actor+0x113/0x3f0 > > > iomap_readpages_actor+0x1dc/0x240 > > > iomap_apply+0x12d/0x4e9 > > > ? iomap_readpage_actor+0x3f0/0x3f0 > > > ? mark_held_locks+0x45/0x70 > > > iomap_readpages+0xc2/0x290 > > > ? iomap_readpage_actor+0x3f0/0x3f0 > > > ? xa_clear_mark+0x30/0x30 > > > read_pages+0x75/0x1b0 > > > __do_page_cache_readahead+0x1bb/0x1d0 > > > ondemand_readahead+0x21a/0x540 > > > ? pagecache_get_page+0x26/0x320 > > > generic_file_read_iter+0x91a/0xd10 > > > ? xfs_file_buffered_aio_read+0x88/0x170 [xfs] > > > xfs_file_buffered_aio_read+0x65/0x170 [xfs] > > > xfs_file_read_iter+0xe9/0x2a0 [xfs] > > > new_sync_read+0x12d/0x1d0 > > > vfs_read+0xc7/0x180 > > > ksys_pread64+0x64/0xa0 > > > do_syscall_64+0x50/0x1a0 > > > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > > > RIP: 0033:0x7f209179cbca > > > > > > Digging into this with gcc, the RIP value is: > > > > > > 0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147). > > > 142 > > > 143 static void > > > 144 iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len) > > > 145 { > > > 146 struct iomap_page *iop = to_iomap_page(page); > > > 147 struct inode *inode = page->mapping->host; > > > 148 unsigned first = off >> inode->i_blkbits; > > > 149 unsigned last = (off + len - 1) >> inode->i_blkbits; > > > 150 bool uptodate = true; > > > 151 unsigned long flags; > > > > > > So now this makes me wonder, is it possible to be performing readahead > > > into a page that doesn't have page->mapping set yet? I reran this a few > > > times, got crashes in different places, but the common factor is that > > > page->mapping is NULL, and we're doing readhead. > > > > > > I also tried this with the patch *not* applied and had the same > > > problems, so it's not actually this patch. But there's something going > > > wrong in the iomap code... > > > > Thanks for tracking that down! I don't see a way for that to happen. > > The page is originally allocated in __do_page_cache_readahead() and > > (in 5.7) does not have page->mapping set. Instead, it gets put on > > the page_pool list head which gets passed into iomap_readpages(). > > iomap_next_page() calls add_to_page_cache_lru() which either sets > > page->mapping or returns an error. So I don't see how iomap_next_page() > > can give us a page which doesn't have ->mapping set. > > > > Is it possible that it's the second dereference, not the first that's > > NULL? ie mapping->host is NULL? > > Hmm, that's possible too. I haven't gotten around (it's 19:58 here) to > digging further into the disassembly to figure out which pointer > deference it really is. To summarize the live-debugging session willy and I just had on irc, this doesn't quite work because an extremely fragmented extent map on a blocksize < pagesize filesystem can cause problems: Thread A gets the mapping for the first block, increments read_count, and issues the IO. The IO completes immediately, so we decrement read_count, and since there are no other readers, we unlock the page. A's readahead context still points to the page. Meanwhile, a directio write in thread B wanders in and invalidates the page cache, which unmaps the page. Uhoh... Next, thread A continues its readahead loop. It gets the mapping for the second block and calls iomap_readpages_actor with the page that is still pointed to by the readahead context ... which if we're lucky is now unmapped and unlocked, and hasn't been reused yet. --D > --D
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index f080f542911b..417115bfaf6b 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length, ctx, iomap, srcmap); } + /* + * Submitting the bio here leads to better I/O patterns for + * filesystems which need to do metadata reads to find the + * next extent. + */ + if (ctx->bio) { + submit_bio(ctx->bio); + ctx->bio = NULL; + } + return done; } @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages, } ret = 0; done: - if (ctx.bio) - submit_bio(ctx.bio); if (ctx.cur_page) { if (!ctx.cur_page_in_bio) unlock_page(ctx.cur_page);