Message ID | 20240903054808.126799-1-sunjunchao2870@gmail.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | iomap: clean preallocated blocks in iomap_end() when 0 bytes was written. | expand |
On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote: > Hi, all. > > Recently, syzbot reported a issue as following: > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0 > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > Call Trace: > <TASK> > iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline] > iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391 > xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681 > xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997 > vfs_fallocate+0x553/0x6c0 fs/open.c:334 > ksys_fallocate fs/open.c:357 [inline] > __do_sys_fallocate fs/open.c:365 [inline] > __se_sys_fallocate fs/open.c:363 [inline] > __x64_sys_fallocate+0xbd/0x110 fs/open.c:363 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f2d716a6899 > > syzbot constructed the following scenario: syzbot called the write() > system call, passed an illegal pointer, and attempted to write 0x1017 > bytes, resulting in 0 bytes written and returning EFAULT to user > space. Then, it called the write() system call again, passed another > illegal pointer, and attempted to write 0xfea7 bytes, resulting in > 0xe00 bytes written. Finally called copy_file_range() sys call and > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag. > > What happened here is: during the first write, xfs_buffered_write_iomap_begin() > used preallocated 512 blocks, inserted an extent with a length of 512 and > reserved 512 blocks in the quota, with the iomap length being 1M. Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so it should be zero length and a write of just over 4100 bytes will not trigger speculative prealloc on a zero length file (threshold is 64kB). Indeed, the first speculative prealloc will only be 64kB in size... Hence it's not immediately obvious what precondition is causing this behaviour to occur. > However, when the write failed(0 byte was written), only 0x1017 bytes were > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks. > This caused only 3 blocks to be unreserved for the quota in iomap_end(), > instead of 512, and the corresponding extent information also only removed > 3 blocks instead of 512. Why 3 blocks? what was the filesystem block size? 2kB? This also smells of delayed allocation, because if it were -preallocation- it would be unwritten extents and we don't punch them out on write failures. See my first question.... Regardless, the behaviour is perfectly fine. The remainder of the blocks are beyond EOF, and so have no impact on anything at this point in time. > As a result, during the second write, the iomap length was 3 blocks > instead of the expected 512 blocks, What was the mapping? A new delalloc extent from 0 to 6kB? If so, that is exactly as expected, because there's a hole in the file there. > which ultimately triggered the > issue reported by syzbot in the fallocate() system call. How? We wrote 3584 bytes, which means the first 2 blocks were written and marked dirty, and the third block should have been punched out by the same process that punched the delalloc blocks in the first write. We end up with a file size of 3584 bytes, and a delalloc mapping for those first two blocks followed by a hole, followed by another delalloc extent beyond EOF. There is absolutely nothing incorrect about this state - this is exactly how we want delalloc beyond EOF to be handled. i.e. it remains in place until it is explicitly removed. An normal application would fix the write buffer and rewrite the data, thereby using the space beyond EOF that we've already preallocated. IOWs, this change in this patch is papering over the issue by preventing short writes from leaving extents beyond EOF, rather than working out why either CFR or UNSHARE is going wrong when there are extents beyond EOF. So, onto the copy_file_range() bit. Then CFR was called with a length of 0xffffffffa003e45bul, which is almost 16EB in size. This should result in -EOVERFLOW, because that is out of the range that an loff_t can represent. i.e. generic_copy_file_checks() does this: if (pos_in + count < pos_in || pos_out + count < pos_out) return -EOVERFLOW; and pos_in is a loff_t which is signed. Hence (0 + 15EB) should overflow to a large negative offset and be less than 0. Implicit type casting rules always break my brain - this looks like it is casting everything to unsigned, thereby not actually checking if we're overflowing the max offset the kernel can operate on. This oversize length is not caught by a check against max file size supported by the superblock, either,, because the count gets truncated to EOF before the generic checks against supported maximum file sizes are done. That seems ... wrong. Look at fallocate() - after checking for overflow, it checks offset + len against inode->i_sb->s_maxbytes and returns -EFBIG at that point. IOWs, I think CFR should be doing nothing but returning either -EOVERFLOW of -EFBIG here because the length is way longer than maximum supported file offset for any file operation in Linux. Then unshare should do nothing because the file is not shared and should not have the reflink flag set on it..... However, the strace indicates that: copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584 That it is copying 3584 bytes. i.e. the overflow check is broken. It indicates, however, that the file size is as expected from the the short writes that occurred. Hence the unshare operation should only be operating on the range of 0-3584 bytes because that is the only possible range of the file that is shared. However, the fallocate() call is for offset 0, length 0x2000 bytes (8kB), and these ranges are passed directly from XFS to iomap_file_unshare(). iomap_file_unshare() never checks the range against EOF, either, and so we've passed a range beyond EOF to iomap_file_unshare() for it to process. That seems ... wrong. Hence the unshare does: first map: 0 - 4k - unshare successfully. second map: 4k - 6k - hole, skip. Beyond EOF. third map: 6k - 8k - delalloc, beyond EOF so needs zeroing. Fires warnings because UNSHARE. IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware it is trying to unshare blocks that cannot ever be shared first place because reflink will not share blocks beyond EOF. And because those blocks are beyond EOF, they will always trigger the "need zeroing" case in __iomap_write_begin() and fire warnings. So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need to clamp the range being unshared to EOF. Hence this looks like a couple of contributing issues that need to be fixed: - The CFR overflow checks look broken. - The unshare range is never trimmed to EOF. but it's definitely not a bug caused by short writes leaving delalloc extents beyond EOF. There are many, many ways that we can create delalloc extents beyond EOF before running an UNSHARE operation that can trip over them like this. -Dave.
Just a quick notice that I'm on vacation this week and only doing an absolute minimal amount of work in the mornings, as I said before I'm planning to look into it when I get time. But just as Dave I suspect that the patch is papering over the real issue.
On Wed, 2024-09-04 at 10:35 +1000, Dave Chinner wrote: Hi, Dave. Thanks for your review and comments. > On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote: > > Hi, all. > > > > Recently, syzbot reported a issue as following: > > > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0 > > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > > Call Trace: > > <TASK> > > iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline] > > iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391 > > xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681 > > xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997 > > vfs_fallocate+0x553/0x6c0 fs/open.c:334 > > ksys_fallocate fs/open.c:357 [inline] > > __do_sys_fallocate fs/open.c:365 [inline] > > __se_sys_fallocate fs/open.c:363 [inline] > > __x64_sys_fallocate+0xbd/0x110 fs/open.c:363 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > RIP: 0033:0x7f2d716a6899 > > > > syzbot constructed the following scenario: syzbot called the write() > > system call, passed an illegal pointer, and attempted to write 0x1017 > > bytes, resulting in 0 bytes written and returning EFAULT to user > > space. Then, it called the write() system call again, passed another > > illegal pointer, and attempted to write 0xfea7 bytes, resulting in > > 0xe00 bytes written. Finally called copy_file_range() sys call and > > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag. > > > > What happened here is: during the first write, xfs_buffered_write_iomap_begin() > > used preallocated 512 blocks, inserted an extent with a length of 512 and > > reserved 512 blocks in the quota, with the iomap length being 1M. > > Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so > it should be zero length and a write of just over 4100 bytes will > not trigger speculative prealloc on a zero length file (threshold is > 64kB). Indeed, the first speculative prealloc will only be 64kB in > size... Perhaps my wording was misleading. This might not be actual preallocation, but rather it reached the following statement. if (xfs_has_allocsize(mp)) prealloc_blocks = mp->m_allocsize_blocks; And mp->m_allocsize_blocks is 512. > > Hence it's not immediately obvious what precondition is causing this > behaviour to occur. > > > However, when the write failed(0 byte was written), only 0x1017 bytes were > > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks. > > This caused only 3 blocks to be unreserved for the quota in iomap_end(), > > instead of 512, and the corresponding extent information also only removed > > 3 blocks instead of 512. > > Why 3 blocks? what was the filesystem block size? 2kB? Exactly. GDB observed that sb_blocksize was 2048 and sb_blocklog was 11. > > This also smells of delayed allocation, because if it were > -preallocation- it would be unwritten extents and we don't punch > them out on write failures. See my first question.... > > Regardless, the behaviour is perfectly fine. The remainder of the > blocks are beyond EOF, and so have no impact on anything at this > point in time. > > > As a result, during the second write, the iomap length was 3 blocks > > instead of the expected 512 blocks, > > What was the mapping? A new delalloc extent from 0 to 6kB? If so, > that is exactly as expected, because there's a hole in the file > there. Yeah. There's a delalloc extent from 0 to 6KB. And below is the log of extent changes during the reproduction process before the panic. [ 24.466778][ T2491] write begin for ino 9292 offset 0 count 4119 offset_fsb 0 end_fsb 3 [ 24.467325][ T2491] insert extent pos:0 off:0 count:512 start:4503599627239434 [ 24.467909][ T2491] iomap->offset is 0 iomap->length is 1048576 [ 24.468603][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434 [ 24.469089][ T2491] update extent after pos:0 off:3 count:509 start:4503599627239434 write: Bad address [ 24.469734][ T2491] write begin for ino 9292 offset 0 count 65191 offset_fsb 0 end_fsb 32 [ 24.470206][ T2491] update extent before pos:0 off:3 count:509 start:4503599627239434 [ 24.470673][ T2491] update extent after pos:0 off:0 count:512 start:4503599627239434 [ 24.471110][ T2491] iomap->offset is 0 iomap->length is 6144 [ 24.471916][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434 [ 24.472382][ T2491] update extent after pos:0 off:0 count:2 start:4503599627239429 [ 24.472838][ T2491] insert extent pos:1 off:3 count:509 start:4503599627239430 [ 24.473270][ T2491] write begin for ino 9292 offset 3584 count 61607 offset_fsb 1 end_fsb 32 write ret is 0xe00 [ 24.474517][ T2491] update extent before pos:0 off:0 count:2 start:4503599627239429 [ 24.474994][ T2491] update extent after pos:0 off:0 count:2 start:13 [ 24.476451][ T8] update extent before pos:0 off:0 count:2 start:13 [ 24.476902][ T8] update extent after pos:0 off:0 count:2 start:13 [ 24.478880][ T2491] insert extent pos:0 off:0 count:2 start:13 copy_file_range ret is 0xe00 [ 24.481463][ T2491] write begin for ino 9292 offset 0 count 8192 offset_fsb 0 end_fsb 4 [ 24.481968][ T2491] insert extent pos:0 off:0 count:32 start:4503599627239430 [ 24.482422][ T2491] write begin for ino 9292 offset 4096 count 4096 offset_fsb 2 end_fsb 4 [ 24.482882][ T2491] write begin for ino 9292 offset 6144 count 2048 offset_fsb 3 end_fsb 4 Note the last two logs: fallocate() successfully unshared 4k bytes, then skipped the last 2k bytes because they were not being shared, and then attempted to unshare 2k bytes beyond EOF. > > > which ultimately triggered the > > issue reported by syzbot in the fallocate() system call. > > How? We wrote 3584 bytes, which means the first 2 blocks were > written and marked dirty, and the third block should have been > punched out by the same process that punched the delalloc blocks in > the first write. We end up with a file size of 3584 bytes, and > a delalloc mapping for those first two blocks followed by a hole, > followed by another delalloc extent beyond EOF. > > There is absolutely nothing incorrect about this state - this is > exactly how we want delalloc beyond EOF to be handled. i.e. it > remains in place until it is explicitly removed. An normal > application would fix the write buffer and rewrite the data, thereby > using the space beyond EOF that we've already preallocated. Got it. Thanks for your explanation. > > IOWs, this change in this patch is papering over the issue by > preventing short writes from leaving extents beyond EOF, rather than > working out why either CFR or UNSHARE is going wrong when there are > extents beyond EOF. Yeah, totally agreed. > > So, onto the copy_file_range() bit. > > Then CFR was called with a length of 0xffffffffa003e45bul, which is > almost 16EB in size. This should result in -EOVERFLOW, because that > is out of the range that an loff_t can represent. > > i.e. generic_copy_file_checks() does this: > > if (pos_in + count < pos_in || pos_out + count < pos_out) > return -EOVERFLOW; > > and pos_in is a loff_t which is signed. Hence (0 + 15EB) should > overflow to a large negative offset and be less than 0. Implicit > type casting rules always break my brain - this looks like it is > casting everything to unsigned, thereby not actually checking if > we're overflowing the max offset the kernel can operate on. Yes! The check was supposed to fail here, but it didn't. Maybe what we should do is like this: if (pos_in + (loff_t)count < pos_in || pos_out + (loff_t)count < pos_out) return -EOVERFLOW; My tests showed that CFR returned -EOVERFLOW with the change. A search for the keyword "Ensure offsets don't wrap" shows that there is also this kind of implicit conversion issue in generic_remap_checks(). And xfs_exchange_range_checks() uses check_add_overflow() to check for overflow. Clearly, the latter is a more elegant way of checking. I'm wondering if we have any tools or can we create a tool to check for such implicit conversions. GCC provides -Wconversion, but it reports too many issues, and people might get overwhelmed by such information. Or can we find a method to filter the warning that we really need to fix? > > This oversize length is not caught by a check against max file size > supported by the superblock, either,, because the count gets > truncated to EOF before the generic checks against supported maximum > file sizes are done. > > That seems ... wrong. Look at fallocate() - after checking for > overflow, it checks offset + len against inode->i_sb->s_maxbytes and > returns -EFBIG at that point. In this case, offset is 0 and len is 8192, so the check in fallocate() should passed. > > IOWs, I think CFR should be doing nothing but returning either > -EOVERFLOW of -EFBIG here because the length is way longer than > maximum supported file offset for any file operation in Linux. Then > unshare should do nothing because the file is not shared and should > not have the reflink flag set on it..... > > However, the strace indicates that: > > copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584 > > That it is copying 3584 bytes. i.e. the overflow check is broken. > It indicates, however, that the file size is as expected from the > the short writes that occurred. > > Hence the unshare operation should only be operating on the range of > 0-3584 bytes because that is the only possible range of the file > that is shared. > > However, the fallocate() call is for offset 0, length 0x2000 bytes > (8kB), and these ranges are passed directly from XFS to > iomap_file_unshare(). iomap_file_unshare() never checks the range > against EOF, either, and so we've passed a range beyond EOF to > iomap_file_unshare() for it to process. That seems ... wrong. > > Hence the unshare does: > > first map: 0 - 4k - unshare successfully. > second map: 4k - 6k - hole, skip. Beyond EOF. > third map: 6k - 8k - delalloc, beyond EOF so needs zeroing. > Fires warnings because UNSHARE. Yes. A slight difference: the second mapping was skipped because it wasn't marked with IOMAP_F_SHARED. > > IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware > it is trying to unshare blocks that cannot ever be shared first > place because reflink will not share blocks beyond EOF. And because > those blocks are beyond EOF, they will always trigger the "need > zeroing" case in __iomap_write_begin() and fire warnings. > > So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need > to clamp the range being unshared to EOF. > > Hence this looks like a couple of contributing issues that need to > be fixed: > > - The CFR overflow checks look broken. > - The unshare range is never trimmed to EOF. Exactly. The first can be fixed with using check_add_overflow() macro. The latter one maybe can be fixed with the following patch: diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index f420c53d86ac..8898d5ec606f 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter) /* don't bother with holes or unwritten extents */ if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) return length; + /* don't try to unshare any extents that beyond EOF. */ + if (pos > i_size_read(iter->inode)) + return length; do { struct folio *folio; > > but it's definitely not a bug caused by short writes leaving > delalloc extents beyond EOF. There are many, many ways that > we can create delalloc extents beyond EOF before running an UNSHARE > operation that can trip over them like this. > I see. Thanks again for your detailed and patient explanation. > -Dave. Thanks,
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c index 79a0614eaab7..6e3f6109cac5 100644 --- a/fs/iomap/iter.c +++ b/fs/iomap/iter.c @@ -76,7 +76,8 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops) int ret; if (iter->iomap.length && ops->iomap_end) { - ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter), + ret = ops->iomap_end(iter->inode, iter->pos, + iter->processed > 0 ? iomap_length(iter) : iter->iomap.length, iter->processed > 0 ? iter->processed : 0, iter->flags, &iter->iomap); if (ret < 0 && !iter->processed)
Hi, all. Recently, syzbot reported a issue as following: WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline] WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0 RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline] RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 Call Trace: <TASK> iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline] iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391 xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681 xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997 vfs_fallocate+0x553/0x6c0 fs/open.c:334 ksys_fallocate fs/open.c:357 [inline] __do_sys_fallocate fs/open.c:365 [inline] __se_sys_fallocate fs/open.c:363 [inline] __x64_sys_fallocate+0xbd/0x110 fs/open.c:363 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f2d716a6899 syzbot constructed the following scenario: syzbot called the write() system call, passed an illegal pointer, and attempted to write 0x1017 bytes, resulting in 0 bytes written and returning EFAULT to user space. Then, it called the write() system call again, passed another illegal pointer, and attempted to write 0xfea7 bytes, resulting in 0xe00 bytes written. Finally called copy_file_range() sys call and fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag. What happened here is: during the first write, xfs_buffered_write_iomap_begin() used preallocated 512 blocks, inserted an extent with a length of 512 and reserved 512 blocks in the quota, with the iomap length being 1M. However, when the write failed(0 byte was written), only 0x1017 bytes were passed to iomap_end() instead of the preallocated 1M bytes/512 blocks. This caused only 3 blocks to be unreserved for the quota in iomap_end(), instead of 512, and the corresponding extent information also only removed 3 blocks instead of 512. As a result, during the second write, the iomap length was 3 blocks instead of the expected 512 blocks, which ultimately triggered the issue reported by syzbot in the fallocate() system call. To resolve this issue, when a write fails, we should pass iomap.length to iomap_end() to indicate it to clean up all the resources, rather than just the length requested by the user. This patch has already passed xfstests -g quick test on both xfs and ext4 without causing additional failures. Reported-and-tested-by: syzbot+296b1c84b9cbf306e5a0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=296b1c84b9cbf306e5a0 Fixes: f4b896c213f0 ("iomap: add the new iomap_iter model") Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> --- fs/iomap/iter.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)