diff mbox series

fs: generic_copy_file_checks: Do not adjust count based on file size

Message ID 20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid (mailing list archive)
State New, archived
Headers show
Series fs: generic_copy_file_checks: Do not adjust count based on file size | expand

Commit Message

Nicolas Boichat Jan. 26, 2021, 5:50 a.m. UTC
copy_file_range (which calls generic_copy_file_checks) uses the
inode file size to adjust the copy count parameter. This breaks
with special filesystems like procfs/sysfs, where the file size
appears to be zero, but content is actually returned when a read
operation is performed.

This commit ignores the source file size, and makes copy_file_range
match the end of file behaviour documented in POSIX's "read",
where 0 is returned to mark EOF. This would allow "cp" and other
standard tools to make use of copy_file_range with the exact same
behaviour as they had in the past.

Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
---
This can be reproduced with this simple test case:
 #define _GNU_SOURCE
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/stat.h>
 #include <unistd.h>

 int
 main(int argc, char **argv)
 {
   int fd_in, fd_out;
   loff_t ret;

   fd_in = open("/proc/version", O_RDONLY);
   fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);

   do {
     ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
     printf("%d bytes copied\n", (int)ret);
   } while (ret > 0);

   return 0;
 }

Without this patch, `version` output file is empty, and no bytes
are copied:
0 bytes copied

With this patch, the loop runs twice and the content of the file
is copied:
315 bytes copied
0 bytes copied

We hit this issue when upgrading Go compiler from 1.13 to 1.15 [1],
as we use Go's `io.Copy` to copy the content of
`/sys/kernel/debug/tracing/trace` to a temporary file.

Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
copy operation. However, that fails to copy any content when the input
file is from sysfs/tracefs, with an apparent size of 0 (but there is
still content when you `cat` it, of course).

[1] http://issuetracker.google.com/issues/178332739

 fs/read_write.c | 8 --------
 1 file changed, 8 deletions(-)

Comments

Nicolas Boichat Jan. 28, 2021, 12:46 a.m. UTC | #1
On Wed, Jan 27, 2021 at 7:38 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> > copy_file_range (which calls generic_copy_file_checks) uses the
> > inode file size to adjust the copy count parameter. This breaks
> > with special filesystems like procfs/sysfs, where the file size
> > appears to be zero, but content is actually returned when a read
> > operation is performed.
> >
> > This commit ignores the source file size, and makes copy_file_range
> > match the end of file behaviour documented in POSIX's "read",
> > where 0 is returned to mark EOF. This would allow "cp" and other
> > standard tools to make use of copy_file_range with the exact same
> > behaviour as they had in the past.
> >
> > Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
>
> Nack.

Thanks Dave and Al for the detailed explanations.

>
> As I've explained, this is intentional and bypassing it is not a
> work around for enabling cfr on filesystems that produce ephemeral,
> volatile read-once data using seq-file pipes that masquerade as
> regular files with zero size. These files are behaving like pipes
> and only work because the VFS has to support read() and friends from
> pipes that don't publish the amount of data they contain to the VFS
> inode.
>
> copy_file_range() does not support such behaviour.
>
> copy_file_range() -writes- data, so we have to check that those
> writes do not extend past boundaries that the destination inode
> imposes on the operation. e.g. maximum offset limits, whether the
> ranges overlap in the same file, etc.
>
> Hence we need to know how much data there is present to copy before
> we can check if it is safe to perform the -write- of the data we are
> going to read. Hence we cannot safely support data sources that
> cannot tell us how much data is present before we start the copy
> operation.
>
> IOWs, these source file EOF restrictions are required by the write
> side of copy_file_range(), not the read side.
>
> > ---
> > This can be reproduced with this simple test case:
> >  #define _GNU_SOURCE
> >  #include <fcntl.h>
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <sys/stat.h>
> >  #include <unistd.h>
> >
> >  int
> >  main(int argc, char **argv)
> >  {
> >    int fd_in, fd_out;
> >    loff_t ret;
> >
> >    fd_in = open("/proc/version", O_RDONLY);
> >    fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
> >
> >    do {
> >      ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
> >      printf("%d bytes copied\n", (int)ret);
> >    } while (ret > 0);
> >
> >    return 0;
> >  }
> >
> > Without this patch, `version` output file is empty, and no bytes
> > are copied:
> > 0 bytes copied
>
> $ ls -l /proc/version
> -r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
> $
>
> It's a zero length file.
>
> sysfs does this just fine - it's regular files have a size of
> at least PAGE_SIZE rather than zero, and so copy_file_range works
> just fine on them:
>
> $ ls -l /sys/block/nvme0n1/capability
> -r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
> $ cat /sys/block/nvme0n1/capability
> 50
> $ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
> $ sudo cat /tmp/foo
> 50
>
> And the behaviour is exactly as you'd expect a read() loop to copy
> the file to behave:
>
> openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
> ....
> openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
> copy_file_range(4, [0], 3, [0], 4096, 0) = 3
> copy_file_range(4, [3], 3, [3], 4093, 0) = 0
> close(4)
>
> See? Inode size of 4096 means there's a maximum of 4kB of data that
> can be read from this file.  copy_file_range() now behaves exactly
> as read() would, returning a short copy and then 0 bytes to indicate
> EOF.

Unless the content happens to be larger than PAGE_SIZE, then
copy_file_range would only copy the beginning of the file. And as Al
explained, this will still break in case of short writes.

>
> If you want ephemeral data pipes masquerading as regular files to
> work with copy_file_range, then the filesystem implementation needs
> to provide the VFS with a data size that indicates the maximum
> amount of data that the pipe can produce in a continuous read loop.
> Otherwise we cannot validate the range of the write we may be asked
> to perform...
>
> > Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> > copy operation. However, that fails to copy any content when the input
> > file is from sysfs/tracefs, with an apparent size of 0 (but there is
> > still content when you `cat` it, of course).
>
> Libraries using copy_file_range() must be prepared for it to fail
> and fall back to normal copy mechanisms.

How is userspace suppose to detect that? (checking for 0 file size
won't work with the example above)

> Of course, with these
> special zero length files that contain ephemeral data, userspace can't
> actually tell that they contain data from userspace using stat(). So
> as far as userspace is concerned, copy_file_range() correctly
> returned zero bytes copied from a zero byte long file and there's
> nothing more to do.
>
> This zero length file behaviour is, fundamentally, a kernel
> filesystem implementation bug, not a copy_file_range() bug.

Okay, so, based on this and Al's reply, I see 2 things we can do:
 1. Go should probably not use copy_file_range in a common library
function, as I don't see any easy way to detect this scenario
currently (detect 0 size? sure, but that won't work with the example
you provide above). And the man page should document this behaviour
more explicitly to prevent further incorrect usage.
 2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
copy_file_range? (based on Al's reply, there seems to be no way to
implement it correctly as seeking in such files will not work in case
of short writes)

Thanks,

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Jan. 28, 2021, 5:57 a.m. UTC | #2
On Thu, Jan 28, 2021 at 08:46:04AM +0800, Nicolas Boichat wrote:
> On Wed, Jan 27, 2021 at 7:38 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> > > copy_file_range (which calls generic_copy_file_checks) uses the
> > > inode file size to adjust the copy count parameter. This breaks
> > > with special filesystems like procfs/sysfs, where the file size
> > > appears to be zero, but content is actually returned when a read
> > > operation is performed.
> > >
> > > This commit ignores the source file size, and makes copy_file_range
> > > match the end of file behaviour documented in POSIX's "read",
> > > where 0 is returned to mark EOF. This would allow "cp" and other
> > > standard tools to make use of copy_file_range with the exact same
> > > behaviour as they had in the past.
> > >
> > > Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
> >
> > Nack.
> 
> Thanks Dave and Al for the detailed explanations.
> 
> >
> > As I've explained, this is intentional and bypassing it is not a
> > work around for enabling cfr on filesystems that produce ephemeral,
> > volatile read-once data using seq-file pipes that masquerade as
> > regular files with zero size. These files are behaving like pipes
> > and only work because the VFS has to support read() and friends from
> > pipes that don't publish the amount of data they contain to the VFS
> > inode.
> >
> > copy_file_range() does not support such behaviour.
> >
> > copy_file_range() -writes- data, so we have to check that those
> > writes do not extend past boundaries that the destination inode
> > imposes on the operation. e.g. maximum offset limits, whether the
> > ranges overlap in the same file, etc.
> >
> > Hence we need to know how much data there is present to copy before
> > we can check if it is safe to perform the -write- of the data we are
> > going to read. Hence we cannot safely support data sources that
> > cannot tell us how much data is present before we start the copy
> > operation.
> >
> > IOWs, these source file EOF restrictions are required by the write
> > side of copy_file_range(), not the read side.
> >
> > > ---
> > > This can be reproduced with this simple test case:
> > >  #define _GNU_SOURCE
> > >  #include <fcntl.h>
> > >  #include <stdio.h>
> > >  #include <stdlib.h>
> > >  #include <sys/stat.h>
> > >  #include <unistd.h>
> > >
> > >  int
> > >  main(int argc, char **argv)
> > >  {
> > >    int fd_in, fd_out;
> > >    loff_t ret;
> > >
> > >    fd_in = open("/proc/version", O_RDONLY);
> > >    fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
> > >
> > >    do {
> > >      ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
> > >      printf("%d bytes copied\n", (int)ret);
> > >    } while (ret > 0);
> > >
> > >    return 0;
> > >  }
> > >
> > > Without this patch, `version` output file is empty, and no bytes
> > > are copied:
> > > 0 bytes copied
> >
> > $ ls -l /proc/version
> > -r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
> > $
> >
> > It's a zero length file.
> >
> > sysfs does this just fine - it's regular files have a size of
> > at least PAGE_SIZE rather than zero, and so copy_file_range works
> > just fine on them:
> >
> > $ ls -l /sys/block/nvme0n1/capability
> > -r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
> > $ cat /sys/block/nvme0n1/capability
> > 50
> > $ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
> > $ sudo cat /tmp/foo
> > 50
> >
> > And the behaviour is exactly as you'd expect a read() loop to copy
> > the file to behave:
> >
> > openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
> > ....
> > openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
> > copy_file_range(4, [0], 3, [0], 4096, 0) = 3
> > copy_file_range(4, [3], 3, [3], 4093, 0) = 0
> > close(4)
> >
> > See? Inode size of 4096 means there's a maximum of 4kB of data that
> > can be read from this file.  copy_file_range() now behaves exactly
> > as read() would, returning a short copy and then 0 bytes to indicate
> > EOF.
> 
> Unless the content happens to be larger than PAGE_SIZE, then
> copy_file_range would only copy the beginning of the file. And as Al
> explained, this will still break in case of short writes.
> 
> >
> > If you want ephemeral data pipes masquerading as regular files to
> > work with copy_file_range, then the filesystem implementation needs
> > to provide the VFS with a data size that indicates the maximum
> > amount of data that the pipe can produce in a continuous read loop.
> > Otherwise we cannot validate the range of the write we may be asked
> > to perform...
> >
> > > Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> > > copy operation. However, that fails to copy any content when the input
> > > file is from sysfs/tracefs, with an apparent size of 0 (but there is
> > > still content when you `cat` it, of course).
> >
> > Libraries using copy_file_range() must be prepared for it to fail
> > and fall back to normal copy mechanisms.
> 
> How is userspace suppose to detect that? (checking for 0 file size
> won't work with the example above)
> 
> > Of course, with these
> > special zero length files that contain ephemeral data, userspace can't
> > actually tell that they contain data from userspace using stat(). So
> > as far as userspace is concerned, copy_file_range() correctly
> > returned zero bytes copied from a zero byte long file and there's
> > nothing more to do.
> >
> > This zero length file behaviour is, fundamentally, a kernel
> > filesystem implementation bug, not a copy_file_range() bug.
> 
> Okay, so, based on this and Al's reply, I see 2 things we can do:
>  1. Go should probably not use copy_file_range in a common library
> function, as I don't see any easy way to detect this scenario
> currently (detect 0 size? sure, but that won't work with the example
> you provide above). And the man page should document this behaviour
> more explicitly to prevent further incorrect usage.
>  2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
> copy_file_range? (based on Al's reply, there seems to be no way to
> implement it correctly as seeking in such files will not work in case
> of short writes)

One /could/ make those three provide a phony CFR implementation that
would return -EOPNOTSUPP, though like others have said, it's weird to
have regular files that aren't quite regular.  Not sure where that
leaves them, though...

--D

> 
> Thanks,
> 
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
Nicolas Boichat Feb. 12, 2021, 4:48 a.m. UTC | #3
On Thu, Jan 28, 2021 at 1:57 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jan 28, 2021 at 08:46:04AM +0800, Nicolas Boichat wrote:
[snip]
> > Okay, so, based on this and Al's reply, I see 2 things we can do:
> >  1. Go should probably not use copy_file_range in a common library
> > function, as I don't see any easy way to detect this scenario
> > currently (detect 0 size? sure, but that won't work with the example
> > you provide above). And the man page should document this behaviour
> > more explicitly to prevent further incorrect usage.
> >  2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
> > copy_file_range? (based on Al's reply, there seems to be no way to
> > implement it correctly as seeking in such files will not work in case
> > of short writes)
>
> One /could/ make those three provide a phony CFR implementation that
> would return -EOPNOTSUPP, though like others have said, it's weird to
> have regular files that aren't quite regular.  Not sure where that
> leaves them, though...

Not that simple, as the issue happens on cross-filesystem operations
where file_operations->copy_file_range is not called (and also, that'd
require modifying operations for every single generated file...

Anyway, made an attempt here:
https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/T/#t

> --D
>
> >
> > Thanks,
> >
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
diff mbox series

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..7236146f6ad7 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1424,7 +1424,6 @@  static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
 	uint64_t count = *req_count;
-	loff_t size_in;
 	int ret;
 
 	ret = generic_file_rw_checks(file_in, file_out);
@@ -1442,13 +1441,6 @@  static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 	if (pos_in + count < pos_in || pos_out + count < pos_out)
 		return -EOVERFLOW;
 
-	/* Shorten the copy to EOF */
-	size_in = i_size_read(inode_in);
-	if (pos_in >= size_in)
-		count = 0;
-	else
-		count = min(count, size_in - (uint64_t)pos_in);
-
 	ret = generic_write_check_limits(file_out, pos_out, &count);
 	if (ret)
 		return ret;