Message ID | 20181109005410.GG19305@dastard (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | : dedupe/copy_file_range fixes | expand |
On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > When copy_file_range() is called on files that have been opened with > O_DIRECT, do_splice_direct() does a manual copy of the range one > pipe buffer at a time. The default is 16 pages, which means on > x86_64 it is limited to 64kB IO. This is extremely slow - 64k > synchrnous read/write will run at maybe 5-10MB/s on a spinning disk > and be seek bound. It will be faster on SSDs, but still very > inefficient. > > Increase the pipe size to the maximum allowed user size so that we > can get decent throughput for this highly sub-optimal copy loop. Add > a new function to the pipe code that lets us set the pipe size to > the maximum allowed without root permissions to keep things really > simple. We also don't care if changing the pipe size fails - that > will just result in a slower copy. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/pipe.c | 10 ++++++++++ > fs/splice.c | 7 +++++++ > include/linux/pipe_fs_i.h | 1 + > 3 files changed, 18 insertions(+) > > diff --git a/fs/pipe.c b/fs/pipe.c > index bdc5d3c0977d..436bc0464569 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > return ret; > } > > +/* > + * Set the pipe to the maximum allowable user size. Advisory only, will > + * swallow any errors and return the resultant pipe size. > + */ > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe) > +{ > + pipe_set_size(pipe, pipe_max_size); > + return pipe->buffers * PAGE_SIZE; > +} > + > /* > * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same > * location, so checking ->i_pipe is not enough to verify that this is a > diff --git a/fs/splice.c b/fs/splice.c > index 3553f1956508..9749139da731 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, > current->splice_pipe = pipe; > } > > + /* > + * Try to increase the data holding capacity of the pipe so we can do > + * larger IOs. This may not increase the size at all because maximum > + * user pipe size is administrator controlled, but we still should try. > + */ > + pipe_set_max_safe_size(pipe); I get where you're going with this, but I have two questions: - Is it safe to be enlarging the pipe buffer size unconditionally? - Especially if we didn't just create the splice pipe? Suppose someone comes along later trying to splice things and doesn't realize the pipe is now 1MB... Then I started wondering about the splice_pipe lifetime and couldn't figure out if it ever gets detached from current prior to do_exit. I don't think it does, which means that we're stuck with the 1MB kernel memory allocation until the process dies. --D > + > /* > * Do the splice. > */ > diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h > index 5a3bb3b7c9ad..962ba4cfcb74 100644 > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file); > > int create_pipe_files(struct file **, int); > unsigned int round_pipe_size(unsigned long size); > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe); > > #endif
On Tue, Nov 13, 2018 at 08:22:37AM -0800, Darrick J. Wong wrote: > On Fri, Nov 09, 2018 at 11:54:10AM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > When copy_file_range() is called on files that have been opened with > > O_DIRECT, do_splice_direct() does a manual copy of the range one > > pipe buffer at a time. The default is 16 pages, which means on > > x86_64 it is limited to 64kB IO. This is extremely slow - 64k > > synchrnous read/write will run at maybe 5-10MB/s on a spinning disk > > and be seek bound. It will be faster on SSDs, but still very > > inefficient. > > > > Increase the pipe size to the maximum allowed user size so that we > > can get decent throughput for this highly sub-optimal copy loop. Add > > a new function to the pipe code that lets us set the pipe size to > > the maximum allowed without root permissions to keep things really > > simple. We also don't care if changing the pipe size fails - that > > will just result in a slower copy. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/pipe.c | 10 ++++++++++ > > fs/splice.c | 7 +++++++ > > include/linux/pipe_fs_i.h | 1 + > > 3 files changed, 18 insertions(+) > > > > diff --git a/fs/pipe.c b/fs/pipe.c > > index bdc5d3c0977d..436bc0464569 100644 > > --- a/fs/pipe.c > > +++ b/fs/pipe.c > > @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > > return ret; > > } > > > > +/* > > + * Set the pipe to the maximum allowable user size. Advisory only, will > > + * swallow any errors and return the resultant pipe size. > > + */ > > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe) > > +{ > > + pipe_set_size(pipe, pipe_max_size); > > + return pipe->buffers * PAGE_SIZE; > > +} > > + > > /* > > * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same > > * location, so checking ->i_pipe is not enough to verify that this is a > > diff --git a/fs/splice.c b/fs/splice.c > > index 3553f1956508..9749139da731 100644 > > --- a/fs/splice.c > > +++ b/fs/splice.c > > @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, > > current->splice_pipe = pipe; > > } > > > > + /* > > + * Try to increase the data holding capacity of the pipe so we can do > > + * larger IOs. This may not increase the size at all because maximum > > + * user pipe size is administrator controlled, but we still should try. > > + */ > > + pipe_set_max_safe_size(pipe); > > I get where you're going with this, but I have two questions: > > - Is it safe to be enlarging the pipe buffer size unconditionally? Don't see why it would be unsafe. > - Especially if we didn't just create the splice pipe? Suppose someone > comes along later trying to splice things and doesn't realize the pipe > is now 1MB... The splice code is supposed to handle arbitrary pipe sizes correctly. if something breaks because it has assumptions about how much data a pipe can hold, it's already broken. > Then I started wondering about the splice_pipe lifetime and couldn't > figure out if it ever gets detached from current prior to do_exit. > I don't think it does, which means that we're stuck with the 1MB > kernel memory allocation until the process dies. If you are using do_splice_direct(), you either have a short term process (i.e. a cp type utility) or you are moving bulk data around, in which case 1MB of extra memory isn't a big deal. And given that the default pipe size is dependent on PAGE_SIZE (i.e. the default is 16 pages, not 64kB) then on 64k page architectures we are already using pipes of 1MB capacity by default. I could make this contingent on O_DIRECT, but then we have the problem that 64k pipes aren't big enough for efficient buffered IO with 64k block size filesystems, either. IOWs, the pipe size in do_splice_direct needs to be increased whichever way we look at it. This all said, I really think we need to imlpement our own ->copy_file_range() code that uses iomap to iterate data-only extents (i.e. hole-preserving) copying and does well formed IO. IOWs, only fall back to do_splice_direct() when doing copies to/from non-XFS filesystems. Cheers, Dave.
> +long pipe_set_max_safe_size(struct pipe_inode_info *pipe) > +{ > + pipe_set_size(pipe, pipe_max_size); > + return pipe->buffers * PAGE_SIZE; > +} This should probably return an unsigned value, given that we don't return errors. Then again the callers ignores the return value entirely. Wouldn't it be easier to just call pipe_set_size from splice.c after removing the static marker?
On Thu, Nov 15, 2018 at 02:17:35AM -0800, Christoph Hellwig wrote: > > +long pipe_set_max_safe_size(struct pipe_inode_info *pipe) > > +{ > > + pipe_set_size(pipe, pipe_max_size); > > + return pipe->buffers * PAGE_SIZE; > > +} > > This should probably return an unsigned value, given that we don't return > errors. Then again the callers ignores the return value entirely. > Wouldn't it be easier to just call pipe_set_size from splice.c after > removing the static marker? I didn't want to use pipe_max_size outside of the pipe code. it's definitely simpler without a wrapper if using pipe_max_size directly is acceptable. Cheers, Dave.
diff --git a/fs/pipe.c b/fs/pipe.c index bdc5d3c0977d..436bc0464569 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1109,6 +1109,16 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) return ret; } +/* + * Set the pipe to the maximum allowable user size. Advisory only, will + * swallow any errors and return the resultant pipe size. + */ +long pipe_set_max_safe_size(struct pipe_inode_info *pipe) +{ + pipe_set_size(pipe, pipe_max_size); + return pipe->buffers * PAGE_SIZE; +} + /* * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same * location, so checking ->i_pipe is not enough to verify that this is a diff --git a/fs/splice.c b/fs/splice.c index 3553f1956508..9749139da731 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -931,6 +931,13 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, current->splice_pipe = pipe; } + /* + * Try to increase the data holding capacity of the pipe so we can do + * larger IOs. This may not increase the size at all because maximum + * user pipe size is administrator controlled, but we still should try. + */ + pipe_set_max_safe_size(pipe); + /* * Do the splice. */ diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 5a3bb3b7c9ad..962ba4cfcb74 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -191,5 +191,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file); int create_pipe_files(struct file **, int); unsigned int round_pipe_size(unsigned long size); +long pipe_set_max_safe_size(struct pipe_inode_info *pipe); #endif