diff mbox series

fs/splice: don't block splice_direct_to_actor() after data was read

Message ID 20230919081259.1094971-1-max.kellermann@ionos.com (mailing list archive)
State New, archived
Headers show
Series fs/splice: don't block splice_direct_to_actor() after data was read | expand

Commit Message

Max Kellermann Sept. 19, 2023, 8:12 a.m. UTC
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long.  I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
  writev(4, [], 0) = 0 <0.000439>
  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
  close(12) = 0 <0.000063>
  close(4)  = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

 https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls.  This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
  only whether pipe operations are non-blocking, not whether
  file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice().  It translates to IOCB_NOWAIT for regular files.  For now, I
have documented the flag to be kernel-internal with a high bit, like
io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
system call ABI may be a good idea as well.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  6 ++++++
 2 files changed, 20 insertions(+)

Comments

Christian Brauner Sept. 19, 2023, 2:18 p.m. UTC | #1
[+Cc Jens]

On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
> 
> Usually, that is not a problem, because the socket write buffer gets
> filled quickly, and if the socket is non-blocking, the last
> direct_splice_actor() call will return -EAGAIN, causing
> splice_direct_to_actor() to break from the loop, and sendfile() will
> return a partial transfer.
> 
> However, if the network happens to be faster than the hard disk, and
> the socket buffer keeps getting drained between two
> generic_file_read_iter() calls, the sendfile() system call can keep
> running for a long time, blocking for disk I/O over and over.
> 
> That is undesirable, because it can block the calling process for too
> long.  I discovered a problem where nginx would block for so long that
> it would drop the HTTP connection because the kernel had just
> transferred 2 GiB in one call, and the HTTP socket was not writable
> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
> 
>   sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
>   sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
>   epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
>   gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
>   sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
>   writev(4, [], 0) = 0 <0.000439>
>   epoll_wait(9, [], 512, 60000) = 0 <60.060430>
>   gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
>   write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
>   close(12) = 0 <0.000063>
>   close(4)  = 0 <0.000091>
> 
> In newer nginx versions (since 1.21.4), this problem was worked around
> by defaulting "sendfile_max_chunk" to 2 MiB:
> 
>  https://github.com/nginx/nginx/commit/5636e7f7b4
> 
> Instead of asking userspace to provide an artificial upper limit, I'd
> like the kernel to block for disk I/O at most once, and then pass back
> control to userspace.
> 
> There is prior art for this kind of behavior in filemap_read():
> 
> 	/*
> 	 * If we've already successfully copied some data, then we
> 	 * can no longer safely return -EIOCBQUEUED. Hence mark
> 	 * an async read NOWAIT at that point.
> 	 */
> 	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
> 		iocb->ki_flags |= IOCB_NOWAIT;
> 
> This modifies the caller-provided "struct kiocb", which has an effect
> on repeated filemap_read() calls.  This effect however vanishes
> because the "struct kiocb" is not persistent; splice_direct_to_actor()
> doesn't have one, and each generic_file_splice_read() call initializes
> a new one, losing the "IOCB_NOWAIT" flag that was injected by
> filemap_read().
> 
> There was no way to make generic_file_splice_read() aware that
> IOCB_NOWAIT was desired because some data had already been transferred
> in a previous call:
> 
> - checking whether the input file has O_NONBLOCK doesn't work because
>   this should be fixed even if the input file is not non-blocking
> 
> - the SPLICE_F_NONBLOCK flag is not appropriate because it affects
>   only whether pipe operations are non-blocking, not whether
>   file/socket operations are non-blocking
> 
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice().  It translates to IOCB_NOWAIT for regular files.  For now, I
> have documented the flag to be kernel-internal with a high bit, like
> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
> system call ABI may be a good idea as well.
> 
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
>  fs/splice.c            | 14 ++++++++++++++
>  include/linux/splice.h |  6 ++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index d983d375ff11..c192321d5e37 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
>  	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
>  	init_sync_kiocb(&kiocb, in);
>  	kiocb.ki_pos = *ppos;
> +	if (flags & SPLICE_F_NOWAIT)
> +		kiocb.ki_flags |= IOCB_NOWAIT;
>  	ret = call_read_iter(in, &kiocb, &to);
>  
>  	if (ret > 0) {
> @@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		if (unlikely(ret <= 0))
>  			goto read_failure;
>  
> +		/*
> +		 * After at least one byte was read from the input
> +		 * file, don't wait for blocking I/O in the following
> +		 * loop iterations; instead of blocking for arbitrary
> +		 * amounts of time in the kernel, let userspace decide
> +		 * how to proceed.  This avoids excessive latency if
> +		 * the output is being consumed faster than the input
> +		 * file can fill it (e.g. sendfile() from a slow hard
> +		 * disk to a fast network).
> +		 */
> +		flags |= SPLICE_F_NOWAIT;
> +
>  		read_len = ret;
>  		sd->total_len = read_len;
>  
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index 6c461573434d..abdf94759138 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -23,6 +23,12 @@
>  
>  #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
>  
> +/*
> + * Don't wait for I/O (internal flag for the splice_direct_to_actor()
> + * loop).
> + */
> +#define SPLICE_F_NOWAIT	(1U << 30)
> +
>  /*
>   * Passed to the actors
>   */
> -- 
> 2.39.2
>
Jens Axboe Sept. 20, 2023, 5:28 p.m. UTC | #2
On 9/19/23 8:18 AM, Christian Brauner wrote:
> [+Cc Jens]
> 
> On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
>> If userspace calls sendfile() with a very large "count" parameter, the
>> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
>> have been read from the hard disk and pushed into the socket buffer.
>>
>> Usually, that is not a problem, because the socket write buffer gets
>> filled quickly, and if the socket is non-blocking, the last
>> direct_splice_actor() call will return -EAGAIN, causing
>> splice_direct_to_actor() to break from the loop, and sendfile() will
>> return a partial transfer.
>>
>> However, if the network happens to be faster than the hard disk, and
>> the socket buffer keeps getting drained between two
>> generic_file_read_iter() calls, the sendfile() system call can keep
>> running for a long time, blocking for disk I/O over and over.
>>
>> That is undesirable, because it can block the calling process for too
>> long.  I discovered a problem where nginx would block for so long that
>> it would drop the HTTP connection because the kernel had just
>> transferred 2 GiB in one call, and the HTTP socket was not writable
>> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
>>
>>   sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
>>   sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
>>   epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
>>   gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
>>   sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
>>   writev(4, [], 0) = 0 <0.000439>
>>   epoll_wait(9, [], 512, 60000) = 0 <60.060430>
>>   gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
>>   write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
>>   close(12) = 0 <0.000063>
>>   close(4)  = 0 <0.000091>
>>
>> In newer nginx versions (since 1.21.4), this problem was worked around
>> by defaulting "sendfile_max_chunk" to 2 MiB:
>>
>>  https://github.com/nginx/nginx/commit/5636e7f7b4
>>
>> Instead of asking userspace to provide an artificial upper limit, I'd
>> like the kernel to block for disk I/O at most once, and then pass back
>> control to userspace.
>>
>> There is prior art for this kind of behavior in filemap_read():
>>
>> 	/*
>> 	 * If we've already successfully copied some data, then we
>> 	 * can no longer safely return -EIOCBQUEUED. Hence mark
>> 	 * an async read NOWAIT at that point.
>> 	 */
>> 	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
>> 		iocb->ki_flags |= IOCB_NOWAIT;
>>
>> This modifies the caller-provided "struct kiocb", which has an effect
>> on repeated filemap_read() calls.  This effect however vanishes
>> because the "struct kiocb" is not persistent; splice_direct_to_actor()
>> doesn't have one, and each generic_file_splice_read() call initializes
>> a new one, losing the "IOCB_NOWAIT" flag that was injected by
>> filemap_read().
>>
>> There was no way to make generic_file_splice_read() aware that
>> IOCB_NOWAIT was desired because some data had already been transferred
>> in a previous call:
>>
>> - checking whether the input file has O_NONBLOCK doesn't work because
>>   this should be fixed even if the input file is not non-blocking
>>
>> - the SPLICE_F_NONBLOCK flag is not appropriate because it affects
>>   only whether pipe operations are non-blocking, not whether
>>   file/socket operations are non-blocking
>>
>> Since there are no other parameters, I suggest adding the
>> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
>> affects the "non-pipe" file descriptor passed to sendfile() or
>> splice().  It translates to IOCB_NOWAIT for regular files.  For now, I
>> have documented the flag to be kernel-internal with a high bit, like
>> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
>> system call ABI may be a good idea as well.

I think adding the flag for this case makes sense, and also exposing it
on the UAPI side. My only concern is full coverage of it. We can't
really have a SPLICE_F_NOWAIT flag that only applies to some cases.

That said, asking for a 2G splice, and getting a 2G splice no matter how
slow it may be, is a bit of a "doctor it hurts when I..." scenario.
Max Kellermann Sept. 20, 2023, 6:16 p.m. UTC | #3
On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> I think adding the flag for this case makes sense, and also exposing it
> on the UAPI side.

OK. I suggest we get this patch merged first, and then I prepare a
patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
lowest free bit), add it to SPLICE_F_ALL and document it.

(If you prefer to have it all in this initial patch, I can amend and
resubmit it with the uapi feature.)

> My only concern is full coverage of it. We can't
> really have a SPLICE_F_NOWAIT flag that only applies to some cases.

The feature is already part of uapi - via RWF_NOWAIT, which maps to
IOCB_NOWAIT, just like my proposed SPLICE_F_NOWAIT flag. The semantics
(and the concerns) are the same, aren't they?

> That said, asking for a 2G splice, and getting a 2G splice no matter how
> slow it may be, is a bit of a "doctor it hurts when I..." scenario.

I understand this argument, but I disagree. Compare recv(socket) with
read(regular_file).
A read(regular_file) must block until the given buffer is filled
completely (or EOF is reached), which is good for some programs which
do not handle partial reads, but other programs might be happy with a
partial read and prefer lower latency. There is preadv2(RWF_NOWAIT),
but if it returns EAGAIN, userspace cannot know when data will be
available, can't epoll() regular files. There's no way that a read()
returns at least one byte, but doesn't wait for more (not even with
preadv2(), unfortunately).
recv(socket) (or reading on a pipe) behaves differently - it blocks
only until at least one byte arrives, and callers must be able to deal
with partial reads. That's good for latency - imagine recv() would
behave like read(); how much data do you ask the kernel to receive? If
it's too little, you need many system calls; if it's too much, your
process may block indefinitely.

read(regular_file) behaves that way for historical reasons and we
can't change it, only add new APIs like preadv2(); but splice() is a
modern API that we can optimize for how we want it to behave - and
that is: copy as much as the kernel already has, but don't block after
that (in order to avoid huge latencies).

My point is: splice(2G) is a very reasonable thing to do if userspace
wants the kernel to transfer as much as possible with a single system
call, because there's no way for userspace to know what the best
number is, so let's just pass the largest valid value.

Max
Christian Brauner Sept. 25, 2023, 1:10 p.m. UTC | #4
On Wed, Sep 20, 2023 at 08:16:04PM +0200, Max Kellermann wrote:
> On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> > I think adding the flag for this case makes sense, and also exposing it
> > on the UAPI side.
> 
> OK. I suggest we get this patch merged first, and then I prepare a
> patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> lowest free bit), add it to SPLICE_F_ALL and document it.
> 
> (If you prefer to have it all in this initial patch, I can amend and
> resubmit it with the uapi feature.)

Please resend it all in one patch. That's easier for all to review.
Max Kellermann Sept. 26, 2023, 6:39 a.m. UTC | #5
On Mon, Sep 25, 2023 at 3:10 PM Christian Brauner <brauner@kernel.org> wrote:
> > OK. I suggest we get this patch merged first, and then I prepare a
> > patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> > lowest free bit), add it to SPLICE_F_ALL and document it.
> >
> > (If you prefer to have it all in this initial patch, I can amend and
> > resubmit it with the uapi feature.)
>
> Please resend it all in one patch. That's easier for all to review.

Done. Though I was surprised there's no "uapi" header for splice, so I
only changed the value to 0x10 and added it to SPLICE_F_ALL.

Since the splice manpage is in a different repository, I'll submit a
separate patch for that; can't be one patch.

Max
diff mbox series

Patch

diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@  ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 
 	if (ret > 0) {
@@ -1070,6 +1072,18 @@  ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed.  This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..abdf94759138 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@ 
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT	(1U << 30)
+
 /*
  * Passed to the actors
  */