[v2] fs/splice: don't block splice_direct_to_actor() after data was read

Message ID 20230926063609.2451260-1-max.kellermann@ionos.com (mailing list archive)
State New, archived

Commit Message

Max Kellermann Sept. 26, 2023, 6:36 a.m. UTC
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.
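
For illustration, the usual caller pattern looks roughly like this (a
minimal sketch of the common non-blocking sender loop, not code from
nginx or any real server):

  #include <errno.h>
  #include <sys/sendfile.h>
  #include <sys/types.h>

  /*
   * Hand the kernel the whole remaining range and rely on a short return
   * or EAGAIN once the non-blocking socket fills up.  If the network
   * drains the socket buffer faster than the disk delivers data, one
   * sendfile() call can keep blocking on disk I/O until the per-call
   * limit of 0x7ffff000 bytes is reached.
   */
  static int send_remaining(int sock_fd, int file_fd, off_t *offset,
                            size_t remaining)
  {
          while (remaining > 0) {
                  ssize_t n = sendfile(sock_fd, file_fd, offset, remaining);

                  if (n < 0)
                          return errno == EAGAIN ? 0 : -1; /* wait for EPOLLOUT, or fail */
                  if (n == 0)
                          break;          /* unexpected end of file */
                  remaining -= (size_t)n;
          }
          return 1;                       /* everything sent */
  }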

That is undesirable, because it can block the calling process for too
long.  I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
  writev(4, [], 0) = 0 <0.000439>
  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
  close(12) = 0 <0.000063>
  close(4)  = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

 https://github.com/nginx/nginx/commit/5636e7f7b4
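
The workaround amounts to capping the per-call byte count in userspace,
trading extra system calls for bounded blocking.  A rough sketch of
that idea (not nginx's actual code):

  #include <sys/sendfile.h>
  #include <sys/types.h>

  #define SENDFILE_CHUNK  (2UL << 20)     /* 2 MiB, like nginx's default */

  /*
   * Never ask the kernel for more than SENDFILE_CHUNK bytes at once, so a
   * single call can block on at most 2 MiB worth of disk reads.  The cost
   * is one extra system call per chunk for large files.
   */
  static ssize_t send_chunk(int sock_fd, int file_fd, off_t *offset,
                            size_t remaining)
  {
          size_t len = remaining < SENDFILE_CHUNK ? remaining : SENDFILE_CHUNK;

          return sendfile(sock_fd, file_fd, offset, len);
  }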

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls.  In the splice path, however, this
effect vanishes because the "struct kiocb" is not persistent:
splice_direct_to_actor() doesn't have one, and each
generic_file_splice_read() call initializes a new one, losing the
IOCB_NOWAIT flag that filemap_read() had injected.

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
  only whether pipe operations are non-blocking, not whether
  file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice().  It translates to IOCB_NOWAIT for regular files, just like
RWF_NOWAIT does.
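
Exporting the flag through SPLICE_F_ALL lets userspace opt in per call.
A minimal usage sketch with a hypothetical helper (it assumes a kernel
with this patch applied and defines the constant locally, since it is
not in released uapi headers):

  #define _GNU_SOURCE             /* for splice() and SPLICE_F_* */
  #include <errno.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  #ifndef SPLICE_F_NOWAIT
  #define SPLICE_F_NOWAIT (0x10)  /* value proposed in v2 of this patch */
  #endif

  /*
   * Move up to "count" bytes from a regular file into a pipe, asking the
   * kernel not to wait for data which is not immediately available.
   * Needs a kernel with this patch applied; older kernels reject the
   * unknown flag with EINVAL.
   */
  static ssize_t splice_file_to_pipe_nowait(int file_fd, loff_t *off,
                                            int pipe_wr_fd, size_t count)
  {
          ssize_t n = splice(file_fd, off, pipe_wr_fd, NULL, count,
                             SPLICE_F_MOVE | SPLICE_F_NOWAIT);

          if (n < 0 && errno == EAGAIN)
                  return 0;       /* nothing ready right now; retry later */
          return n;
  }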

Changes v1 -> v2:
- value of SPLICE_F_NOWAIT changed to 0x10
- added SPLICE_F_NOWAIT to SPLICE_F_ALL to make it part of uapi

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  4 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)

Comments

Christian Brauner Sept. 26, 2023, 10:21 a.m. UTC | #1
> +		/*
> +		 * After at least one byte was read from the input
> +		 * file, don't wait for blocking I/O in the following
> +		 * loop iterations; instead of blocking for arbitrary
> +		 * amounts of time in the kernel, let userspace decide
> +		 * how to proceed.  This avoids excessive latency if
> +		 * the output is being consumed faster than the input
> +		 * file can fill it (e.g. sendfile() from a slow hard
> +		 * disk to a fast network).
> +		 */
> +		flags |= SPLICE_F_NOWAIT;
> +

Hm, so the thing that is worrisome about this change is that this may
cause regressions afaict as this is a pretty significant change from
current behavior.
Max Kellermann Sept. 26, 2023, 10:41 a.m. UTC | #2
On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> Hm, so the thing that is worrisome about this change is that this may
> cause regressions afaict as this is a pretty significant change from
> current behavior.

Would you prefer a new flag for explicitly selecting "wait until at
least one byte was transferred, but don't wait further"? Because many
applications need this behavior, and some (like nginx) have already
worked around the problem by limiting the maximum transaction size,
which I consider a bad workaround, because it leads to unnecessary
system calls and still doesn't really solve the latency problem.

On the other hand, what exactly would the absence of this flag mean?
The old behavior, without my patch, can already lead to partial
transfers, so omitting the flag doesn't guarantee they won't happen;
my patch tackles just one corner case, but one that is important for
me.

We have been running this patch in production for nearly a year (and
will continue to do so until upstream kernels have a proper solution)
and never observed a problem, and I consider it safe, but I
acknowledge the risk that this may reveal obscure application bugs if
applied globally to all Linux kernels, so I understand your worries.

Max
Christian Brauner Sept. 26, 2023, 12:26 p.m. UTC | #3
On Tue, Sep 26, 2023 at 12:41:42PM +0200, Max Kellermann wrote:
> On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> > Hm, so the thing that is worrisome about this change is that this may
> > cause regressions afaict as this is a pretty significant change from
> > current behavior.
> 
> Would you prefer a new flag for explicitly selecting "wait until at
> least one byte was transferred, but don't wait further"? Because many

I had thought about it, but afaict it'd be rather annoying: one can get
into that code from copy_file_range() as well, so we'd need a new flag
for that system call too.

> applications need this behavior, and some (like nginx) have already
> worked around the problem by limiting the maximum transaction size,
> which I consider a bad workaround, because it leads to unnecessary
> system calls and still doesn't really solve the latency problem.
> 
> On the other hand, what exactly would the absence of this flag mean?
> The old behavior, without my patch, can already lead to partial
> transfers, so omitting the flag doesn't guarantee they won't happen;
> my patch tackles just one corner case, but one that is important for
> me.
> 
> We have been running this patch in production for nearly a year (and
> will continue to do so until upstream kernels have a proper solution)
> and never observed a problem, and I consider it safe, but I
> acknowledge the risk that this may reveal obscure application bugs if
> applied globally to all Linux kernels, so I understand your worries.

I think hanging for an insane amount of time is indeed a problem and
tweaking the code in this way might actually be useful but we'd need to
let this soak for quite a while to see whether this causes any issues.

@Jens, what do you think? Is this worth it?

Patch

diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@  ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 
 	if (ret > 0) {
@@ -1070,6 +1072,18 @@  ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed.  This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..06ce58b1f408 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -21,7 +21,9 @@ 
 #define SPLICE_F_MORE	(0x04)	/* expect more data */
 #define SPLICE_F_GIFT	(0x08)	/* pages passed in are a gift */
 
-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+#define SPLICE_F_NOWAIT	(0x10) /* do not wait for data which is not immediately available */
+
+#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)
 
 /*
  * Passed to the actors