Message ID | 20230926063609.2451260-1-max.kellermann@ionos.com (mailing list archive)
---|---
State | New, archived
Series | [v2] fs/splice: don't block splice_direct_to_actor() after data was read
> +		/*
> +		 * After at least one byte was read from the input
> +		 * file, don't wait for blocking I/O in the following
> +		 * loop iterations; instead of blocking for arbitrary
> +		 * amounts of time in the kernel, let userspace decide
> +		 * how to proceed.  This avoids excessive latency if
> +		 * the output is being consumed faster than the input
> +		 * file can fill it (e.g. sendfile() from a slow hard
> +		 * disk to a fast network).
> +		 */
> +		flags |= SPLICE_F_NOWAIT;
> +

Hm, so the thing that is worrisome about this change is that it may
cause regressions afaict, as this is a pretty significant change from
current behavior.
On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> Hm, so the thing that is worrisome about this change is that it may
> cause regressions afaict, as this is a pretty significant change from
> current behavior.

Would you prefer a new flag for explicitly selecting "wait until at
least one byte was transferred, but don't wait further"? Because many
applications need this behavior, and some (like nginx) have already
worked around the problem by limiting the maximum transaction size,
which I consider a bad workaround, because it leads to unnecessary
system calls and still doesn't really solve the latency problem.

On the other hand, what exactly would the absence of this flag mean?
The old behavior, without my patch, can lead to partial transfers, and
the absence of the flag doesn't mean partial transfers can't happen;
my patch tackles just one corner case, but one that is important for
me.

We have been running this patch in production for nearly a year (and
will continue to do so until upstream kernels have a proper solution)
and have never observed a problem. I consider it safe, but I
acknowledge the risk that this may reveal obscure application bugs if
applied globally to all Linux kernels, so I understand your worries.

Max
On Tue, Sep 26, 2023 at 12:41:42PM +0200, Max Kellermann wrote:
> On Tue, Sep 26, 2023 at 12:21 PM Christian Brauner <brauner@kernel.org> wrote:
> > Hm, so the thing that is worrisome about this change is that it may
> > cause regressions afaict, as this is a pretty significant change from
> > current behavior.
>
> Would you prefer a new flag for explicitly selecting "wait until at
> least one byte was transferred, but don't wait further"? Because many

I had thought about it, but afaict it'd be rather annoying: one can get
into that code from copy_file_range() as well, so we'd need a new flag
for that system call too.

> applications need this behavior, and some (like nginx) have already
> worked around the problem by limiting the maximum transaction size,
> which I consider a bad workaround, because it leads to unnecessary
> system calls and still doesn't really solve the latency problem.
>
> On the other hand, what exactly would the absence of this flag mean?
> The old behavior, without my patch, can lead to partial transfers, and
> the absence of the flag doesn't mean partial transfers can't happen;
> my patch tackles just one corner case, but one that is important for
> me.
>
> We have been running this patch in production for nearly a year (and
> will continue to do so until upstream kernels have a proper solution)
> and have never observed a problem. I consider it safe, but I
> acknowledge the risk that this may reveal obscure application bugs if
> applied globally to all Linux kernels, so I understand your worries.

I think hanging for an insane amount of time is indeed a problem, and
tweaking the code in this way might actually be useful, but we'd need
to let this soak for quite a while to see whether it causes any issues.

@Jens, what do you think? Is this worth it?
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 
 	if (ret > 0) {
@@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed.  This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;

diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..06ce58b1f408 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -21,7 +21,9 @@
 #define SPLICE_F_MORE	(0x04) /* expect more data */
 #define SPLICE_F_GIFT	(0x08) /* pages passed in are a gift */
 
-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+#define SPLICE_F_NOWAIT	(0x10) /* do not wait for data which is not immediately available */
+
+#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)
 
 /*
  * Passed to the actors
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
  writev(4, [], 0) = 0 <0.000439>
  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
  close(12) = 0 <0.000063>
  close(4) = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

  https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit,
I'd like the kernel to block for disk I/O at most once, and then pass
back control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes because
the "struct kiocb" is not persistent; splice_direct_to_actor() doesn't
have one, and each generic_file_splice_read() call initializes a new
one, losing the "IOCB_NOWAIT" flag that was injected by filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects only
  whether pipe operations are non-blocking, not whether file/socket
  operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files, just like
RWF_NOWAIT does.

Changes v1 -> v2:
- value of SPLICE_F_NOWAIT changed to 0x10
- added SPLICE_F_NOWAIT to SPLICE_F_ALL to make it part of uapi

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  4 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)