Message ID | 20230919081259.1094971-1-max.kellermann@ionos.com (mailing list archive)
---|---
State | New, archived
Series | fs/splice: don't block splice_direct_to_actor() after data was read
[+Cc Jens]

On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
> [...]
> Since there are no other parameters, I suggest adding the
> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
> affects the "non-pipe" file descriptor passed to sendfile() or
> splice(). It translates to IOCB_NOWAIT for regular files. For now, I
> have documented the flag to be kernel-internal with a high bit, like
> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
> system call ABI may be a good idea as well.
> [...]
On 9/19/23 8:18 AM, Christian Brauner wrote:
> [+Cc Jens]
>
> On Tue, Sep 19, 2023 at 10:12:58AM +0200, Max Kellermann wrote:
>> If userspace calls sendfile() with a very large "count" parameter, the
>> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
>> have been read from the hard disk and pushed into the socket buffer.
>> [...]
>> Since there are no other parameters, I suggest adding the
>> SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
>> affects the "non-pipe" file descriptor passed to sendfile() or
>> splice(). It translates to IOCB_NOWAIT for regular files. For now, I
>> have documented the flag to be kernel-internal with a high bit, like
>> io_uring does with SPLICE_F_FD_IN_FIXED, but making this part of the
>> system call ABI may be a good idea as well.

I think adding the flag for this case makes sense, and also exposing it
on the UAPI side. My only concern is full coverage of it. We can't
really have a SPLICE_F_NOWAIT flag that only applies to some cases.

That said, asking for a 2G splice, and getting a 2G splice no matter how
slow it may be, is a bit of a "doctor it hurts when I..." scenario.
On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> I think adding the flag for this case makes sense, and also exposing it
> on the UAPI side.

OK. I suggest we get this patch merged first, and then I prepare a
patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
lowest free bit), add it to SPLICE_F_ALL and document it.

(If you prefer to have it all in this initial patch, I can amend and
resubmit it with the uapi feature.)

> My only concern is full coverage of it. We can't
> really have a SPLICE_F_NOWAIT flag that only applies to some cases.

The feature is already part of uapi - via RWF_NOWAIT, which maps to
IOCB_NOWAIT, just like my proposed SPLICE_F_NOWAIT flag. The semantics
(and the concerns) are the same, aren't they?

> That said, asking for a 2G splice, and getting a 2G splice no matter how
> slow it may be, is a bit of a "doctor it hurts when I..." scenario.

I understand this argument, but I disagree. Compare recv(socket) with
read(regular_file).

A read(regular_file) must block until the given buffer is filled
completely (or EOF is reached), which is good for programs that do not
handle partial reads, but other programs might be happy with a partial
read and prefer lower latency. There is preadv2(RWF_NOWAIT), but if it
returns EAGAIN, userspace cannot know when data will become available,
because regular files cannot be polled with epoll(). There is no way to
make a read() return at least one byte but not wait for more (not even
with preadv2(), unfortunately).

recv(socket) (or reading from a pipe) behaves differently - it blocks
only until at least one byte has arrived, and callers must be able to
deal with partial reads. That is good for latency - imagine recv()
behaved like read(): how much data would you ask the kernel to receive?
If it is too little, you need many system calls; if it is too much, your
process may block indefinitely.

read(regular_file) behaves that way for historical reasons and we can't
change it, only add new APIs like preadv2(); but splice() is a modern
API that we can optimize for how we want it to behave - and that is:
copy as much as the kernel already has, but don't block after that (in
order to avoid huge latencies).

My point is: splice(2G) is a very reasonable thing to do if userspace
wants the kernel to transfer as much as possible with a single system
call, because there is no way for userspace to know what the best number
is, so let's just pass the largest valid value.

Max
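As a concrete illustration of the pattern Max describes (pass the largest valid count and let partial returns plus epoll() drive the loop), a non-blocking sendfile() sender might look roughly like the sketch below. This is not code from the thread; the helper name and epoll setup are assumptions, and error handling is simplified.

```c
/*
 * Minimal sketch of a "pass a huge count, handle partial transfers"
 * sendfile() sender.  Assumes sock_fd is a connected, non-blocking
 * socket that has already been registered with epoll_fd, and file_fd
 * is a regular file.
 */
#define _FILE_OFFSET_BITS 64
#include <sys/types.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <errno.h>
#include <limits.h>

static int send_file_chunked(int epoll_fd, int sock_fd, int file_fd,
                             off_t *offset, off_t end)
{
    while (*offset < end) {
        size_t count = (end - *offset > SSIZE_MAX) ? SSIZE_MAX
                                                   : (size_t)(end - *offset);
        ssize_t n = sendfile(sock_fd, file_fd, offset, count);

        if (n > 0)
            continue;               /* partial transfer; keep going */

        if (n < 0 && errno == EAGAIN) {
            /* Socket buffer full: wait for EPOLLOUT before retrying. */
            struct epoll_event ev = { .events = EPOLLOUT, .data.fd = sock_fd };
            epoll_ctl(epoll_fd, EPOLL_CTL_MOD, sock_fd, &ev);
            return 0;               /* resume from the event loop */
        }

        return -1;                  /* unexpected EOF or hard error */
    }

    return 1;                       /* done */
}
```

Before the proposed change, the sendfile() call inside this loop can itself block for a long time whenever the network drains the socket buffer faster than the disk can refill it; the patch bounds that to at most one blocking read per system call.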
On Wed, Sep 20, 2023 at 08:16:04PM +0200, Max Kellermann wrote:
> On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@kernel.dk> wrote:
> > I think adding the flag for this case makes sense, and also exposing it
> > on the UAPI side.
>
> OK. I suggest we get this patch merged first, and then I prepare a
> patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> lowest free bit), add it to SPLICE_F_ALL and document it.
>
> (If you prefer to have it all in this initial patch, I can amend and
> resubmit it with the uapi feature.)

Please resend it all in one patch. That's easier for all to review.
On Mon, Sep 25, 2023 at 3:10 PM Christian Brauner <brauner@kernel.org> wrote:
> > OK. I suggest we get this patch merged first, and then I prepare a
> > patch for wiring this into uapi, changing SPLICE_F_NOWAIT to 0x10 (the
> > lowest free bit), add it to SPLICE_F_ALL and document it.
> >
> > (If you prefer to have it all in this initial patch, I can amend and
> > resubmit it with the uapi feature.)
>
> Please resend it all in one patch. That's easier for all to review.

Done. Though I was surprised there's no "uapi" header for splice, so I
only changed the value to 0x10 and added it to SPLICE_F_ALL.

Since the splice manpage is in a different repository, I'll submit a
separate patch for that; it can't be part of this one.

Max
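For reference, based on the description above (value changed to 0x10, the lowest free bit, and added to SPLICE_F_ALL), the include/linux/splice.h part of the resent patch would presumably look roughly like this sketch; the actual v2 diff is not shown in this thread, so the comment wording is an assumption.

```c
/*
 * Sketch only - not the actual resent patch.  SPLICE_F_NOWAIT moved to
 * the lowest free flag bit and included in SPLICE_F_ALL so userspace
 * may pass it to splice() and sendfile()'s splice path.
 */
#define SPLICE_F_NOWAIT	(0x10)	/* don't wait for file I/O; may return early */

#define SPLICE_F_ALL \
	(SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)
```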
diff --git a/fs/splice.c b/fs/splice.c
index d983d375ff11..c192321d5e37 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,6 +361,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = call_read_iter(in, &kiocb, &to);
 
 	if (ret > 0) {
@@ -1070,6 +1072,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed. This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 6c461573434d..abdf94759138 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,12 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Don't wait for I/O (internal flag for the splice_direct_to_actor()
+ * loop).
+ */
+#define SPLICE_F_NOWAIT	(1U << 30)
+
 /*
  * Passed to the actors
  */
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
  writev(4, [], 0) = 0 <0.000439>
  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
  close(12) = 0 <0.000063>
  close(4) = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

  https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes
because the "struct kiocb" is not persistent; splice_direct_to_actor()
doesn't have one, and each generic_file_splice_read() call initializes
a new one, losing the "IOCB_NOWAIT" flag that was injected by
filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects
  only whether pipe operations are non-blocking, not whether
  file/socket operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files.
For now, I have documented the flag to be kernel-internal with a high
bit, like io_uring does with SPLICE_F_FD_IN_FIXED, but making this part
of the system call ABI may be a good idea as well.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  6 ++++++
 2 files changed, 20 insertions(+)
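To make the SPLICE_F_NONBLOCK vs. SPLICE_F_NOWAIT distinction from the commit message concrete, here is a minimal userspace sketch. It is illustrative only: in this v1 patch the flag is kernel-internal, so the SPLICE_F_NOWAIT value below is an assumption taken from the uapi proposal discussed earlier in the thread, and the helper name is made up.

```c
/*
 * SPLICE_F_NONBLOCK only keeps the *pipe* side of splice() from
 * blocking; the read from the regular file may still sleep on disk
 * I/O.  A uapi SPLICE_F_NOWAIT (assumed here as 0x10, per the thread)
 * would make that file read return -EAGAIN instead of sleeping.
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>      /* splice(), SPLICE_F_* */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

#ifndef SPLICE_F_NOWAIT
#define SPLICE_F_NOWAIT 0x10    /* assumption: value from the proposed uapi change */
#endif

/* Move up to `len` bytes from a regular file into a pipe without
 * sleeping on disk I/O; returns whatever splice() returns. */
static ssize_t fill_pipe_nowait(int file_fd, off_t *off, int pipe_wr, size_t len)
{
    ssize_t n = splice(file_fd, off, pipe_wr, NULL, len,
                       SPLICE_F_NONBLOCK | SPLICE_F_NOWAIT);
    if (n < 0 && errno == EAGAIN)
        fprintf(stderr, "no data ready without blocking\n");
    return n;
}
```

With only SPLICE_F_NONBLOCK, the same call could still sleep on the disk read, which is exactly the gap the patch addresses.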