Message ID | 20240604092431.2183929-1-max.kellermann@ionos.com (mailing list archive)
---|---
State | New
Series | [v3] fs/splice: don't block splice_direct_to_actor() after data was read
On Tue 04-06-24 11:24:31, Max Kellermann wrote:
> If userspace calls sendfile() with a very large "count" parameter, the
> kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
> have been read from the hard disk and pushed into the socket buffer.
>
> Usually, that is not a problem, because the socket write buffer gets
> filled quickly, and if the socket is non-blocking, the last
> direct_splice_actor() call will return -EAGAIN, causing
> splice_direct_to_actor() to break from the loop, and sendfile() will
> return a partial transfer.
>
> However, if the network happens to be faster than the hard disk, and
> the socket buffer keeps getting drained between two
> generic_file_read_iter() calls, the sendfile() system call can keep
> running for a long time, blocking for disk I/O over and over.
>
> That is undesirable, because it can block the calling process for too
> long. I discovered a problem where nginx would block for so long that
> it would drop the HTTP connection because the kernel had just
> transferred 2 GiB in one call, and the HTTP socket was not writable
> (EPOLLOUT) for more than 60 seconds, resulting in a timeout:
>
>  sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
>  sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
>  epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
>  gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
>  sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
>  writev(4, [], 0) = 0 <0.000439>
>  epoll_wait(9, [], 512, 60000) = 0 <60.060430>
>  gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
>  write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
>  close(12) = 0 <0.000063>
>  close(4) = 0 <0.000091>
>
> In newer nginx versions (since 1.21.4), this problem was worked around
> by defaulting "sendfile_max_chunk" to 2 MiB:
>
>  https://github.com/nginx/nginx/commit/5636e7f7b4

Well, I can see your pain but after all the kernel does exactly what
userspace has asked for?

> Instead of asking userspace to provide an artificial upper limit, I'd
> like the kernel to block for disk I/O at most once, and then pass back
> control to userspace.
>
> There is prior art for this kind of behavior in filemap_read():
>
> 	/*
> 	 * If we've already successfully copied some data, then we
> 	 * can no longer safely return -EIOCBQUEUED. Hence mark
> 	 * an async read NOWAIT at that point.
> 	 */
> 	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
> 		iocb->ki_flags |= IOCB_NOWAIT;

Do note the IOCB_WAITQ test - this means that the request is coming from
io_uring and that has a retry logic to handle short reads. I'm really
nervous about changing sendfile behavior to unconditionally return short
writes. In particular see this in do_sendfile():

#if 0
	/*
	 * We need to debate whether we can enable this or not. The
	 * man page documents EAGAIN return for the output at least,
	 * and the application is arguably buggy if it doesn't expect
	 * EAGAIN on a non-blocking file descriptor.
	 */
	if (in.file->f_flags & O_NONBLOCK)
		fl = SPLICE_F_NONBLOCK;
#endif

We have been through these discussions in the past and so far we have
always decided the chances for userspace breakage are too big. After all
there's no substantial difference between userspace issuing a 2GB read(2)
and a 2GB sendfile(2). In both cases you are going to be blocked for a
long time and there are too many userspace applications that depend on
this behavior... So I don't think we want to change the current behavior.
We could create a new interface to provide the semantics you want (like a
flag to sendfile(2) - which would mean a new syscall) but I'm wondering
whether these days io_uring isn't a better answer if you want to more
closely control the IO behavior of your application and are willing to
change the used interface.

								Honza
On Tue, Jun 4, 2024 at 12:41 PM Jan Kara <jack@suse.cz> wrote:
> Well, I can see your pain but after all the kernel does exactly what
> userspace has asked for?

That is a valid point of view; indeed the kernel's behavior is correct
according to the specification, but that was not my point.

This is about an exotic problem that occurs only in very rare
circumstances (depending on hard disk speed, network speed and timing),
but when it occurs, it blocks the calling process for a very long time,
which can then cause problems more serious than user unhappiness
(e.g. expiring timeouts). (As I said, nginx had to work around this
problem.)

I'd like to optimize this special case, and adjust the kernel to always
behave like the common case.

> After all there's no substantial difference between userspace issuing
> a 2GB read(2) and 2GB sendfile(2).

I understand your fear of breaking userspace, but this doesn't apply
here, because yes, there is indeed a substantial difference: in the
normal case, sendfile() stops when the destination socket buffer is
full. That is the normal mode of operation, which all applications must
be prepared for, because short sendfile() calls happen all the time;
that's the common case.

My patch is ONLY about fixing that exotic special case where the socket
buffer is drained over and over while sendfile() still runs.

> there are too many userspace applications that depend on this
> behavior...

True for read() - but which application depends on this very special
behavior that only occurs in very rare exceptional cases?

I think we have a slight misunderstanding about the circumstances of
the problem.
On Tue 04-06-24 13:14:05, Max Kellermann wrote:
> On Tue, Jun 4, 2024 at 12:41 PM Jan Kara <jack@suse.cz> wrote:
> > Well, I can see your pain but after all the kernel does exactly what
> > userspace has asked for?
>
> That is a valid point of view; indeed the kernel's behavior is correct
> according to the specification, but that was not my point.
>
> This is about an exotic problem that occurs only in very rare
> circumstances (depending on hard disk speed, network speed and
> timing), but when it occurs, it blocks the calling process for a very
> long time, which can then cause problems more serious than user
> unhappiness (e.g. expiring timeouts). (As I said, nginx had to work
> around this problem.)
>
> I'd like to optimize this special case, and adjust the kernel to
> always behave like the common case.
>
> > After all there's no substantial difference between userspace
> > issuing a 2GB read(2) and 2GB sendfile(2).
>
> I understand your fear of breaking userspace, but this doesn't apply
> here, because yes, there is indeed a substantial difference: in the
> normal case, sendfile() stops when the destination socket buffer is
> full. That is the normal mode of operation, which all applications
> must be prepared for, because short sendfile() calls happen all the
> time, that's the common case.
>
> My patch is ONLY about fixing that exotic special case where the
> socket buffer is drained over and over while sendfile() still runs.

OK, so that was not clear to me (and this may well be just my ignorance
of networking details). Do you say that your patch changes the behavior
only for this cornercase? Even if the socket fd is blocking? AFAIU with
your patch we'd return a short write in that case as well (roughly 64k
AFAICT because that's the amount the internal splice pipe will take)
but currently we block waiting for more space in the socket bufs?

								Honza
On Tue, Jun 4, 2024 at 3:27 PM Jan Kara <jack@suse.cz> wrote:
> OK, so that was not clear to me (and this may well be just my ignorance of
> networking details). Do you say that your patch changes the behavior only
> for this cornercase? Even if the socket fd is blocking? AFAIU with your
> patch we'd return short write in that case as well (roughly 64k AFAICT
> because that's the amount the internal splice pipe will take) but currently
> we block waiting for more space in the socket bufs?

My patch changes only the file-read side, not the socket-write side. It
adds IOCB_NOWAIT for reading from the file, just like filemap_read()
does. Therefore, it does not matter whether the socket is non-blocking.

But thanks for the reply - this was very helpful input for me, because
I have to admit that part of my explanation was wrong: I misunderstood
how sending to a blocking socket works. I thought that send() and
sendfile() would return after sending at least one byte (only recv()
works that way), but in fact both block until everything has been
submitted. That is a rather embarrassing misunderstanding of socket
basics, but, uh, it just shows I've never really used blocking sockets!

That means my patch can indeed change the behavior of sendfile() in a
way that might surprise (badly written) applications and should NOT be
merged as-is. Your concerns were correct - thanks again!

That leaves me wondering how to solve this. Of course, io_uring is the
proper solution, but that part of my software isn't ready for io_uring
yet.

I could change this to only use IOCB_NOWAIT if the destination is
non-blocking, but something about this sounds wrong - it changes the
read side just because the write side is non-blocking.

We can't change the behavior out of fear of breaking applications; but
can we have a per-file flag so applications can opt into partial
reads/writes? This would be useful for all I/O on regular files (and
sockets and everything else). There would not be any guarantees, just
allowing the kernel to use relaxed semantics for those who can deal
with partial I/O.

Maybe I'm overthinking things and I should just fast-track full
io_uring support in my code...

Max
On Tue 04-06-24 21:24:14, Max Kellermann wrote:
> On Tue, Jun 4, 2024 at 3:27 PM Jan Kara <jack@suse.cz> wrote:
> > OK, so that was not clear to me (and this may well be just my ignorance of
> > networking details). Do you say that your patch changes the behavior only
> > for this cornercase? Even if the socket fd is blocking? AFAIU with your
> > patch we'd return short write in that case as well (roughly 64k AFAICT
> > because that's the amount the internal splice pipe will take) but currently
> > we block waiting for more space in the socket bufs?
>
> My patch changes only the file-read side, not the socket-write side.
> It adds IOCB_NOWAIT for reading from the file, just like
> filemap_read() does. Therefore, it does not matter whether the socket
> is non-blocking.
>
> But thanks for the reply - this was very helpful input for me because
> I have to admit that part of my explanation was wrong:
> I misunderstood how sending to a blocking socket works. I thought that
> send() and sendfile() would return after sending at least one byte
> (only recv() works that way), but in fact both block until everything
> has been submitted.

Yeah, this was exactly what I was trying to point at...

> I could change this to only use IOCB_NOWAIT if the destination is
> non-blocking, but something about this sounds wrong - it changes the
> read side just because the write side is non-blocking.
> We can't change the behavior out of fear of breaking applications; but
> can we have a per-file flag so applications can opt into partial
> reads/writes? This would be useful for all I/O on regular files (and
> sockets and everything else). There would not be any guarantees, just
> allowing the kernel to use relaxed semantics for those who can deal
> with partial I/O.
> Maybe I'm overthinking things and I should just fast-track full
> io_uring support in my code...

Adding open flags is messy (because they are different on different
architectures AFAIR 8-|) and I'm not sure we have much left. In
principle it is possible and I agree it seems useful. But I'm not sure
how widespread use it would get as, as you write above, these days the
time may be better spent on converting code with such needs to
io_uring...

								Honza
diff --git a/fs/splice.c b/fs/splice.c
index 60aed8de21f8..6257fef93ec4 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -362,6 +362,8 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
 	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
+	if (flags & SPLICE_F_NOWAIT)
+		kiocb.ki_flags |= IOCB_NOWAIT;
 	ret = in->f_op->read_iter(&kiocb, &to);
 	if (ret > 0) {
@@ -1090,6 +1092,18 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		if (unlikely(ret <= 0))
 			goto read_failure;
 
+		/*
+		 * After at least one byte was read from the input
+		 * file, don't wait for blocking I/O in the following
+		 * loop iterations; instead of blocking for arbitrary
+		 * amounts of time in the kernel, let userspace decide
+		 * how to proceed. This avoids excessive latency if
+		 * the output is being consumed faster than the input
+		 * file can fill it (e.g. sendfile() from a slow hard
+		 * disk to a fast network).
+		 */
+		flags |= SPLICE_F_NOWAIT;
+
 		read_len = ret;
 		sd->total_len = read_len;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9dec4861d09f..0deb87fb8055 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -21,7 +21,9 @@
 #define SPLICE_F_MORE	(0x04)	/* expect more data */
 #define SPLICE_F_GIFT	(0x08)	/* pages passed in are a gift */
 
-#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
+#define SPLICE_F_NOWAIT	(0x10)	/* do not wait for data which is not immediately available */
+
+#define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT|SPLICE_F_NOWAIT)
 
 /*
  * Passed to the actors
If userspace calls sendfile() with a very large "count" parameter, the
kernel can block for a very long time until 2 GiB (0x7ffff000 bytes)
have been read from the hard disk and pushed into the socket buffer.

Usually, that is not a problem, because the socket write buffer gets
filled quickly, and if the socket is non-blocking, the last
direct_splice_actor() call will return -EAGAIN, causing
splice_direct_to_actor() to break from the loop, and sendfile() will
return a partial transfer.

However, if the network happens to be faster than the hard disk, and
the socket buffer keeps getting drained between two
generic_file_read_iter() calls, the sendfile() system call can keep
running for a long time, blocking for disk I/O over and over.

That is undesirable, because it can block the calling process for too
long. I discovered a problem where nginx would block for so long that
it would drop the HTTP connection because the kernel had just
transferred 2 GiB in one call, and the HTTP socket was not writable
(EPOLLOUT) for more than 60 seconds, resulting in a timeout:

 sendfile(4, 12, [5518919528] => [5884939344], 1813448856) = 366019816 <3.033067>
 sendfile(4, 12, [5884939344], 1447429040) = -1 EAGAIN (Resource temporarily unavailable) <0.000037>
 epoll_wait(9, [{EPOLLOUT, {u32=2181955104, u64=140572166585888}}], 512, 60000) = 1 <0.003355>
 gettimeofday({tv_sec=1667508799, tv_usec=201201}, NULL) = 0 <0.000024>
 sendfile(4, 12, [5884939344] => [8032418896], 2147480496) = 2147479552 <10.727970>
 writev(4, [], 0) = 0 <0.000439>
 epoll_wait(9, [], 512, 60000) = 0 <60.060430>
 gettimeofday({tv_sec=1667508869, tv_usec=991046}, NULL) = 0 <0.000078>
 write(5, "10.40.5.23 - - [03/Nov/2022:21:5"..., 124) = 124 <0.001097>
 close(12) = 0 <0.000063>
 close(4) = 0 <0.000091>

In newer nginx versions (since 1.21.4), this problem was worked around
by defaulting "sendfile_max_chunk" to 2 MiB:

 https://github.com/nginx/nginx/commit/5636e7f7b4

Instead of asking userspace to provide an artificial upper limit, I'd
like the kernel to block for disk I/O at most once, and then pass back
control to userspace.

There is prior art for this kind of behavior in filemap_read():

	/*
	 * If we've already successfully copied some data, then we
	 * can no longer safely return -EIOCBQUEUED. Hence mark
	 * an async read NOWAIT at that point.
	 */
	if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
		iocb->ki_flags |= IOCB_NOWAIT;

This modifies the caller-provided "struct kiocb", which has an effect
on repeated filemap_read() calls. This effect however vanishes because
the "struct kiocb" is not persistent; splice_direct_to_actor() doesn't
have one, and each generic_file_splice_read() call initializes a new
one, losing the "IOCB_NOWAIT" flag that was injected by filemap_read().

There was no way to make generic_file_splice_read() aware that
IOCB_NOWAIT was desired because some data had already been transferred
in a previous call:

- checking whether the input file has O_NONBLOCK doesn't work because
  this should be fixed even if the input file is not non-blocking

- the SPLICE_F_NONBLOCK flag is not appropriate because it affects only
  whether pipe operations are non-blocking, not whether file/socket
  operations are non-blocking

Since there are no other parameters, I suggest adding the
SPLICE_F_NOWAIT flag, which is similar to SPLICE_F_NONBLOCK, but
affects the "non-pipe" file descriptor passed to sendfile() or
splice(). It translates to IOCB_NOWAIT for regular files, just like
RWF_NOWAIT does.

---

Changes v1 -> v2:
- value of SPLICE_F_NOWAIT changed to 0x10
- added SPLICE_F_NOWAIT to SPLICE_F_ALL to make it part of uapi

v2 -> v3: repost and rebase on linus/master

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/splice.c            | 14 ++++++++++++++
 include/linux/splice.h |  4 +++-
 2 files changed, 17 insertions(+), 1 deletion(-)