Message ID | 499791.1685485603@warthog.procyon.org.uk (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | Bug in short splice to socket? | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Tue, 30 May 2023 23:26:43 +0100 David Howells wrote: > Interesting. Now that you've pointed me at it, I've tried running it. Mostly > it passes, but I'm having some problems with the multi_chunk_sendfile tests > that time out. I think that splice_direct_to_actor() has a bug. The problem > is this bit of code: > > /* > * If more data is pending, set SPLICE_F_MORE > * If this is the last data and SPLICE_F_MORE was not set > * initially, clears it. > */ > if (read_len < len) > sd->flags |= SPLICE_F_MORE; > else if (!more) > sd->flags &= ~SPLICE_F_MORE; > > When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be > passed to the network protocol) if we haven't yet read everything that the > user requested and clears it if we fulfilled what the user requested. > > This has the weird effect that MSG_MORE gets kind of inverted. It's never > seen by the actor if we can read the entire request into the pipe - except if > we hit the EOF first. If we hit the EOF before we fulfil the entire request, > we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set. The > upstream TLS code ignores it - but I'm changing this with my patches as > sendmsg() then uses it to mark the EOR. > > I think we probably need to fix this in some way to check the size of source > file - which may not be a regular file:-/ With the attached change, all tests > pass; without it, a bunch of tests fail with timeouts. Yeah.. it's one of those 'known warts' which we worked around in TLS because I don't know enough about VFS to confidently fix it in fs/. Proper fix would be pretty nice to have. The original-original report of the problem is here, FWIW: https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ And my somewhat hacky fix was d452d48b9f8.
From: David Howells > Sent: 30 May 2023 23:27 > > Jakub Kicinski <kuba@kernel.org> wrote: > > > Will the TLS selftests under tools/.../net/tls.c exercise this? > > Interesting. Now that you've pointed me at it, I've tried running it. Mostly > it passes, but I'm having some problems with the multi_chunk_sendfile tests > that time out. I think that splice_direct_to_actor() has a bug. The problem > is this bit of code: > > /* > * If more data is pending, set SPLICE_F_MORE > * If this is the last data and SPLICE_F_MORE was not set > * initially, clears it. > */ > if (read_len < len) > sd->flags |= SPLICE_F_MORE; > else if (!more) > sd->flags &= ~SPLICE_F_MORE; > > When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be > passed to the network protocol) if we haven't yet read everything that the > user requested and clears it if we fulfilled what the user requested. > > This has the weird effect that MSG_MORE gets kind of inverted. It's never > seen by the actor if we can read the entire request into the pipe - except if > we hit the EOF first. If we hit the EOF before we fulfil the entire request, > we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set. The > upstream TLS code ignores it - but I'm changing this with my patches as > sendmsg() then uses it to mark the EOR. Isn't MSG_MORE supposed to be just a hint that more data will follow. So you'd expect a final send with MSG_MORE to get sent, but possibly after a short timeout. Using it as a record marker seems wrong. I'm not sure how to clear 'Oh bugger I set MSG_MORE but have no data' to avoid the timeout. A zero length semdmsg() won't DTRT with protocols like SCTP. (Does splice even do anything sensible with SCTP?) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Tue, May 30, 2023 at 6:26 PM David Howells <dhowells@redhat.com> wrote: > > When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be > passed to the network protocol) if we haven't yet read everything that the > user requested and clears it if we fulfilled what the user requested. > > This has the weird effect that MSG_MORE gets kind of inverted. [...] I hate this patch. The old code is unquestionably garbage, but this just ends up resulting in more of the same. The reason the old code is garbage is that it sets SPLICE_F_MORE entirely in the wrong place. It sets it *after* it has done the splice(). That's just crazy. And that craziness is the direct cause of the bug. You try to fix the bug by just extending on the craziness. No. The crazy should be removed, not extended upon. The "flag" setting should be done *before* the send, if we know that more data is pending even after the "do_splice_to()". Not after. And the networking code should do its own "oh, splice gave me X bytes, but I only used Y, so I know more data is pending, so I'll set MSG_MORE on the *current* packet". But that's entirely inside of whatever networking code that does th esplice. So no. I absolutely refuse to entertain this patch because it's completely broken. The fact that the old code was also completely broken is *not* an excuse to make for more nonsensical breakage that may or may just hide the crazy. Linus
On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > The reason the old code is garbage is that it sets SPLICE_F_MORE > entirely in the wrong place. It sets it *after* it has done the > splice(). That's just crazy. Clarification, because there are two splice's (from and to): by "after the splice" I mean after the splice source, of course. It's still set before the splice _to_ the network. (But it is still true that I hope the networking code itself then sets MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is bigger than what it can handle right now - so there are two *different* reasons for "more data" - either the source knows it has more to give, or the destination knows it didn't use everything it got). The point is that the splice *source* knows whether there is more data to be had, and that's where the "there is more" should be set. But the generic code does *not* know. You add a completely horrendous hack to kind of approximate that knowledge, but it's *wrong*. The old code was wrong too, of course. No question about that. Basically my argument is that the whole "there is more data" should be set by "->splice_read()" not by some hack in some generic splice_direct_to_actor() function that is absolutely not supposed to know about the internals of the source or the destination. Do we have a good interface for that? No. I get the feeling that to actually fix this, we'd have to pass in the 'struct splice_desc" pointer to ->splice_read(). Then the reading side can say "yup, I have more data for you after this", and it all makes sense at least conceptually. So please, can we just fix the layering violation, rather than make it worse? Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > The reason the old code is garbage is that it sets SPLICE_F_MORE > > entirely in the wrong place. It sets it *after* it has done the > > splice(). That's just crazy. > > Clarification, because there are two splice's (from and to): by "after > the splice" I mean after the splice source, of course. It's still set > before the splice _to_ the network. > > (But it is still true that I hope the networking code itself then sets > MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is > bigger than what it can handle right now - so there are two > *different* reasons for "more data" - either the source knows it has > more to give, or the destination knows it didn't use everything it > got). I'm in the process of changing things so that ->sendpage() is removed and replaced with sendmsg() with MSG_SPLICE_PAGES. The aim is to end up with a splice_to_socket() function (see attached) that transcribes a chunk of the pipe into a BVEC iterator and does a single sendmsg() that pushes a whole chunk of data to the socket in one go. In the network protocol, the loop inside sendmsg splices those pages into socket buffers, sleeping as necessary to gain sufficient memory to transcribe all of them. It can return partly done and the fully consumed pages will be consumed and then it'll return to gain more data. At the moment, it transcribes 16 pages at a time. I could make it set MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b) there's yet more data in the buffer. However, this might well cause a malfunction in UDP, for example. MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K into a packet, if, say, 16K is read from the source and entirely transcribed into the packet, if I understand what you're proposing, MSG_MORE wouldn't get set and the packet would be transmitted early. A similar issue might exist for AF_TLS, where the lack of MSG_MORE triggers an end-of-record. The problem isn't that the buffer supplied is bigger than the protocol could handle in one go, it's that what we read was less than the amount that the user requested - but we don't know if there will be more data. One solution might be to pass MSG_MORE regardless, and then follow up with a null sendmsg() if we then hit a zero-length read (ie. EOF) or -ENODATA (ie. a hole). > The point is that the splice *source* knows whether there is more data > to be had, and that's where the "there is more" should be set. Actually, that's not necessarily true. If the source is a pipe or a socket, there may not be more, but we don't know that until the far end is closed - but for a file it probably is (we could be racing with a writer). And then there's seqfile-based things like the trace buffer or procfiles which are really a class of their own. > Basically my argument is that the whole "there is more data" should be > set by "->splice_read()" not by some hack in some generic > splice_direct_to_actor() function that is absolutely not supposed to > know about the internals of the source or the destination. > > Do we have a good interface for that? No. I get the feeling that to > actually fix this, we'd have to pass in the 'struct splice_desc" > pointer to ->splice_read(). That might become more feasible, actually. In Jens's tree are the patches to clean up splice_read a lot and get rid of ITER_PIPE. Most of the ->splice_reads are going to be direct calls to copy_splice_read() and filemap_splice_read() or wrappers around filemap_splice_read(). The latter, at least, might be in a good position to note EOF, perhaps by marking a pipe buffer as "no more". copy_splice_read() is more problematic because ->read_iter() doesn't indicate if it hit EOF (or hit a hole). David --- ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { struct socket *sock = sock_from_file(out); struct bio_vec bvec[16]; struct msghdr msg = {}; ssize_t ret; size_t spliced = 0; bool need_wakeup = false; pipe_lock(pipe); while (len > 0) { unsigned int head, tail, mask, bc = 0; size_t remain = len; /* * Check for signal early to make process killable when there * are always buffers available */ ret = -ERESTARTSYS; if (signal_pending(current)) break; while (pipe_empty(pipe->head, pipe->tail)) { ret = 0; if (!pipe->writers) goto out; if (spliced) goto out; ret = -EAGAIN; if (flags & SPLICE_F_NONBLOCK) goto out; ret = -ERESTARTSYS; if (signal_pending(current)) goto out; if (need_wakeup) { wakeup_pipe_writers(pipe); need_wakeup = false; } pipe_wait_readable(pipe); } head = pipe->head; tail = pipe->tail; mask = pipe->ring_size - 1; while (!pipe_empty(head, tail)) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; size_t seg; if (!buf->len) { tail++; continue; } seg = min_t(size_t, remain, buf->len); seg = min_t(size_t, seg, PAGE_SIZE); ret = pipe_buf_confirm(pipe, buf); if (unlikely(ret)) { if (ret == -ENODATA) ret = 0; break; } bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset); remain -= seg; if (seg >= buf->len) tail++; if (bc >= ARRAY_SIZE(bvec)) break; } if (!bc) break; msg.msg_flags = 0; if (flags & SPLICE_F_MORE) msg.msg_flags = MSG_MORE; if (remain && pipe_occupancy(pipe->head, tail) > 0) msg.msg_flags = MSG_MORE; msg.msg_flags |= MSG_SPLICE_PAGES; iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain); ret = sock_sendmsg(sock, &msg); if (ret <= 0) break; spliced += ret; len -= ret; tail = pipe->tail; while (ret > 0) { struct pipe_buffer *buf = &pipe->bufs[tail & mask]; size_t seg = min_t(size_t, ret, buf->len); buf->offset += seg; buf->len -= seg; ret -= seg; if (!buf->len) { pipe_buf_release(pipe, buf); tail++; } } if (tail != pipe->tail) { pipe->tail = tail; if (pipe->files) need_wakeup = true; } } out: pipe_unlock(pipe); if (need_wakeup) wakeup_pipe_writers(pipe); return spliced ?: ret; }
On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote: > > At the moment, it transcribes 16 pages at a time. I could make it set > MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b) > there's yet more data in the buffer. That would at least be a good first step. > However, this might well cause a malfunction in UDP, for example. MSG_MORE > corks the current packet, so if I ask sendfile() say shove 32K into a packet, > if, say, 16K is read from the source and entirely transcribed into the packet, If you use splice() for UDP, I don't think you would normally expect to get all that well-defined packet boundaries. That said, I think *this* part of splice_direct_to_actor() is correct: if (read_len < len) sd->flags |= SPLICE_F_MORE; <- WRONG else if (!more) sd->flags &= ~SPLICE_F_MORE; <- CORRECT ie if we've used up all of the 'len' argument, *and* 'more' wasn't set in the incoming flags, then at that point we should clear SPLICE_F_MORE. So that means that UDP packets boundaries will be honored at the 'splice()' system call 'len' argument. Obviously packet boundaries might happen before that - ie depending on what the packet size limits are. But the "set SPLICE_F_MORE" bit is just wrong. The generic code simply does not know enough to make that determination. > if I understand what you're proposing, MSG_MORE wouldn't get set and the > packet would be transmitted early. No, I'm saying that MSG_MORE should be set depending on what the splice *input* says. If the splice input knows that it has more to give but stopped early for whatever reason (typically that the splice pipe buffers filled up, but that's not necessarily the *only* reason), then it should set SPLICE_F_MORE. But this is literally only something that the input side *can* know. And as you mention, some input sides cannot know even that. Regular files typically know if there is more data. Other dynamic sources may simply not know. And if they know, they just shouldn't set SPLICE_F_MORE. Of course, SPLICE_F_MORE may then be set because the *user* passed in that flag, but that's a completely separate issue. The user may pass in that flag because the user wants maximally sized packets, and knows that other things will be fed into the destination (not even necessarily through splice) after the splice. So you really have multiple different reasons why SPLICE_F_MORE might get set, but that if (read_len < len) is *not* a valid reason. And no, extending that logic with more random logic is still not a valid reason. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > However, this might well cause a malfunction in UDP, for example. > > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K > > into a packet, if, say, 16K is read from the source and entirely > > transcribed into the packet, > > If you use splice() for UDP, I don't think you would normally expect > to get all that well-defined packet boundaries. Actually, it will. Attempting to overfill a UDP packet with splice will get you -EMSGSIZE. It won't turn a splice into more than one UDP packet. I wonder if the right solution actually is to declare that the problem is userspace's. If you ask it to splice Z amount of data and it can't manage that because the source dries up prematurely, then make it so that you assume it always passed MSG_MORE and returns a short splice to userspace. Userspace can retry the splice/sendfile or do an empty sendmsg() to cap the message (or cancel it). Perhaps flushing a short message is actually a *bad* idea. The answer then might be to make TLS handle a zero-length send() and fix the test cases. Would the attached changes then work for you? David --- diff --git a/fs/splice.c b/fs/splice.c index 3e06611d19ae..237688b0700b 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -956,13 +956,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, */ bytes = 0; len = sd->total_len; + + /* Don't block on output, we have to drain the direct pipe. */ flags = sd->flags; + sd->flags &= ~SPLICE_F_NONBLOCK; /* - * Don't block on output, we have to drain the direct pipe. + * We signal MORE until we've read sufficient data to fulfill the + * request and we keep signalling it if the caller set it. */ - sd->flags &= ~SPLICE_F_NONBLOCK; more = sd->flags & SPLICE_F_MORE; + sd->flags |= SPLICE_F_MORE; WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail)); @@ -978,14 +982,12 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, sd->total_len = read_len; /* - * If more data is pending, set SPLICE_F_MORE - * If this is the last data and SPLICE_F_MORE was not set - * initially, clears it. + * If we now have sufficient data to fulfill the request then + * we clear SPLICE_F_MORE if it was not set initially. */ - if (read_len < len) - sd->flags |= SPLICE_F_MORE; - else if (!more) + if (read_len >= len && !more) sd->flags &= ~SPLICE_F_MORE; + /* * NOTE: nonblocking mode only applies to the input. We * must not do the output in nonblocking mode as then we diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index f63e4405cf34..5d48391da16c 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg, } } + if (!msg_data_left(msg) && eor) + goto copied; + while (msg_data_left(msg)) { if (sk->sk_err) { ret = -sk->sk_err; diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c index e699548d4247..7df31583f2a4 100644 --- a/tools/testing/selftests/net/tls.c +++ b/tools/testing/selftests/net/tls.c @@ -377,7 +377,7 @@ static void chunked_sendfile(struct __test_metadata *_metadata, char buf[TLS_PAYLOAD_MAX_LEN]; uint16_t test_payload_size; int size = 0; - int ret; + int ret = 0; char filename[] = "/tmp/mytemp.XXXXXX"; int fd = mkstemp(filename); off_t offset = 0; @@ -398,6 +398,9 @@ static void chunked_sendfile(struct __test_metadata *_metadata, size -= ret; } + if (ret < chunk_size) + EXPECT_EQ(send(self->fd, "", 0, 0), 0); + EXPECT_EQ(recv(self->cfd, buf, test_payload_size, MSG_WAITALL), test_payload_size);
On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> The answer then might be to make TLS handle a zero-length send()
IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around
this exact problem. Your refactoring happens to break it and what
you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST
is unnecessary, it's user's fault".
A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
with the deadly mess of the unchecked sendmsg()/sendpage() flags.
Jakub Kicinski <kuba@kernel.org> wrote: > On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote: > > The answer then might be to make TLS handle a zero-length send() > > IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around > this exact problem. Your refactoring happens to break it and what > you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST > is unnecessary, it's user's fault". > > A bit unconvincing. Maybe Eric would chime in, I'm not too familiar > with the deadly mess of the unchecked sendmsg()/sendpage() flags. Not so much the "user's fault" as we couldn't fulfill what the user asked - so should we leave it to the user to work out how to clean it up rather than automatically allowing the socket to flush (if cancellation might be an option instead)? The problem I have with NOTLAST is that it won't be asserted if the short read only occupies a single pipe_buf. We don't know that we won't get some more data on the next pass. An alternative way to maintain the current behaviour might be to have splice_direct_to_actor() call the actor with sd->total_len == 0 if do_splice_to() returned 0 and SPLICE_F_MORE wasn't set by the caller (ie. !more). Attached is a change to do that. It becomes simpler if/once my splice_to_socket() patches are applied - but I don't really want to push that until all the protocols that support sendpage() support sendmsg() + MSG_SPLICE_PAGES as well[*]. [*] Though if you're okay having a small window where TLS copies data rather than splicing, I could push the splice_to_socket() patches *first*. TCP and AF_UNIX splice already support MSG_SPLICE_PAGES so that would bump their efficiency. David --- diff --git a/fs/splice.c b/fs/splice.c index 3e06611d19ae..84e9ca06db47 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -643,6 +643,22 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des wakeup_pipe_writers(pipe); } +/* + * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE + * turned off to allow the network to see MSG_MORE deasserted. + */ +static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe, + struct splice_desc *sd, + splice_actor *actor) +{ + struct pipe_buffer buf = { + .page = ZERO_PAGE(0), + .ops = &nosteal_pipe_buf_ops, + }; + + return actor(pipe, &buf, sd); +} + /** * __splice_from_pipe - splice data from a pipe to given actor * @pipe: pipe to splice from @@ -662,6 +678,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd, int ret; splice_from_pipe_begin(sd); + if (!sd->total_len) + return splice_from_pipe_zero(pipe, sd, actor); + do { cond_resched(); ret = splice_from_pipe_next(pipe, sd); @@ -956,13 +975,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, */ bytes = 0; len = sd->total_len; + + /* Don't block on output, we have to drain the direct pipe. */ flags = sd->flags; + sd->flags &= ~SPLICE_F_NONBLOCK; /* - * Don't block on output, we have to drain the direct pipe. + * We signal MORE until we've read sufficient data to fulfill the + * request and we keep signalling it if the caller set it. */ - sd->flags &= ~SPLICE_F_NONBLOCK; more = sd->flags & SPLICE_F_MORE; + sd->flags |= SPLICE_F_MORE; WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail)); @@ -971,21 +994,19 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, loff_t pos = sd->pos, prev_pos = pos; ret = do_splice_to(in, &pos, pipe, len, flags); - if (unlikely(ret <= 0)) + if (unlikely(ret < 0)) goto out_release; read_len = ret; sd->total_len = read_len; /* - * If more data is pending, set SPLICE_F_MORE - * If this is the last data and SPLICE_F_MORE was not set - * initially, clears it. + * If we now have sufficient data to fulfill the request then + * we clear SPLICE_F_MORE if it was not set initially. */ - if (read_len < len) - sd->flags |= SPLICE_F_MORE; - else if (!more) + if ((read_len == 0 || read_len >= len) && !more) sd->flags &= ~SPLICE_F_MORE; + /* * NOTE: nonblocking mode only applies to the input. We * must not do the output in nonblocking mode as then we @@ -1005,6 +1026,8 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, sd->pos = prev_pos + ret; goto out_release; } + if (read_len < 0) + goto out_release; } done: diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index f63e4405cf34..5d48391da16c 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg, } } + if (!msg_data_left(msg) && eor) + goto copied; + while (msg_data_left(msg)) { if (sk->sk_err) { ret = -sk->sk_err;
On Fri, Jun 2, 2023 at 4:23 AM David Howells <dhowells@redhat.com> wrote: > > +/* > + * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE > + * turned off to allow the network to see MSG_MORE deasserted. > + */ > +static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe, David, STOP. Stop doing these crazy hacks to generic splice code. None of this makes sense. Those zero-sized splices have never worked, don't try to make them work. The splice() system call has always had that if (!len) return 0; since being introduced. We're not adding any crazy stuff now. Do what I already suggested: making SPLICE_F_MORE reflect reality. Or don't do anything at all. Because this kind of hacky garbage is not even amusing. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Do what I already suggested: making SPLICE_F_MORE reflect reality.
I'm trying to. I need MSG_MORE to behave sensibly for what I want.
What I have signals SPLICE_F_MORE (and thus MSG_MORE) as long as we haven't
yet read enough data to fulfill the request - and will break out of the loop
if we get a zero-length read.
But this causes a change in behaviour because we then leave the protocol
having seen MSG_MORE set where it didn't previously see that.
This causes "tls -r tls.12_aes_gcm.multi_chunk_sendfile" on the TLS kselftest
to fail.
Now, if we're fine with the change in behaviour, I can make the selftest
observe the short sendfile() and cancel MSG_MORE itself - but that's just a
test program.
So that's the question: Do I have to maintain the current behaviour for the
short-splice case?
David
On Fri, Jun 2, 2023 at 7:45 AM David Howells <dhowells@redhat.com> wrote: > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Do what I already suggested: making SPLICE_F_MORE reflect reality. > > I'm trying to. I need MSG_MORE to behave sensibly for what I want. But you need to stop doing these random hacks to fs/splice.c The point is, you *CANNOT* make SPLICE_F_MORE reflect reality by hacking fs/splice.c. Really. The generic layer DOES NOT KNOW, AND FUNDAMENTALLY CANNOT KNOW if there is more data to be had. So any of these random patches that try to add heuristics to fs/splice.c will be rejected out of hand. They simply cannot be correct. And no, on the whole I do not believe you have to maintain some selftest. A selftest failure is worrisome in that it clearly shows that some behavior changed, but the situation here is (a) the current behavior is arguably bad and buggy (b) if we want to fix that bug, then the current behavior *will* change Now, the only question then is whether the self-test actually tests anything that user space actually depends on, or if it just tests some random corner case. So the self-test is certainly a ref flag, but not necessarily a very meaningful one. It shows that some user-visible change happened, which is always a big danger flag, but after all that was the whole *point* of the whole exercise. The fact that the self-test caught the change is good, because it means we had test coverage, but when the behavior is something we *want* to change, the test failure is not a problem in itself. So what I think you should do is to fix the bug right, with a clean patch, and no crazy hacks. That is something we can then apply and test. All the while knowing full well that "uhhuh, this is a visible change, we may have to revert it". If then some *real* load ends up showing a regression, we may just be screwed. Our current behavior may be buggy, but we have the rule that once user space depends on kernel bugs, they become features pretty much by definition, however much we might dislike it. At that point, we'll have to see what we can do - if anything. Basically, what I think the SPLICE_F_MORE rules *should* be (and hey, I may be missing something) is 1) if the user set that bit in the flags, then it's always true. The user basically told us "I will supply more data even after the splice has finished", so it doesn't matter if the kernel runs out of data in the middle. 2) if the splice read side sees "I was asked for N bytes, but I could only supply X bytes and I still have more to give", when we should set SPLICE_F_MORE internally ("temporarily") for the next splice call. This is basically the "kernel independently knows that there will be more data" case. 3) In the end, this is all "best effort" and to some degree inevitably a heuristic. We cannot see the future. We may hit that case #2 and set the "there will be more data" bit, but then get a signal and finish the splice system call before that more data actually happens. Now, presumably the user will then continue the partial splice after handling the signal, so (3) is still "right", but obviously we can't _know_ that. A corollary to (3) is that the reader side may not always know if there will be more data to be read. For a file source, it's fairly clear (modulo the obvious caveats - files can be truncated etc etc). For other splice sources, the "I still have more to give" may not be as unambiguous. It is what it is. Am I missing some important case? Considering that we clearly do *not* do a great job at SPLICE_F_MORE right now, I'd really want the situation to be either that we just make the code "ClearlyCorrect(tm)" and simple, or we just leave it alone as "that's our odd behavior, deal with it". None of this "let's change this all to be even more complex, and handle some particular special case the way I want" crap. Do it right, or don't do it at all. Linus
On Fri, 2 Jun 2023 08:11:47 -0400 Linus Torvalds wrote: > If then some *real* load ends up showing a regression, we may just be > screwed. Our current behavior may be buggy, but we have the rule that > once user space depends on kernel bugs, they become features pretty > much by definition, however much we might dislike it. > > At that point, we'll have to see what we can do - if anything. Can we have a provisional plan of how we'll fix it if someone does complain? We can't just revert David's work, and if none of the solutions are appealing - socket implementations may be left holding the bag. I dislike the magic zero sends, and I think you do, too. In case of TLS its unclear whether we should generate an empty record (like UDP would). Can we add an optional splice_end / short_splice / splice_underflow / splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback to struct file_operations?
On Fri, Jun 2, 2023 at 12:39 PM Jakub Kicinski <kuba@kernel.org> wrote: > > Can we add an optional splice_end / short_splice / splice_underflow / > splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback > to struct file_operations? A splice_end() operation might well be the simplest model, but I think it's broken. It would certainly be easy to implement: file descriptor that doesn't care about SPLICE_F_MORE - so most of them - would just leave it as NULL, and the splice code could decide to call it *if* it had left the last splice with SPLICE_F_MORE, _and_ the user hadn't set it, and the file descriptor wants that information. But I think one of the problems here is one of "what the hell is the meaning of that bit"? In particular, think about what happens if a signal is pending, and we return with a partially completed write? There potentially *is* more data to be sent, it's just not sent by *this* splice() call, as user space has to handle the signal first. What is the semantics of SPLICE_F_MORE in that kind of situation? Which is why I really think that it would be *so* much better if we really let the whole SPLICE_F_MORE bit be a signal from the *input* side. I know I've been harping on this, but just from a "sane semantics" standpoint, I really think the only thing that *really* makes sense is for the input side of a splice to say "I gave you X amount of data, but I have more to give". And that would *literally* be the semantic meaning of that SPLICE_F_MORE bit. Wouldn't it be lovely to have some actual documented meaning to it, which does *not* depend on things like ".. but what if a signal happens" issues? And yes, it's entirely possible that I'm missing something, and I'm misunderstanding what people really want, but I do feel like this is a somewhat subtle area, and if people really care about the exact semantics of SPLICE_F_MORE, then we need to *have* exact semantics for it. And no, I don't think "splice_end()" can be that exact semantics - even if it's simple - exactly because splice() is an interruptible operation, so the "end" of a splice() is simply not a stable thing. I also do wonder how much we care. What are the situations where the packet boundaries can really matter in actual real world. Exactly because I'm not 100% convinced we've had super-stable behavior here. The fact that a test-case never triggers signal handling in the middle of a splice() call isn't exactly a huge surprise. The test case probably doesn't *have* signals. But it just means that the test-case isn't all that real-life. Linus
On Fri, Jun 2, 2023 at 12:53 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And no, I don't think "splice_end()" can be that exact semantics - > even if it's simple - exactly because splice() is an interruptible > operation, so the "end" of a splice() is simply not a stable thing. Just to harp some more on this - if SPLICE_F_MORE is seen as purely a performance hit, with no real semantic value, and will still set random packet boundaries but we want big packets for all the _usual_ cases, then I think something like "splice_end()" can be a fine solution regardless of exact semantics. Alternatively, if we make it the rule that "splice_end()" is only called on EOF situations - so signals etc do not matter - then the semantics would be stable and sound fine to me too. In that second case, I'd like to literally name it that way, and actually call it "splice_eof()". Because I'd like to really make it very clear what the semantics would be. So a "splice_eof()" sounds fine to me, and we'd make the semantics be the current behavior: - splice() sets SPLICE_F_MORE if 'len > read_len' - splice() _clears_ SPLICE_F_MORE if we have hit 'len' - splice always sets SPLICE_F_MORE if it was passed by the user BUT with the small new 'splice_eof()' rule that: - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that "use all of len" case that cleared SPLICE_F_MORE, *and* we did a "->splice_in()" that returned EOF (ie zero), *then* we will also do that ->splice_eof() call. The above sounds like "stable and possibly useful semantics" to me. It would just have to be documented. Is that what people want? I don't think it's conceptually any different from my suggestion of saying "->splice_read() can set SPLICE_F_MORE if it has more to give", just a different implementation that doesn't require changes on the splice_read() side. Linus
On Fri, 2 Jun 2023 13:05:14 -0400 Linus Torvalds wrote: > Just to harp some more on this - if SPLICE_F_MORE is seen as purely a > performance hit, with no real semantic value, and will still set > random packet boundaries but we want big packets for all the _usual_ > cases, then I think something like "splice_end()" can be a fine > solution regardless of exact semantics. > > Alternatively, if we make it the rule that "splice_end()" is only > called on EOF situations - so signals etc do not matter - then the > semantics would be stable and sound fine to me too. > > In that second case, I'd like to literally name it that way, and > actually call it "splice_eof()". Because I'd like to really make it > very clear what the semantics would be. > > So a "splice_eof()" sounds fine to me, and we'd make the semantics be > the current behavior: > > - splice() sets SPLICE_F_MORE if 'len > read_len' > > - splice() _clears_ SPLICE_F_MORE if we have hit 'len' > > - splice always sets SPLICE_F_MORE if it was passed by the user > > BUT with the small new 'splice_eof()' rule that: > > - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that > "use all of len" case that cleared SPLICE_F_MORE, *and* we did a > "->splice_in()" that returned EOF (ie zero), *then* we will also do > that ->splice_eof() call. > > The above sounds like "stable and possibly useful semantics" to me. It > would just have to be documented. > > Is that what people want? ->splice_eof() with the proposed semantics sounds perfect for the cases testers complained about it the past, IMHO. We can pencil that in as the contingency plan. Actually I like these semantics so much I'm tempted to ask David to implement it already and save users potential debugging :D > I don't think it's conceptually any different from my suggestion of > saying "->splice_read() can set SPLICE_F_MORE if it has more to give", > just a different implementation that doesn't require changes on the > splice_read() side. Setting SPLICE_F_MORE from the input side does feel much cleaner than guessing in splice.c. But we may end up needing the eof() callback for the corner cases, anyway :(
Linus Torvalds <torvalds@linux-foundation.org> wrote: > So a "splice_eof()" sounds fine to me, and we'd make the semantics be > the current behavior: > > - splice() sets SPLICE_F_MORE if 'len > read_len' > > - splice() _clears_ SPLICE_F_MORE if we have hit 'len' > > - splice always sets SPLICE_F_MORE if it was passed by the user > > BUT with the small new 'splice_eof()' rule that: > > - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that > "use all of len" case that cleared SPLICE_F_MORE, *and* we did a > "->splice_in()" that returned EOF (ie zero), *then* we will also do > that ->splice_eof() call. > > The above sounds like "stable and possibly useful semantics" to me. It > would just have to be documented. > > Is that what people want? That's easier to implement, I think. That's basically what I was trying to achieve by sending a zero-length actor call, but this is a cleaner way of doing it, particularly if it's added as a socket op next to ->sendmsg(). Otherwise I have to build up the input side to try and tell me in advance whether it thinks we hit an EOF/hole/whatever condition. The problem is that, as previously mentioned, it doesn't work for all circumstances - seqfile, pipes, sockets for instance. Take the following scenario for example: I could read from a TCP socket, filling up the pipe-buffer, but not with sufficient data to fulfill the operation. Say I drain the TCP socket, but it's still open, so might produce more data. I then call the actor, which passes all the data to sendmsg() with MSG_SPLICE_PAGES and MSG_MORE and clears the buffer. I then go round again, but in the meantime, the source socket got shut down with no further data available and do_splice_to() returns 0. There's no way to predict this, so having a ->splice_eof() call would handle this situation. David
Linus Torvalds <torvalds@linux-foundation.org> wrote: > BUT with the small new 'splice_eof()' rule that: > > - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that > "use all of len" case that cleared SPLICE_F_MORE, *and* we did a > "->splice_in()" that returned EOF (ie zero), *then* we will also do > that ->splice_eof() call. But not if we didn't splice anything at all yet? David
From: Linus Torvalds > Sent: 01 June 2023 16:12 > > On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote: > > > > At the moment, it transcribes 16 pages at a time. I could make it set > > MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b) > > there's yet more data in the buffer. > > That would at least be a good first step. > > > However, this might well cause a malfunction in UDP, for example. MSG_MORE > > corks the current packet, so if I ask sendfile() say shove 32K into a packet, > > if, say, 16K is read from the source and entirely transcribed into the packet, > > If you use splice() for UDP, I don't think you would normally expect > to get all that well-defined packet boundaries. Especially since (assuming I've understood other bits of this thread) the splice() can get split into multiple sendmsg() calls for other reasons. What semantics are you trying to implement for AF_TLS? MSG_MORE has different effects on different protocols. For UDP the next data is appended to the datagram being built. (This is really pretty pointless, doing it in the caller will be faster!) For TCP it stops the pending data being sent immediately. And more data is appended. I'm pretty sure it gets sent on timeout. For SCTP the data chunk created for the sendmsg() isn't sent immediately. Any more sendmsg(MSG_MORE) get queued until a full ethernet packet is buffered. The pending data is sent on timeout. This is pretty much the only way to get two (or more) DATA chunks into an ethernet frame when Nagle is disabled. But I get the impression AF_TLS is deciding not to encode/send the data because 'there isn't enough'. That seems wrong. Note that you can't use a zero length sendmsg() to flush pending data - if there is no pending data some protocols will send a zero length data message. A socket option/ioctl (eg UNCORK) could be (ab)used to force queued data be sent. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
David Laight <David.Laight@ACULAB.COM> wrote: > > > However, this might well cause a malfunction in UDP, for example. > > > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K > > > into a packet, if, say, 16K is read from the source and entirely > > > transcribed into the packet, > > > > If you use splice() for UDP, I don't think you would normally expect > > to get all that well-defined packet boundaries. > > Especially since (assuming I've understood other bits of this thread) > the splice() can get split into multiple sendmsg() calls for other > reasons. Yes - with SPLICE_F_MORE/MSG_MORE set on all but the last piece. The issue is what happens if the input side gets a premature EOF after we've passed a chunk with MSG_MORE set when the caller didn't indicate SPLICE_F_MORE? > What semantics are you trying to implement for AF_TLS? As I understand it, deasserting MSG_MORE causes a record boundary to be interposed on TLS. > MSG_MORE has different effects on different protocols. Do you mean "different protocols" in relation to TLS specifically? Software vs device vs device-specific like Chelsio-TLS? > For UDP the next data is appended to the datagram being built. > (This is really pretty pointless, doing it in the caller will be faster!) Splice with SPLICE_F_MORE seems to work the same as sendmsg with MSG_MORE here. You get an error if you try to append with splice or sendmsg more than a single packet will hold. > For TCP it stops the pending data being sent immediately. > And more data is appended. > I'm pretty sure it gets sent on timeout. Yeah - corking is used by some network filesystem protocols, presumably to better place RPC messages into TCP packets. > For SCTP the data chunk created for the sendmsg() isn't sent immediately. > Any more sendmsg(MSG_MORE) get queued until a full ethernet packet > is buffered. > The pending data is sent on timeout. > This is pretty much the only way to get two (or more) DATA chunks > into an ethernet frame when Nagle is disabled. SCTP doesn't support sendpage, so that's not an issue. > But I get the impression AF_TLS is deciding not to encode/send > the data because 'there isn't enough'. > That seems wrong. > > Note that you can't use a zero length sendmsg() to flush pending > data - if there is no pending data some protocols will send a > zero length data message. > A socket option/ioctl (eg UNCORK) could be (ab)used to force > queued data be sent. Yeah - I've changed that, see v4. I've implemented Linus's ->splice_eof() idea. David
diff --git a/fs/splice.c b/fs/splice.c index 3e06611d19ae..a7cf216c02a7 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -982,10 +982,21 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd, * If this is the last data and SPLICE_F_MORE was not set * initially, clears it. */ - if (read_len < len) - sd->flags |= SPLICE_F_MORE; - else if (!more) + if (read_len < len) { + struct inode *ii = in->f_mapping->host; + + if (ii->i_fop->llseek != noop_llseek && + pos >= i_size_read(ii)) { + if (!more) + sd->flags &= ~SPLICE_F_MORE; + } else { + sd->flags |= SPLICE_F_MORE; + } + + } else if (!more) { sd->flags &= ~SPLICE_F_MORE; + } + /* * NOTE: nonblocking mode only applies to the input. We * must not do the output in nonblocking mode as then we
Jakub Kicinski <kuba@kernel.org> wrote: > Will the TLS selftests under tools/.../net/tls.c exercise this? Interesting. Now that you've pointed me at it, I've tried running it. Mostly it passes, but I'm having some problems with the multi_chunk_sendfile tests that time out. I think that splice_direct_to_actor() has a bug. The problem is this bit of code: /* * If more data is pending, set SPLICE_F_MORE * If this is the last data and SPLICE_F_MORE was not set * initially, clears it. */ if (read_len < len) sd->flags |= SPLICE_F_MORE; else if (!more) sd->flags &= ~SPLICE_F_MORE; When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be passed to the network protocol) if we haven't yet read everything that the user requested and clears it if we fulfilled what the user requested. This has the weird effect that MSG_MORE gets kind of inverted. It's never seen by the actor if we can read the entire request into the pipe - except if we hit the EOF first. If we hit the EOF before we fulfil the entire request, we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set. The upstream TLS code ignores it - but I'm changing this with my patches as sendmsg() then uses it to mark the EOR. I think we probably need to fix this in some way to check the size of source file - which may not be a regular file:-/ With the attached change, all tests pass; without it, a bunch of tests fail with timeouts. David ---