diff mbox series

Bug in short splice to socket?

Message ID 499791.1685485603@warthog.procyon.org.uk (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Bug in short splice to socket? | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

David Howells May 30, 2023, 10:26 p.m. UTC
Jakub Kicinski <kuba@kernel.org> wrote:

> Will the TLS selftests under tools/.../net/tls.c exercise this?

Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
it passes, but I'm having some problems with the multi_chunk_sendfile tests
that time out.  I think that splice_direct_to_actor() has a bug.  The problem
is this bit of code:

		/*
		 * If more data is pending, set SPLICE_F_MORE
		 * If this is the last data and SPLICE_F_MORE was not set
		 * initially, clears it.
		 */
		if (read_len < len)
			sd->flags |= SPLICE_F_MORE;
		else if (!more)
			sd->flags &= ~SPLICE_F_MORE;

When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
passed to the network protocol) if we haven't yet read everything that the
user requested and clears it if we fulfilled what the user requested.

This has the weird effect that MSG_MORE gets kind of inverted.  It's never
seen by the actor if we can read the entire request into the pipe - except if
we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
upstream TLS code ignores it - but I'm changing this with my patches as
sendmsg() then uses it to mark the EOR.

I think we probably need to fix this in some way to check the size of source
file - which may not be a regular file:-/  With the attached change, all tests
pass; without it, a bunch of tests fail with timeouts.

David
---

Comments

Jakub Kicinski May 31, 2023, 12:32 a.m. UTC | #1
On Tue, 30 May 2023 23:26:43 +0100 David Howells wrote:
> Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
> it passes, but I'm having some problems with the multi_chunk_sendfile tests
> that time out.  I think that splice_direct_to_actor() has a bug.  The problem
> is this bit of code:
> 
> 		/*
> 		 * If more data is pending, set SPLICE_F_MORE
> 		 * If this is the last data and SPLICE_F_MORE was not set
> 		 * initially, clears it.
> 		 */
> 		if (read_len < len)
> 			sd->flags |= SPLICE_F_MORE;
> 		else if (!more)
> 			sd->flags &= ~SPLICE_F_MORE;
> 
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
> 
> This has the weird effect that MSG_MORE gets kind of inverted.  It's never
> seen by the actor if we can read the entire request into the pipe - except if
> we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
> we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
> upstream TLS code ignores it - but I'm changing this with my patches as
> sendmsg() then uses it to mark the EOR.
> 
> I think we probably need to fix this in some way to check the size of source
> file - which may not be a regular file:-/  With the attached change, all tests
> pass; without it, a bunch of tests fail with timeouts.

Yeah.. it's one of those 'known warts' which we worked around in TLS
because I don't know enough about VFS to confidently fix it in fs/.
Proper fix would be pretty nice to have.

The original-original report of the problem is here, FWIW:
https://lore.kernel.org/netdev/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/

And my somewhat hacky fix was d452d48b9f8.
David Laight June 1, 2023, 11:01 a.m. UTC | #2
From: David Howells
> Sent: 30 May 2023 23:27
> 
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > Will the TLS selftests under tools/.../net/tls.c exercise this?
> 
> Interesting.  Now that you've pointed me at it, I've tried running it.  Mostly
> it passes, but I'm having some problems with the multi_chunk_sendfile tests
> that time out.  I think that splice_direct_to_actor() has a bug.  The problem
> is this bit of code:
> 
> 		/*
> 		 * If more data is pending, set SPLICE_F_MORE
> 		 * If this is the last data and SPLICE_F_MORE was not set
> 		 * initially, clears it.
> 		 */
> 		if (read_len < len)
> 			sd->flags |= SPLICE_F_MORE;
> 		else if (!more)
> 			sd->flags &= ~SPLICE_F_MORE;
> 
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
> 
> This has the weird effect that MSG_MORE gets kind of inverted.  It's never
> seen by the actor if we can read the entire request into the pipe - except if
> we hit the EOF first.  If we hit the EOF before we fulfil the entire request,
> we get a short read and SPLICE_F_MORE and thus MSG_MORE *is* set.  The
> upstream TLS code ignores it - but I'm changing this with my patches as
> sendmsg() then uses it to mark the EOR.

Isn't MSG_MORE supposed to be just a hint that more data will follow.
So you'd expect a final send with MSG_MORE to get sent, but possibly
after a short timeout.

Using it as a record marker seems wrong.

I'm not sure how to clear 'Oh bugger I set MSG_MORE but have no data'
to avoid the timeout.
A zero length semdmsg() won't DTRT with protocols like SCTP.
(Does splice even do anything sensible with SCTP?)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds June 1, 2023, 1:09 p.m. UTC | #3
On Tue, May 30, 2023 at 6:26 PM David Howells <dhowells@redhat.com> wrote:
>
> When used with sendfile(), it sets SPLICE_F_MORE (which causes MSG_MORE to be
> passed to the network protocol) if we haven't yet read everything that the
> user requested and clears it if we fulfilled what the user requested.
>
> This has the weird effect that MSG_MORE gets kind of inverted. [...]

I hate this patch.

The old code is unquestionably garbage, but this just ends up
resulting in more of the same.

The reason the old code is garbage is that it sets SPLICE_F_MORE
entirely in the wrong place. It sets it *after* it has done the
splice(). That's just crazy.

And that craziness is the direct cause of the bug.

You try to fix the bug by just extending on the craziness. No. The
crazy should be removed, not extended upon.

The "flag" setting should be done *before* the send, if we know that
more data is pending even after the "do_splice_to()". Not after.

And the networking code should do its own "oh, splice gave me X bytes,
but I only used Y, so I know more data is pending, so I'll set
MSG_MORE on the *current* packet". But that's entirely inside of
whatever networking code that does th esplice.

So no. I absolutely refuse to entertain this patch because it's
completely broken. The fact that the old code was also completely
broken is *not* an excuse to make for more nonsensical breakage that
may or may just hide the crazy.

               Linus
Linus Torvalds June 1, 2023, 1:19 p.m. UTC | #4
On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The reason the old code is garbage is that it sets SPLICE_F_MORE
> entirely in the wrong place. It sets it *after* it has done the
> splice(). That's just crazy.

Clarification, because there are two splice's (from and to): by "after
the splice" I mean after the splice source, of course. It's still set
before the splice _to_ the network.

(But it is still true that I hope the networking code itself then sets
MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
bigger than what it can handle right now - so there are two
*different* reasons for "more data" - either the source knows it has
more to give, or the destination knows it didn't use everything it
got).

The point is that the splice *source* knows whether there is more data
to be had, and that's where the "there is more" should be set.

But the generic code does *not* know. You add a completely horrendous
hack to kind of approximate that knowledge, but it's *wrong*.

The old code was wrong too, of course. No question about that.

Basically my argument is that the whole "there is more data" should be
set by "->splice_read()" not by some hack in some generic
splice_direct_to_actor() function that is absolutely not supposed to
know about the internals of the source or the destination.

Do we have a good interface for that? No. I get the feeling that to
actually fix this, we'd have to pass in the 'struct splice_desc"
pointer to ->splice_read().

Then the reading side can say "yup, I have more data for you after
this", and it all makes sense at least conceptually.

So please, can we just fix the layering violation, rather than make it worse?

                Linus
David Howells June 1, 2023, 2:34 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 1, 2023 at 9:09 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The reason the old code is garbage is that it sets SPLICE_F_MORE
> > entirely in the wrong place. It sets it *after* it has done the
> > splice(). That's just crazy.
> 
> Clarification, because there are two splice's (from and to): by "after
> the splice" I mean after the splice source, of course. It's still set
> before the splice _to_ the network.
> 
> (But it is still true that I hope the networking code itself then sets
> MSG_MORE even if SPLICE_F_MORE wasn't set, if it gets a buffer that is
> bigger than what it can handle right now - so there are two
> *different* reasons for "more data" - either the source knows it has
> more to give, or the destination knows it didn't use everything it
> got).

I'm in the process of changing things so that ->sendpage() is removed and
replaced with sendmsg() with MSG_SPLICE_PAGES.  The aim is to end up with a
splice_to_socket() function (see attached) that transcribes a chunk of the
pipe into a BVEC iterator and does a single sendmsg() that pushes a whole
chunk of data to the socket in one go.

In the network protocol, the loop inside sendmsg splices those pages into
socket buffers, sleeping as necessary to gain sufficient memory to transcribe
all of them.  It can return partly done and the fully consumed pages will be
consumed and then it'll return to gain more data.

At the moment, it transcribes 16 pages at a time.  I could make it set
MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
there's yet more data in the buffer.

However, this might well cause a malfunction in UDP, for example.  MSG_MORE
corks the current packet, so if I ask sendfile() say shove 32K into a packet,
if, say, 16K is read from the source and entirely transcribed into the packet,
if I understand what you're proposing, MSG_MORE wouldn't get set and the
packet would be transmitted early.  A similar issue might exist for AF_TLS,
where the lack of MSG_MORE triggers an end-of-record.

The problem isn't that the buffer supplied is bigger than the protocol could
handle in one go, it's that what we read was less than the amount that the
user requested - but we don't know if there will be more data.

One solution might be to pass MSG_MORE regardless, and then follow up with a
null sendmsg() if we then hit a zero-length read (ie. EOF) or -ENODATA (ie. a
hole).

> The point is that the splice *source* knows whether there is more data
> to be had, and that's where the "there is more" should be set.

Actually, that's not necessarily true.  If the source is a pipe or a socket,
there may not be more, but we don't know that until the far end is closed -
but for a file it probably is (we could be racing with a writer).  And then
there's seqfile-based things like the trace buffer or procfiles which are
really a class of their own.

> Basically my argument is that the whole "there is more data" should be
> set by "->splice_read()" not by some hack in some generic
> splice_direct_to_actor() function that is absolutely not supposed to
> know about the internals of the source or the destination.
> 
> Do we have a good interface for that? No. I get the feeling that to
> actually fix this, we'd have to pass in the 'struct splice_desc"
> pointer to ->splice_read().

That might become more feasible, actually.  In Jens's tree are the patches to
clean up splice_read a lot and get rid of ITER_PIPE.  Most of the
->splice_reads are going to be direct calls to copy_splice_read() and
filemap_splice_read() or wrappers around filemap_splice_read().  The latter,
at least, might be in a good position to note EOF, perhaps by marking a pipe
buffer as "no more".

copy_splice_read() is more problematic because ->read_iter() doesn't indicate
if it hit EOF (or hit a hole).

David
---
ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
			 loff_t *ppos, size_t len, unsigned int flags)
{
	struct socket *sock = sock_from_file(out);
	struct bio_vec bvec[16];
	struct msghdr msg = {};
	ssize_t ret;
	size_t spliced = 0;
	bool need_wakeup = false;

	pipe_lock(pipe);

	while (len > 0) {
		unsigned int head, tail, mask, bc = 0;
		size_t remain = len;

		/*
		 * Check for signal early to make process killable when there
		 * are always buffers available
		 */
		ret = -ERESTARTSYS;
		if (signal_pending(current))
			break;

		while (pipe_empty(pipe->head, pipe->tail)) {
			ret = 0;
			if (!pipe->writers)
				goto out;

			if (spliced)
				goto out;

			ret = -EAGAIN;
			if (flags & SPLICE_F_NONBLOCK)
				goto out;

			ret = -ERESTARTSYS;
			if (signal_pending(current))
				goto out;

			if (need_wakeup) {
				wakeup_pipe_writers(pipe);
				need_wakeup = false;
			}

			pipe_wait_readable(pipe);
		}

		head = pipe->head;
		tail = pipe->tail;
		mask = pipe->ring_size - 1;

		while (!pipe_empty(head, tail)) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg;

			if (!buf->len) {
				tail++;
				continue;
			}

			seg = min_t(size_t, remain, buf->len);
			seg = min_t(size_t, seg, PAGE_SIZE);

			ret = pipe_buf_confirm(pipe, buf);
			if (unlikely(ret)) {
				if (ret == -ENODATA)
					ret = 0;
				break;
			}

			bvec_set_page(&bvec[bc++], buf->page, seg, buf->offset);
			remain -= seg;
			if (seg >= buf->len)
				tail++;
			if (bc >= ARRAY_SIZE(bvec))
				break;
		}

		if (!bc)
			break;

		msg.msg_flags = 0;
		if (flags & SPLICE_F_MORE)
			msg.msg_flags = MSG_MORE;
		if (remain && pipe_occupancy(pipe->head, tail) > 0)
			msg.msg_flags = MSG_MORE;
		msg.msg_flags |= MSG_SPLICE_PAGES;

		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, bc, len - remain);
		ret = sock_sendmsg(sock, &msg);
		if (ret <= 0)
			break;

		spliced += ret;
		len -= ret;
		tail = pipe->tail;
		while (ret > 0) {
			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
			size_t seg = min_t(size_t, ret, buf->len);

			buf->offset += seg;
			buf->len -= seg;
			ret -= seg;

			if (!buf->len) {
				pipe_buf_release(pipe, buf);
				tail++;
			}
		}

		if (tail != pipe->tail) {
			pipe->tail = tail;
			if (pipe->files)
				need_wakeup = true;
		}
	}

out:
	pipe_unlock(pipe);
	if (need_wakeup)
		wakeup_pipe_writers(pipe);
	return spliced ?: ret;
}
Linus Torvalds June 1, 2023, 3:12 p.m. UTC | #6
On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote:
>
> At the moment, it transcribes 16 pages at a time.  I could make it set
> MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
> there's yet more data in the buffer.

That would at least be a good first step.

> However, this might well cause a malfunction in UDP, for example.  MSG_MORE
> corks the current packet, so if I ask sendfile() say shove 32K into a packet,
> if, say, 16K is read from the source and entirely transcribed into the packet,

If you use splice() for UDP, I don't think you would normally expect
to get all that well-defined packet boundaries.

That said, I think *this* part of splice_direct_to_actor() is correct:

                if (read_len < len)
                        sd->flags |= SPLICE_F_MORE;    <- WRONG
                else if (!more)
                        sd->flags &= ~SPLICE_F_MORE; <- CORRECT

ie if we've used up all of the 'len' argument, *and* 'more' wasn't set
in the incoming flags, then at that point we should clear
SPLICE_F_MORE.

So that means that UDP packets boundaries will be honored at the
'splice()' system call 'len' argument.

Obviously packet boundaries might happen before that - ie depending on
what the packet size limits are.

But the "set SPLICE_F_MORE" bit is just wrong. The generic code simply
does not know enough to make that determination.

> if I understand what you're proposing, MSG_MORE wouldn't get set and the
> packet would be transmitted early.

No, I'm saying that MSG_MORE should be set depending on what the
splice *input* says.

If the splice input knows that it has more to give but stopped early
for whatever reason (typically that the splice pipe buffers filled up,
but that's not necessarily the *only* reason), then it should set
SPLICE_F_MORE.

But this is literally only something that the input side *can* know.

And as you mention, some input sides cannot know even that. Regular
files typically know if there is more data. Other dynamic sources may
simply not know. And if they know, they just shouldn't set
SPLICE_F_MORE.

Of course, SPLICE_F_MORE may then be set because the *user* passed in
that flag, but that's a completely separate issue. The user may pass
in that flag because the user wants maximally sized packets, and knows
that other things will be fed into the destination (not even
necessarily through splice) after the splice.

So you really have multiple different reasons why SPLICE_F_MORE might
get set, but that

                if (read_len < len)

is *not* a valid reason. And no, extending that logic with more random
logic is still not a valid reason.

            Linus
David Howells June 1, 2023, 5:14 p.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > However, this might well cause a malfunction in UDP, for example.
> > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K
> > into a packet, if, say, 16K is read from the source and entirely
> > transcribed into the packet,
> 
> If you use splice() for UDP, I don't think you would normally expect
> to get all that well-defined packet boundaries.

Actually, it will.  Attempting to overfill a UDP packet with splice will get
you -EMSGSIZE.  It won't turn a splice into more than one UDP packet.

I wonder if the right solution actually is to declare that the problem is
userspace's.  If you ask it to splice Z amount of data and it can't manage
that because the source dries up prematurely, then make it so that you assume
it always passed MSG_MORE and returns a short splice to userspace.  Userspace
can retry the splice/sendfile or do an empty sendmsg() to cap the message (or
cancel it).  Perhaps flushing a short message is actually a *bad* idea.

The answer then might be to make TLS handle a zero-length send() and fix the
test cases.  Would the attached changes then work for you?

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..237688b0700b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -956,13 +956,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	 */
 	bytes = 0;
 	len = sd->total_len;
+
+	/* Don't block on output, we have to drain the direct pipe. */
 	flags = sd->flags;
+	sd->flags &= ~SPLICE_F_NONBLOCK;
 
 	/*
-	 * Don't block on output, we have to drain the direct pipe.
+	 * We signal MORE until we've read sufficient data to fulfill the
+	 * request and we keep signalling it if the caller set it.
 	 */
-	sd->flags &= ~SPLICE_F_NONBLOCK;
 	more = sd->flags & SPLICE_F_MORE;
+	sd->flags |= SPLICE_F_MORE;
 
 	WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));
 
@@ -978,14 +982,12 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		sd->total_len = read_len;
 
 		/*
-		 * If more data is pending, set SPLICE_F_MORE
-		 * If this is the last data and SPLICE_F_MORE was not set
-		 * initially, clears it.
+		 * If we now have sufficient data to fulfill the request then
+		 * we clear SPLICE_F_MORE if it was not set initially.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if (read_len >= len && !more)
 			sd->flags &= ~SPLICE_F_MORE;
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f63e4405cf34..5d48391da16c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 		}
 	}
 
+	if (!msg_data_left(msg) && eor)
+		goto copied;
+
 	while (msg_data_left(msg)) {
 		if (sk->sk_err) {
 			ret = -sk->sk_err;
diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index e699548d4247..7df31583f2a4 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -377,7 +377,7 @@ static void chunked_sendfile(struct __test_metadata *_metadata,
 	char buf[TLS_PAYLOAD_MAX_LEN];
 	uint16_t test_payload_size;
 	int size = 0;
-	int ret;
+	int ret = 0;
 	char filename[] = "/tmp/mytemp.XXXXXX";
 	int fd = mkstemp(filename);
 	off_t offset = 0;
@@ -398,6 +398,9 @@ static void chunked_sendfile(struct __test_metadata *_metadata,
 		size -= ret;
 	}
 
+	if (ret < chunk_size)
+		EXPECT_EQ(send(self->fd, "", 0, 0), 0);
+
 	EXPECT_EQ(recv(self->cfd, buf, test_payload_size, MSG_WAITALL),
 		  test_payload_size);
Jakub Kicinski June 2, 2023, 4:20 a.m. UTC | #8
On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> The answer then might be to make TLS handle a zero-length send()

IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around 
this exact problem. Your refactoring happens to break it and what
you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST 
is unnecessary, it's user's fault".

A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
with the deadly mess of the unchecked sendmsg()/sendpage() flags.
David Howells June 2, 2023, 8:23 a.m. UTC | #9
Jakub Kicinski <kuba@kernel.org> wrote:

> On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> > The answer then might be to make TLS handle a zero-length send()
> 
> IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around 
> this exact problem. Your refactoring happens to break it and what
> you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST 
> is unnecessary, it's user's fault".
> 
> A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
> with the deadly mess of the unchecked sendmsg()/sendpage() flags.

Not so much the "user's fault" as we couldn't fulfill what the user asked - so
should we leave it to the user to work out how to clean it up rather than
automatically allowing the socket to flush (if cancellation might be an option
instead)?

The problem I have with NOTLAST is that it won't be asserted if the short read
only occupies a single pipe_buf.  We don't know that we won't get some more
data on the next pass.

An alternative way to maintain the current behaviour might be to have
splice_direct_to_actor() call the actor with sd->total_len == 0 if
do_splice_to() returned 0 and SPLICE_F_MORE wasn't set by the caller
(ie. !more).  Attached is a change to do that.  It becomes simpler if/once my
splice_to_socket() patches are applied - but I don't really want to push that
until all the protocols that support sendpage() support sendmsg() +
MSG_SPLICE_PAGES as well[*].

[*] Though if you're okay having a small window where TLS copies data rather
    than splicing, I could push the splice_to_socket() patches *first*.  TCP
    and AF_UNIX splice already support MSG_SPLICE_PAGES so that would bump
    their efficiency.

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..84e9ca06db47 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -643,6 +643,22 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
 		wakeup_pipe_writers(pipe);
 }
 
+/*
+ * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE
+ * turned off to allow the network to see MSG_MORE deasserted.
+ */
+static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe,
+				     struct splice_desc *sd,
+				     splice_actor *actor)
+{
+	struct pipe_buffer buf = {
+		.page	= ZERO_PAGE(0),
+		.ops	= &nosteal_pipe_buf_ops,
+	};
+
+	return actor(pipe, &buf, sd);
+}
+
 /**
  * __splice_from_pipe - splice data from a pipe to given actor
  * @pipe:	pipe to splice from
@@ -662,6 +678,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	int ret;
 
 	splice_from_pipe_begin(sd);
+	if (!sd->total_len)
+		return splice_from_pipe_zero(pipe, sd, actor);
+
 	do {
 		cond_resched();
 		ret = splice_from_pipe_next(pipe, sd);
@@ -956,13 +975,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	 */
 	bytes = 0;
 	len = sd->total_len;
+
+	/* Don't block on output, we have to drain the direct pipe. */
 	flags = sd->flags;
+	sd->flags &= ~SPLICE_F_NONBLOCK;
 
 	/*
-	 * Don't block on output, we have to drain the direct pipe.
+	 * We signal MORE until we've read sufficient data to fulfill the
+	 * request and we keep signalling it if the caller set it.
 	 */
-	sd->flags &= ~SPLICE_F_NONBLOCK;
 	more = sd->flags & SPLICE_F_MORE;
+	sd->flags |= SPLICE_F_MORE;
 
 	WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));
 
@@ -971,21 +994,19 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		loff_t pos = sd->pos, prev_pos = pos;
 
 		ret = do_splice_to(in, &pos, pipe, len, flags);
-		if (unlikely(ret <= 0))
+		if (unlikely(ret < 0))
 			goto out_release;
 
 		read_len = ret;
 		sd->total_len = read_len;
 
 		/*
-		 * If more data is pending, set SPLICE_F_MORE
-		 * If this is the last data and SPLICE_F_MORE was not set
-		 * initially, clears it.
+		 * If we now have sufficient data to fulfill the request then
+		 * we clear SPLICE_F_MORE if it was not set initially.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if ((read_len == 0 || read_len >= len) && !more)
 			sd->flags &= ~SPLICE_F_MORE;
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we
@@ -1005,6 +1026,8 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 			sd->pos = prev_pos + ret;
 			goto out_release;
 		}
+		if (read_len < 0)
+			goto out_release;
 	}
 
 done:
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f63e4405cf34..5d48391da16c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 		}
 	}
 
+	if (!msg_data_left(msg) && eor)
+		goto copied;
+
 	while (msg_data_left(msg)) {
 		if (sk->sk_err) {
 			ret = -sk->sk_err;
Linus Torvalds June 2, 2023, 11:28 a.m. UTC | #10
On Fri, Jun 2, 2023 at 4:23 AM David Howells <dhowells@redhat.com> wrote:
>
> +/*
> + * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE
> + * turned off to allow the network to see MSG_MORE deasserted.
> + */
> +static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe,

David, STOP.

Stop doing these crazy hacks to generic splice code. None of this makes sense.

Those zero-sized splices have never worked, don't try to make them work.

The splice() system call has always had that

        if (!len)
                return 0;

since being introduced. We're not adding any crazy stuff now.

Do what I already suggested: making SPLICE_F_MORE reflect reality.

Or don't do anything at all. Because this kind of hacky garbage is not
even amusing.

               Linus
David Howells June 2, 2023, 11:44 a.m. UTC | #11
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Do what I already suggested: making SPLICE_F_MORE reflect reality.

I'm trying to.  I need MSG_MORE to behave sensibly for what I want.

What I have signals SPLICE_F_MORE (and thus MSG_MORE) as long as we haven't
yet read enough data to fulfill the request - and will break out of the loop
if we get a zero-length read.

But this causes a change in behaviour because we then leave the protocol
having seen MSG_MORE set where it didn't previously see that.

This causes "tls -r tls.12_aes_gcm.multi_chunk_sendfile" on the TLS kselftest
to fail.

Now, if we're fine with the change in behaviour, I can make the selftest
observe the short sendfile() and cancel MSG_MORE itself - but that's just a
test program.

So that's the question: Do I have to maintain the current behaviour for the
short-splice case?

David
Linus Torvalds June 2, 2023, 12:11 p.m. UTC | #12
On Fri, Jun 2, 2023 at 7:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Do what I already suggested: making SPLICE_F_MORE reflect reality.
>
> I'm trying to.  I need MSG_MORE to behave sensibly for what I want.

But you need to stop doing these random hacks to fs/splice.c

The point is, you *CANNOT* make SPLICE_F_MORE reflect reality by
hacking fs/splice.c. Really. The generic layer DOES NOT KNOW, AND
FUNDAMENTALLY CANNOT KNOW if there is more data to be had.

So any of these random patches that try to add heuristics to
fs/splice.c will be rejected out of hand. They simply cannot be
correct.

And no, on the whole I do not believe you have to maintain some
selftest. A selftest failure is worrisome in that it clearly shows
that some behavior changed, but the situation here is

 (a) the current behavior is arguably bad and buggy

 (b) if we want to fix that bug, then the current behavior *will* change

Now, the only question then is whether the self-test actually tests
anything that user space actually depends on, or if it just tests some
random corner case.

So the self-test is certainly a ref flag, but not necessarily a very
meaningful one.

It shows that some user-visible change happened, which is always a big
danger flag, but after all that was the whole *point* of the whole
exercise.  The fact that the self-test caught the change is good,
because it means we had test coverage, but when the behavior is
something we *want* to change, the test failure is not a problem in
itself.

So what I think you should do is to fix the bug right, with a clean
patch, and no crazy hacks. That is something we can then apply and
test. All the while knowing full well that "uhhuh, this is a visible
change, we may have to revert it".

If then some *real* load ends up showing a regression, we may just be
screwed. Our current behavior may be buggy, but we have the rule that
once user space depends on kernel bugs, they become features pretty
much by definition, however much we might dislike it.

At that point, we'll have to see what we can do - if anything.

Basically, what I think the SPLICE_F_MORE rules *should* be (and hey,
I may be missing something) is

 1) if the user set that bit in the flags, then it's always true.

    The user basically told us "I will supply more data even after the
splice has finished", so it doesn't matter if the kernel runs out of
data in the middle.

 2) if the splice read side sees "I was asked for N bytes, but I could
only supply X bytes and I still have more to give", when we should set
SPLICE_F_MORE internally ("temporarily") for the next splice call.

    This is basically the "kernel independently knows that there will
be more data" case.

 3) In the end, this is all "best effort" and to some degree
inevitably a heuristic. We cannot see the future. We may hit that case
#2 and set the "there will be more data" bit, but then get a signal
and finish the splice system call before that more data actually
happens.

    Now, presumably the user will then continue the partial splice
after handling the signal, so (3) is still "right", but obviously we
can't _know_ that.

A corollary to (3) is that the reader side may not always know if
there will be more data to be read. For a file source, it's fairly
clear (modulo the obvious caveats - files can be truncated etc etc).
For other splice sources, the "I still have more to give" may not be
as unambiguous.  It is what it is.

Am I missing some important case? Considering that we clearly do *not*
do a great job at  SPLICE_F_MORE right now, I'd really want the
situation to be either that we just make the code "ClearlyCorrect(tm)"
and simple, or we just leave it alone as "that's our odd behavior,
deal with it".

None of this "let's change this all to be even more complex, and
handle some particular special case the way I want" crap.

Do it right, or don't do it at all.

             Linus
Jakub Kicinski June 2, 2023, 4:39 p.m. UTC | #13
On Fri, 2 Jun 2023 08:11:47 -0400 Linus Torvalds wrote:
> If then some *real* load ends up showing a regression, we may just be
> screwed. Our current behavior may be buggy, but we have the rule that
> once user space depends on kernel bugs, they become features pretty
> much by definition, however much we might dislike it.
> 
> At that point, we'll have to see what we can do - if anything.

Can we have a provisional plan of how we'll fix it if someone does
complain? We can't just revert David's work, and if none of the
solutions are appealing - socket implementations may be left holding
the bag.

I dislike the magic zero sends, and I think you do, too. In case of TLS
its unclear whether we should generate an empty record (like UDP would).

Can we add an optional splice_end / short_splice / splice_underflow /
splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback
to struct file_operations?
Linus Torvalds June 2, 2023, 4:53 p.m. UTC | #14
On Fri, Jun 2, 2023 at 12:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Can we add an optional splice_end / short_splice / splice_underflow /
> splice_I_did_not_mean_to_set_more_on_the_previous_call_sorry callback
> to struct file_operations?

A splice_end() operation might well be the simplest model, but I think
it's broken.

It would certainly be easy to implement: file descriptor that doesn't
care about SPLICE_F_MORE - so most of them - would just leave it as
NULL, and the splice code could decide to call it *if* it had left the
last splice with SPLICE_F_MORE, _and_ the user hadn't set it, and the
file descriptor wants that information.

But I think one of the problems here is one of "what the hell is the
meaning of that bit"?

In particular, think about what happens if a signal is pending, and we
return with a partially completed write? There potentially *is* more
data to be sent, it's just not sent by *this* splice() call, as user
space has to handle the signal first.

What is the semantics of SPLICE_F_MORE in that kind of situation?

Which is why I really think that it would be *so* much better if we
really let the whole SPLICE_F_MORE bit be a signal from the *input*
side.

I know I've been harping on this, but just from a "sane semantics"
standpoint, I really think the only thing that *really* makes sense is
for the input side of a splice to say "I gave you X amount of data,
but I have more to give".

And that would *literally* be the semantic meaning of that SPLICE_F_MORE bit.

Wouldn't it be lovely to have some actual documented meaning to it,
which does *not* depend on things like ".. but what if a signal
happens" issues?

And yes, it's entirely possible that I'm missing something, and I'm
misunderstanding what people really want, but I do feel like this is a
somewhat subtle area, and if people really care about the exact
semantics of SPLICE_F_MORE, then we need to *have* exact semantics for
it.

And no, I don't think "splice_end()" can be that exact semantics -
even if it's simple - exactly because splice() is an interruptible
operation, so the "end" of a splice() is simply not a stable thing.

I also do wonder how much we care. What are the situations where the
packet boundaries can really matter in actual real world. Exactly
because I'm not 100% convinced we've had super-stable behavior here.

The fact that a test-case never triggers signal handling in the middle
of a splice() call isn't exactly a huge surprise. The test case
probably doesn't *have* signals. But it just means that the test-case
isn't all that real-life.

              Linus
Linus Torvalds June 2, 2023, 5:05 p.m. UTC | #15
On Fri, Jun 2, 2023 at 12:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And no, I don't think "splice_end()" can be that exact semantics -
> even if it's simple - exactly because splice() is an interruptible
> operation, so the "end" of a splice() is simply not a stable thing.

Just to harp some more on this - if SPLICE_F_MORE is seen as purely a
performance hit, with no real semantic value, and will still set
random packet boundaries but we want big packets for all the _usual_
cases, then I think something like "splice_end()" can be a fine
solution regardless of exact semantics.

Alternatively, if we make it the rule that "splice_end()" is only
called on EOF situations - so signals etc do not matter - then the
semantics would be stable and sound fine to me too.

In that second case, I'd like to literally name it that way, and
actually call it "splice_eof()". Because I'd like to really make it
very clear what the semantics would be.

So a "splice_eof()" sounds fine to me, and we'd make the semantics be
the current behavior:

 - splice() sets SPLICE_F_MORE if 'len > read_len'

 - splice() _clears_ SPLICE_F_MORE if we have hit 'len'

 - splice always sets SPLICE_F_MORE if it was passed by the user

BUT with the small new 'splice_eof()' rule that:

 - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
"use all of len" case that cleared SPLICE_F_MORE, *and* we did a
"->splice_in()" that returned EOF (ie zero), *then* we will also do
that ->splice_eof() call.

The above sounds like "stable and possibly useful semantics" to me. It
would just have to be documented.

Is that what people want?

I don't think it's conceptually any different from my suggestion of
saying "->splice_read() can set SPLICE_F_MORE if it has more to give",
just a different implementation that doesn't require changes on the
splice_read() side.

                 Linus
Jakub Kicinski June 2, 2023, 5:38 p.m. UTC | #16
On Fri, 2 Jun 2023 13:05:14 -0400 Linus Torvalds wrote:
> Just to harp some more on this - if SPLICE_F_MORE is seen as purely a
> performance hit, with no real semantic value, and will still set
> random packet boundaries but we want big packets for all the _usual_
> cases, then I think something like "splice_end()" can be a fine
> solution regardless of exact semantics.
> 
> Alternatively, if we make it the rule that "splice_end()" is only
> called on EOF situations - so signals etc do not matter - then the
> semantics would be stable and sound fine to me too.
> 
> In that second case, I'd like to literally name it that way, and
> actually call it "splice_eof()". Because I'd like to really make it
> very clear what the semantics would be.
> 
> So a "splice_eof()" sounds fine to me, and we'd make the semantics be
> the current behavior:
> 
>  - splice() sets SPLICE_F_MORE if 'len > read_len'
> 
>  - splice() _clears_ SPLICE_F_MORE if we have hit 'len'
> 
>  - splice always sets SPLICE_F_MORE if it was passed by the user
> 
> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.
> 
> The above sounds like "stable and possibly useful semantics" to me. It
> would just have to be documented.
> 
> Is that what people want?

->splice_eof() with the proposed semantics sounds perfect for the cases
testers complained about it the past, IMHO. We can pencil that in as the
contingency plan. Actually I like these semantics so much I'm tempted to
ask David to implement it already and save users potential debugging :D

> I don't think it's conceptually any different from my suggestion of
> saying "->splice_read() can set SPLICE_F_MORE if it has more to give",
> just a different implementation that doesn't require changes on the
> splice_read() side.

Setting SPLICE_F_MORE from the input side does feel much cleaner than
guessing in splice.c. But we may end up needing the eof() callback for 
the corner cases, anyway :(
David Howells June 2, 2023, 8:38 p.m. UTC | #17
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So a "splice_eof()" sounds fine to me, and we'd make the semantics be
> the current behavior:
> 
>  - splice() sets SPLICE_F_MORE if 'len > read_len'
> 
>  - splice() _clears_ SPLICE_F_MORE if we have hit 'len'
> 
>  - splice always sets SPLICE_F_MORE if it was passed by the user
> 
> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.
> 
> The above sounds like "stable and possibly useful semantics" to me. It
> would just have to be documented.
> 
> Is that what people want?

That's easier to implement, I think.  That's basically what I was trying to
achieve by sending a zero-length actor call, but this is a cleaner way of
doing it, particularly if it's added as a socket op next to ->sendmsg().

Otherwise I have to build up the input side to try and tell me in advance
whether it thinks we hit an EOF/hole/whatever condition.  The problem is that,
as previously mentioned, it doesn't work for all circumstances - seqfile,
pipes, sockets for instance.

Take the following scenario for example: I could read from a TCP socket,
filling up the pipe-buffer, but not with sufficient data to fulfill the
operation.  Say I drain the TCP socket, but it's still open, so might produce
more data.  I then call the actor, which passes all the data to sendmsg() with
MSG_SPLICE_PAGES and MSG_MORE and clears the buffer.  I then go round again,
but in the meantime, the source socket got shut down with no further data
available and do_splice_to() returns 0.

There's no way to predict this, so having a ->splice_eof() call would handle
this situation.

David
David Howells June 2, 2023, 8:50 p.m. UTC | #18
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> BUT with the small new 'splice_eof()' rule that:
> 
>  - if the user did *not* set SPLICE_F_MORE *and* we didn't hit that
> "use all of len" case that cleared SPLICE_F_MORE, *and* we did a
> "->splice_in()" that returned EOF (ie zero), *then* we will also do
> that ->splice_eof() call.

But not if we didn't splice anything at all yet?

David
David Laight June 5, 2023, 11:03 a.m. UTC | #19
From: Linus Torvalds
> Sent: 01 June 2023 16:12
> 
> On Thu, Jun 1, 2023 at 10:34 AM David Howells <dhowells@redhat.com> wrote:
> >
> > At the moment, it transcribes 16 pages at a time.  I could make it set
> > MSG_MORE only if (a) SPLICE_F_MORE was passed into the splice() syscall or (b)
> > there's yet more data in the buffer.
> 
> That would at least be a good first step.
> 
> > However, this might well cause a malfunction in UDP, for example.  MSG_MORE
> > corks the current packet, so if I ask sendfile() say shove 32K into a packet,
> > if, say, 16K is read from the source and entirely transcribed into the packet,
> 
> If you use splice() for UDP, I don't think you would normally expect
> to get all that well-defined packet boundaries.

Especially since (assuming I've understood other bits of this thread)
the splice() can get split into multiple sendmsg() calls for other
reasons.

What semantics are you trying to implement for AF_TLS?
MSG_MORE has different effects on different protocols.

For UDP the next data is appended to the datagram being built.
(This is really pretty pointless, doing it in the caller will be faster!)

For TCP it stops the pending data being sent immediately.
And more data is appended.
I'm pretty sure it gets sent on timeout.

For SCTP the data chunk created for the sendmsg() isn't sent immediately.
Any more sendmsg(MSG_MORE) get queued until a full ethernet packet
is buffered.
The pending data is sent on timeout.
This is pretty much the only way to get two (or more) DATA chunks
into an ethernet frame when Nagle is disabled.

But I get the impression AF_TLS is deciding not to encode/send
the data because 'there isn't enough'.
That seems wrong.

Note that you can't use a zero length sendmsg() to flush pending
data - if there is no pending data some protocols will send a 
zero length data message.
A socket option/ioctl (eg UNCORK) could be (ab)used to force
queued data be sent.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
David Howells June 5, 2023, 3:52 p.m. UTC | #20
David Laight <David.Laight@ACULAB.COM> wrote:

> > > However, this might well cause a malfunction in UDP, for example.
> > > MSG_MORE corks the current packet, so if I ask sendfile() say shove 32K
> > > into a packet, if, say, 16K is read from the source and entirely
> > > transcribed into the packet,
> > 
> > If you use splice() for UDP, I don't think you would normally expect
> > to get all that well-defined packet boundaries.
> 
> Especially since (assuming I've understood other bits of this thread)
> the splice() can get split into multiple sendmsg() calls for other
> reasons.

Yes - with SPLICE_F_MORE/MSG_MORE set on all but the last piece.  The issue is
what happens if the input side gets a premature EOF after we've passed a chunk
with MSG_MORE set when the caller didn't indicate SPLICE_F_MORE?

> What semantics are you trying to implement for AF_TLS?

As I understand it, deasserting MSG_MORE causes a record boundary to be
interposed on TLS.

> MSG_MORE has different effects on different protocols.

Do you mean "different protocols" in relation to TLS specifically? Software vs
device vs device-specific like Chelsio-TLS?

> For UDP the next data is appended to the datagram being built.
> (This is really pretty pointless, doing it in the caller will be faster!)

Splice with SPLICE_F_MORE seems to work the same as sendmsg with MSG_MORE
here.  You get an error if you try to append with splice or sendmsg more than
a single packet will hold.

> For TCP it stops the pending data being sent immediately.
> And more data is appended.
> I'm pretty sure it gets sent on timeout.

Yeah - corking is used by some network filesystem protocols, presumably to
better place RPC messages into TCP packets.

> For SCTP the data chunk created for the sendmsg() isn't sent immediately.
> Any more sendmsg(MSG_MORE) get queued until a full ethernet packet
> is buffered.
> The pending data is sent on timeout.
> This is pretty much the only way to get two (or more) DATA chunks
> into an ethernet frame when Nagle is disabled.

SCTP doesn't support sendpage, so that's not an issue.

> But I get the impression AF_TLS is deciding not to encode/send
> the data because 'there isn't enough'.
> That seems wrong.
> 
> Note that you can't use a zero length sendmsg() to flush pending
> data - if there is no pending data some protocols will send a 
> zero length data message.
> A socket option/ioctl (eg UNCORK) could be (ab)used to force
> queued data be sent.

Yeah - I've changed that, see v4.  I've implemented Linus's ->splice_eof()
idea.

David
diff mbox series

Patch

diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..a7cf216c02a7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -982,10 +982,21 @@  ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		 * If this is the last data and SPLICE_F_MORE was not set
 		 * initially, clears it.
 		 */
-		if (read_len < len)
-			sd->flags |= SPLICE_F_MORE;
-		else if (!more)
+		if (read_len < len) {
+			struct inode *ii = in->f_mapping->host;
+
+			if (ii->i_fop->llseek != noop_llseek &&
+			    pos >= i_size_read(ii)) {
+				if (!more)
+					sd->flags &= ~SPLICE_F_MORE;
+			} else {
+				sd->flags |= SPLICE_F_MORE;
+			}
+
+		} else if (!more) {
 			sd->flags &= ~SPLICE_F_MORE;
+		}
+
 		/*
 		 * NOTE: nonblocking mode only applies to the input. We
 		 * must not do the output in nonblocking mode as then we