diff mbox series

[01/44] 9p: handling Rerror without copy_from_iter_full()

Message ID 20220622041552.737754-1-viro@zeniv.linux.org.uk (mailing list archive)
State New, archived
Headers show
Series [01/44] 9p: handling Rerror without copy_from_iter_full() | expand

Commit Message

Al Viro June 22, 2022, 4:15 a.m. UTC
p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
and loose with copy_from_iter_full().

	Reading from file is done by sending Tread request.  Response
consists of fixed-sized header (including the amount of data actually
read) followed by the data itself.

	For zero-copy case we arrange the things so that the first
11 bytes of reply go into the fixed-sized buffer, with the rest going
straight into the pages we want to read into.

	What makes the things inconvenient is that sglist describing
what should go where has to be set *before* the reply arrives.  As
the result, if reply is an error, the things get interesting.  On success
we get
	size[4] Rread tag[2] count[4] data[count]
For error layout varies depending upon the protocol variant -
in original 9P and 9P2000 it's
	size[4] Rerror tag[2] len[2] error[len]
in 9P2000.U
	size[4] Rerror tag[2] len[2] error[len] errno[4]
in 9P2000.L
	size[4] Rlerror tag[2] errno[4]

	The last case is nice and simple - we have an 11-byte response
that fits into the fixed-sized buffer we hoped to get an Rread into.
In other two, though, we get a variable-length string spill into the
pages we'd prepared for the data to be read.

	Had that been in fixed-sized buffer (which is actually 4K),
we would've dealt with that the same way we handle non-zerocopy case.
However, for zerocopy it doesn't end up there, so we need to copy it
from those pages.

	The trouble is, by the time we get around to that, the
references to pages in question are already dropped.  As the result,
p9_zc_check_errors() tries to get the data using copy_from_iter_full().
Unfortunately, the iov_iter it's trying to read from might *NOT* be
capable of that.  It is, after all, a data destination, not data source.
In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
simply fail.

	In ->zc_request() itself we do have those pages and dealing with
the problem in there would be a simple matter of memcpy_from_page()
into the fixed-sized buffer.  Moreover, it isn't hard to recognize
the (rare) case when such copying is needed.  That way we get rid of
p9_zc_check_errors() entirely - p9_check_errors() can be used instead
both for zero-copy and non-zero-copy cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/9p/client.c       | 86 +------------------------------------------
 net/9p/trans_virtio.c | 34 +++++++++++++++++
 2 files changed, 35 insertions(+), 85 deletions(-)

Comments

Dominique Martinet July 1, 2022, 6:21 a.m. UTC | #1
+Christian Schoenebeck in Ccs as that concerns qemu as well.

The patch I'm replying to is at
https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk

Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
>         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> and loose with copy_from_iter_full().
> 
> 	Reading from file is done by sending Tread request.  Response
> consists of fixed-sized header (including the amount of data actually
> read) followed by the data itself.
> 
> 	For zero-copy case we arrange the things so that the first
> 11 bytes of reply go into the fixed-sized buffer, with the rest going
> straight into the pages we want to read into.
> 
> 	What makes the things inconvenient is that sglist describing
> what should go where has to be set *before* the reply arrives.  As
> the result, if reply is an error, the things get interesting.  On success
> we get
> 	size[4] Rread tag[2] count[4] data[count]
> For error layout varies depending upon the protocol variant -
> in original 9P and 9P2000 it's
> 	size[4] Rerror tag[2] len[2] error[len]
> in 9P2000.U
> 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> in 9P2000.L
> 	size[4] Rlerror tag[2] errno[4]
> 
> 	The last case is nice and simple - we have an 11-byte response
> that fits into the fixed-sized buffer we hoped to get an Rread into.
> In other two, though, we get a variable-length string spill into the
> pages we'd prepared for the data to be read.
> 
> 	Had that been in fixed-sized buffer (which is actually 4K),
> we would've dealt with that the same way we handle non-zerocopy case.
> However, for zerocopy it doesn't end up there, so we need to copy it
> from those pages.
> 
> 	The trouble is, by the time we get around to that, the
> references to pages in question are already dropped.  As the result,
> p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> Unfortunately, the iov_iter it's trying to read from might *NOT* be
> capable of that.  It is, after all, a data destination, not data source.
> In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> simply fail.
> 
> 	In ->zc_request() itself we do have those pages and dealing with
> the problem in there would be a simple matter of memcpy_from_page()
> into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> the (rare) case when such copying is needed.  That way we get rid of
> p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> both for zero-copy and non-zero-copy cases.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I ran basic tests with this, should be ok given the code path is never
used on normal (9p2000.L) workloads.


I also tried 9p2000.u for principle and ... I have no idea if this works
but it didn't seem to blow up there at least.
The problem is that 9p2000.u just doesn't work well even without these
patches, so I still stand by what I said about 9p2000.u and virtio (zc
interface): we really can (and I think should) just say virtio doesn't
support 9p2000.u.
(and could then further simplify this)

If you're curious, 9p2000.u hangs without your patch on at least two
different code paths (trying to read a huge buffer aborts sending a
reply because msize is too small instead of clamping it, that one has a
qemu warning message; but there are others ops like copyrange that just
fail silently and I didn't investigate)

I'd rather not fool someone into believing we support it when nobody has
time to maintain it and it fails almost immediately when user requests
some unusual IO patterns... And I definitely don't have time to even try
fixing it.
I'll suggest the same thing to qemu lists if we go that way.


Anyway, for anything useful:

Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>

--
Dominique
Dominique Martinet July 1, 2022, 6:25 a.m. UTC | #2
(sigh, I'm tired -- said I'd add Christian in Ccs and promply forgot to
do it. Sorry for double send to everyone else.)

+Christian Schoenebeck in Ccs as that concerns qemu as well.

The patch I'm replying to is at
https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk

Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
>         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> and loose with copy_from_iter_full().
> 
> 	Reading from file is done by sending Tread request.  Response
> consists of fixed-sized header (including the amount of data actually
> read) followed by the data itself.
> 
> 	For zero-copy case we arrange the things so that the first
> 11 bytes of reply go into the fixed-sized buffer, with the rest going
> straight into the pages we want to read into.
> 
> 	What makes the things inconvenient is that sglist describing
> what should go where has to be set *before* the reply arrives.  As
> the result, if reply is an error, the things get interesting.  On success
> we get
> 	size[4] Rread tag[2] count[4] data[count]
> For error layout varies depending upon the protocol variant -
> in original 9P and 9P2000 it's
> 	size[4] Rerror tag[2] len[2] error[len]
> in 9P2000.U
> 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> in 9P2000.L
> 	size[4] Rlerror tag[2] errno[4]
> 
> 	The last case is nice and simple - we have an 11-byte response
> that fits into the fixed-sized buffer we hoped to get an Rread into.
> In other two, though, we get a variable-length string spill into the
> pages we'd prepared for the data to be read.
> 
> 	Had that been in fixed-sized buffer (which is actually 4K),
> we would've dealt with that the same way we handle non-zerocopy case.
> However, for zerocopy it doesn't end up there, so we need to copy it
> from those pages.
> 
> 	The trouble is, by the time we get around to that, the
> references to pages in question are already dropped.  As the result,
> p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> Unfortunately, the iov_iter it's trying to read from might *NOT* be
> capable of that.  It is, after all, a data destination, not data source.
> In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> simply fail.
> 
> 	In ->zc_request() itself we do have those pages and dealing with
> the problem in there would be a simple matter of memcpy_from_page()
> into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> the (rare) case when such copying is needed.  That way we get rid of
> p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> both for zero-copy and non-zero-copy cases.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I ran basic tests with this, should be ok given the code path is never
used on normal (9p2000.L) workloads.


I also tried 9p2000.u for principle and ... I have no idea if this works
but it didn't seem to blow up there at least.
The problem is that 9p2000.u just doesn't work well even without these
patches, so I still stand by what I said about 9p2000.u and virtio (zc
interface): we really can (and I think should) just say virtio doesn't
support 9p2000.u.
(and could then further simplify this)

If you're curious, 9p2000.u hangs without your patch on at least two
different code paths (trying to read a huge buffer aborts sending a
reply because msize is too small instead of clamping it, that one has a
qemu warning message; but there are others ops like copyrange that just
fail silently and I didn't investigate)

I'd rather not fool someone into believing we support it when nobody has
time to maintain it and it fails almost immediately when user requests
some unusual IO patterns... And I definitely don't have time to even try
fixing it.
I'll suggest the same thing to qemu lists if we go that way.


Anyway, for anything useful:

Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>

--
Dominique
Christian Schoenebeck July 1, 2022, 4:02 p.m. UTC | #3
On Freitag, 1. Juli 2022 08:25:49 CEST Dominique Martinet wrote:
> (sigh, I'm tired -- said I'd add Christian in Ccs and promply forgot to
> do it. Sorry for double send to everyone else.)
> 
> +Christian Schoenebeck in Ccs as that concerns qemu as well.
> 
> The patch I'm replying to is at
> https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk
> 
> Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
> >         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> > 
> > and loose with copy_from_iter_full().
> > 
> > 	Reading from file is done by sending Tread request.  Response
> > 
> > consists of fixed-sized header (including the amount of data actually
> > read) followed by the data itself.
> > 
> > 	For zero-copy case we arrange the things so that the first
> > 
> > 11 bytes of reply go into the fixed-sized buffer, with the rest going
> > straight into the pages we want to read into.
> > 
> > 	What makes the things inconvenient is that sglist describing
> > 
> > what should go where has to be set *before* the reply arrives.  As
> > the result, if reply is an error, the things get interesting.  On success
> > we get
> > 
> > 	size[4] Rread tag[2] count[4] data[count]
> > 
> > For error layout varies depending upon the protocol variant -
> > in original 9P and 9P2000 it's
> > 
> > 	size[4] Rerror tag[2] len[2] error[len]
> > 
> > in 9P2000.U
> > 
> > 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> > 
> > in 9P2000.L
> > 
> > 	size[4] Rlerror tag[2] errno[4]
> > 	
> > 	The last case is nice and simple - we have an 11-byte response
> > 
> > that fits into the fixed-sized buffer we hoped to get an Rread into.
> > In other two, though, we get a variable-length string spill into the
> > pages we'd prepared for the data to be read.
> > 
> > 	Had that been in fixed-sized buffer (which is actually 4K),
> > 
> > we would've dealt with that the same way we handle non-zerocopy case.
> > However, for zerocopy it doesn't end up there, so we need to copy it
> > from those pages.
> > 
> > 	The trouble is, by the time we get around to that, the
> > 
> > references to pages in question are already dropped.  As the result,
> > p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> > Unfortunately, the iov_iter it's trying to read from might *NOT* be
> > capable of that.  It is, after all, a data destination, not data source.
> > In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> > simply fail.
> > 
> > 	In ->zc_request() itself we do have those pages and dealing with
> > 
> > the problem in there would be a simple matter of memcpy_from_page()
> > into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> > the (rare) case when such copying is needed.  That way we get rid of
> > p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> > both for zero-copy and non-zero-copy cases.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> I ran basic tests with this, should be ok given the code path is never
> used on normal (9p2000.L) workloads.

I haven't read this patch in detail yet, but upfront: POSIX error strings are 
like what, max. 128 bytes, no? So my expectation therefore would be that this 
patch could be further simplified.

Apart from that, I would rename handle_rerror() to something that reflects 
better what it actually does, e.g. unsparse_error() or cp_rerror_to_sdata().

> I also tried 9p2000.u for principle and ... I have no idea if this works
> but it didn't seem to blow up there at least.
> The problem is that 9p2000.u just doesn't work well even without these
> patches, so I still stand by what I said about 9p2000.u and virtio (zc
> interface): we really can (and I think should) just say virtio doesn't
> support 9p2000.u.
> (and could then further simplify this)
>
> If you're curious, 9p2000.u hangs without your patch on at least two
> different code paths (trying to read a huge buffer aborts sending a
> reply because msize is too small instead of clamping it, that one has a
> qemu warning message; but there are others ops like copyrange that just
> fail silently and I didn't investigate)

Last time I tested 9p2000.u was with the "remove msize limit" (WIP) patches:
https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
Where I did not observe any issue with 9p2000.u.

What msize are we talking about, or can you tell a way to reproduce?

> I'd rather not fool someone into believing we support it when nobody has
> time to maintain it and it fails almost immediately when user requests
> some unusual IO patterns... And I definitely don't have time to even try
> fixing it.
> I'll suggest the same thing to qemu lists if we go that way.

Yeah, the situation with 9p2000.u in QEMU is similar in the sense that 
9p2000.u is barely used, little contributions, code not in good shape (e.g 
slower in many aspects in comparison to 9p2000.L), and for that reason I 
discussed with Greg to deprecate 9p2000.u in QEMU (not done yet). We are not 
aware about any serious issue with 9p2000.u though.

Best regards,
Christian Schoenebeck
Dominique Martinet July 1, 2022, 9 p.m. UTC | #4
Christian Schoenebeck wrote on Fri, Jul 01, 2022 at 06:02:31PM +0200:
> > I also tried 9p2000.u for principle and ... I have no idea if this works
> > but it didn't seem to blow up there at least.
> > The problem is that 9p2000.u just doesn't work well even without these
> > patches, so I still stand by what I said about 9p2000.u and virtio (zc
> > interface): we really can (and I think should) just say virtio doesn't
> > support 9p2000.u.
> > (and could then further simplify this)
> >
> > If you're curious, 9p2000.u hangs without your patch on at least two
> > different code paths (trying to read a huge buffer aborts sending a
> > reply because msize is too small instead of clamping it, that one has a
> > qemu warning message; but there are others ops like copyrange that just
> > fail silently and I didn't investigate)
> 
> Last time I tested 9p2000.u was with the "remove msize limit" (WIP) patches:
> https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
> Where I did not observe any issue with 9p2000.u.
> 
> What msize are we talking about, or can you tell a way to reproduce?

I just ran fsstress on a
`mount -t 9p -o cache=none,trans=virtio,version=9p2000.u` mount on
v5.19-rc2:

fsstress -d /mnt/fstress -n 1000 -v

If that doesn't reproduce for you (and you care) I can turn some more
logs on, but from the look of it it could very well be msize related, I
just didn't check as I don't expect any real user

--
Dominique
Christian Schoenebeck July 3, 2022, 1:30 p.m. UTC | #5
On Freitag, 1. Juli 2022 23:00:06 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Jul 01, 2022 at 06:02:31PM +0200:
> > > I also tried 9p2000.u for principle and ... I have no idea if this works
> > > but it didn't seem to blow up there at least.
> > > The problem is that 9p2000.u just doesn't work well even without these
> > > patches, so I still stand by what I said about 9p2000.u and virtio (zc
> > > interface): we really can (and I think should) just say virtio doesn't
> > > support 9p2000.u.
> > > (and could then further simplify this)
> > > 
> > > If you're curious, 9p2000.u hangs without your patch on at least two
> > > different code paths (trying to read a huge buffer aborts sending a
> > > reply because msize is too small instead of clamping it, that one has a
> > > qemu warning message; but there are others ops like copyrange that just
> > > fail silently and I didn't investigate)
> > 
> > Last time I tested 9p2000.u was with the "remove msize limit" (WIP)
> > patches:
> > https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
> > Where I did not observe any issue with 9p2000.u.
> > 
> > What msize are we talking about, or can you tell a way to reproduce?
> 
> I just ran fsstress on a
> `mount -t 9p -o cache=none,trans=virtio,version=9p2000.u` mount on
> v5.19-rc2:
> 
> fsstress -d /mnt/fstress -n 1000 -v
> 
> If that doesn't reproduce for you (and you care) I can turn some more
> logs on, but from the look of it it could very well be msize related, I
> just didn't check as I don't expect any real user

Confirmed. :/ I tested with various kernel versions (also w/wo "remove msize 
limit" WIP patches), different combinations of msize and cache options. They 
all start to hang the fsstress app with 9p2000.u protocol version, sometimes 
sooner, sometimes later.

BTW, fsstress does not pass with 9p2000.L here either, it would not hang, but 
it consistently terminates fsstress with (cache=none|mmap):

    posix_memalign: Invalid argument

@Greg: JFYI

Best regards,
Christian Schoenebeck
diff mbox series

Patch

diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..d403085b9ef5 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -550,90 +550,6 @@  static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
 	return err;
 }
 
-/**
- * p9_check_zc_errors - check 9p packet for error return and process it
- * @c: current client instance
- * @req: request to parse and check for error conditions
- * @uidata: external buffer containing error
- * @in_hdrlen: Size of response protocol buffer.
- *
- * returns error code if one is discovered, otherwise returns 0
- *
- * this will have to be more complicated if we have multiple
- * error packet types
- */
-
-static int p9_check_zc_errors(struct p9_client *c, struct p9_req_t *req,
-			      struct iov_iter *uidata, int in_hdrlen)
-{
-	int err;
-	int ecode;
-	s8 type;
-	char *ename = NULL;
-
-	err = p9_parse_header(&req->rc, NULL, &type, NULL, 0);
-	/* dump the response from server
-	 * This should be after parse_header which poplulate pdu_fcall.
-	 */
-	trace_9p_protocol_dump(c, &req->rc);
-	if (err) {
-		p9_debug(P9_DEBUG_ERROR, "couldn't parse header %d\n", err);
-		return err;
-	}
-
-	if (type != P9_RERROR && type != P9_RLERROR)
-		return 0;
-
-	if (!p9_is_proto_dotl(c)) {
-		/* Error is reported in string format */
-		int len;
-		/* 7 = header size for RERROR; */
-		int inline_len = in_hdrlen - 7;
-
-		len = req->rc.size - req->rc.offset;
-		if (len > (P9_ZC_HDR_SZ - 7)) {
-			err = -EFAULT;
-			goto out_err;
-		}
-
-		ename = &req->rc.sdata[req->rc.offset];
-		if (len > inline_len) {
-			/* We have error in external buffer */
-			if (!copy_from_iter_full(ename + inline_len,
-						 len - inline_len, uidata)) {
-				err = -EFAULT;
-				goto out_err;
-			}
-		}
-		ename = NULL;
-		err = p9pdu_readf(&req->rc, c->proto_version, "s?d",
-				  &ename, &ecode);
-		if (err)
-			goto out_err;
-
-		if (p9_is_proto_dotu(c) && ecode < 512)
-			err = -ecode;
-
-		if (!err) {
-			err = p9_errstr2errno(ename, strlen(ename));
-
-			p9_debug(P9_DEBUG_9P, "<<< RERROR (%d) %s\n",
-				 -ecode, ename);
-		}
-		kfree(ename);
-	} else {
-		err = p9pdu_readf(&req->rc, c->proto_version, "d", &ecode);
-		err = -ecode;
-
-		p9_debug(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
-	}
-	return err;
-
-out_err:
-	p9_debug(P9_DEBUG_ERROR, "couldn't parse error%d\n", err);
-	return err;
-}
-
 static struct p9_req_t *
 p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
 
@@ -874,7 +790,7 @@  static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
 	if (err < 0)
 		goto reterr;
 
-	err = p9_check_zc_errors(c, req, uidata, in_hdrlen);
+	err = p9_check_errors(c, req);
 	trace_9p_client_res(c, type, req->rc.tag, err);
 	if (!err)
 		return req;
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index b24a4fb0f0a2..2a210c2f8e40 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -377,6 +377,35 @@  static int p9_get_mapped_pages(struct virtio_chan *chan,
 	}
 }
 
+static void handle_rerror(struct p9_req_t *req, int in_hdr_len,
+			  size_t offs, struct page **pages)
+{
+	unsigned size, n;
+	void *to = req->rc.sdata + in_hdr_len;
+
+	// Fits entirely into the static data?  Nothing to do.
+	if (req->rc.size < in_hdr_len)
+		return;
+
+	// Really long error message?  Tough, truncate the reply.  Might get
+	// rejected (we can't be arsed to adjust the size encoded in header,
+	// or string size for that matter), but it wouldn't be anything valid
+	// anyway.
+	if (unlikely(req->rc.size > P9_ZC_HDR_SZ))
+		req->rc.size = P9_ZC_HDR_SZ;
+
+	// data won't span more than two pages
+	size = req->rc.size - in_hdr_len;
+	n = PAGE_SIZE - offs;
+	if (size > n) {
+		memcpy_from_page(to, *pages++, offs, n);
+		offs = 0;
+		to += n;
+		size -= n;
+	}
+	memcpy_from_page(to, *pages, offs, size);
+}
+
 /**
  * p9_virtio_zc_request - issue a zero copy request
  * @client: client instance issuing the request
@@ -503,6 +532,11 @@  p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	kicked = 1;
 	p9_debug(P9_DEBUG_TRANS, "virtio request kicked\n");
 	err = wait_event_killable(req->wq, req->status >= REQ_STATUS_RCVD);
+	// RERROR needs reply (== error string) in static data
+	if (req->status == REQ_STATUS_RCVD &&
+	    unlikely(req->rc.sdata[4] == P9_RERROR))
+		handle_rerror(req, in_hdr_len, offs, in_pages);
+
 	/*
 	 * Non kernel buffers are pinned, unpin them
 	 */