diff mbox series

[v4,12/12] net/9p: allocate appropriate reduced message buffers

Message ID 8c305df4646b65218978fc6474aa0f5f29b216a0.1640870037.git.linux_oss@crudebyte.com (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Series remove msize limit in virtio transport

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4 this patch: 4
netdev/cc_maintainers warning 2 maintainers not CCed: davem@davemloft.net kuba@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/checkpatch warning CHECK: Alignment should match open parenthesis
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Christian Schoenebeck Dec. 30, 2021, 1:23 p.m. UTC
So far 'msize' was simply used for all 9p message types, which is far
too much and slowed down performance tremendously with large values
for the user-configurable 'msize' option.

Let's stop this waste by using the new p9_msg_buf_size() function for
allocating more appropriate, smaller buffers according to what is
actually sent over the wire.

Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)
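The sizing policy of this patch can be sketched in plain C outside the kernel. Here `msg_buf_size` is a toy stand-in for `p9_msg_buf_size()` with made-up per-type values, and the kernel's GNU `?:` fallback is written out as an explicit ternary:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for p9_msg_buf_size(): returns a small per-type size
 * estimate instead of the full negotiated msize. The values are
 * illustrative only, not the kernel's actual computation. */
static size_t msg_buf_size(int type)
{
	switch (type) {
	case 100: return 4096;	/* e.g. a small fixed-size message */
	default:  return 8192;	/* conservative fallback */
	}
}

/* Mirrors the patch's policy: if the caller passed an explicit size,
 * use it; otherwise estimate from the message type. Never exceed the
 * negotiated msize. */
static size_t alloc_size(size_t msize, size_t requested, int type)
{
	size_t want = requested ? requested : msg_buf_size(type);

	return want < msize ? want : msize;
}
```

With a 512 KiB msize, a message type estimated at 4 KiB now gets a 4 KiB buffer instead of 512 KiB, which is the whole point of the patch.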

Comments

Dominique Martinet April 2, 2022, 2:05 p.m. UTC | #1
Christian Schoenebeck wrote on Thu, Dec 30, 2021 at 02:23:18PM +0100:
> So far 'msize' was simply used for all 9p message types, which is far
> too much and slowed down performance tremendously with large values
> for the user-configurable 'msize' option.
> 
> Let's stop this waste by using the new p9_msg_buf_size() function for
> allocating more appropriate, smaller buffers according to what is
> actually sent over the wire.

By the way, thinking of protocols earlier made me realize this won't
work on RDMA transport...

unlike virtio/tcp/xen, RDMA doesn't "mailbox" messages: there's a pool
of posted buffers, and once a message has been received it looks for the
header in the received message and associates it with the matching
request, but there's no guarantee a small message will use a small
buffer...

This is also going to need some thought, perhaps just copying small
buffers and recycling the buffer if a large one was used? but there
might be a window with no buffer available and I'm not sure what'd
happen, and don't have any RDMA hardware available to test this right
now so this will be fun.


I'm not shooting this down (it's definitely interesting), but we might
need to make it optional until someone with RDMA hardware can validate a
solution.
Christian Schoenebeck April 3, 2022, 11:29 a.m. UTC | #2
On Samstag, 2. April 2022 16:05:36 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Thu, Dec 30, 2021 at 02:23:18PM +0100:
> > So far 'msize' was simply used for all 9p message types, which is far
> > too much and slowed down performance tremendously with large values
> > for the user-configurable 'msize' option.
> > 
> > Let's stop this waste by using the new p9_msg_buf_size() function for
> > allocating more appropriate, smaller buffers according to what is
> > actually sent over the wire.
> 
> By the way, thinking of protocols earlier made me realize this won't
> work on RDMA transport...
> 
> unlike virtio/tcp/xen, RDMA doesn't "mailbox" messages: there's a pool
> of posted buffers, and once a message has been received it looks for the
> header in the received message and associates it with the matching
> request, but there's no guarantee a small message will use a small
> buffer...
> 
> This is also going to need some thought, perhaps just copying small
> buffers and recycling the buffer if a large one was used? but there
> might be a window with no buffer available and I'm not sure what'd
> happen, and don't have any RDMA hardware available to test this right
> now so this will be fun.
> 
> 
> I'm not shooting this down (it's definitely interesting), but we might
> need to make it optional until someone with RDMA hardware can validate a
> solution.

So maybe I should just exclude the 9p RDMA transport from this 9p message size
reduction change in v5, until somebody has had a chance to test this change
with RDMA.

Which makes me wonder: what is the exact hardware, hypervisor, and OS that
supports 9p & RDMA?

In the long term I can imagine adding RDMA transport support on the QEMU 9p
side. There is already RDMA code in QEMU; however, so far I think it is only
used for migration.

Best regards,
Christian Schoenebeck
Dominique Martinet April 3, 2022, 12:37 p.m. UTC | #3
Christian Schoenebeck wrote on Sun, Apr 03, 2022 at 01:29:53PM +0200:
> So maybe I should just exclude the 9p RDMA transport from this 9p message size
> reduction change in v5, until somebody has had a chance to test this change
> with RDMA.

Yes, I'm pretty certain it won't work so we'll want to exclude it unless
we can extend the RDMA protocol to address buffers.

From my understanding, RDMA comes with two types of primitives:
 - recv/send, which 9p exclusively uses: just a pool of buffers
registered to the NIC that get filled on a first-come-first-served basis
(I'm not sure if it's first-fit, or if the message will be truncated, or
if it'll error out if the message doesn't fit... But basically, given
that's what we use for 9p, we have no way of guaranteeing that a read
reply will be filled into the big buffer allocated for it and not
something else)

If we're lucky the algorithm used is smallest-fit first, but it doesn't
look like it:
---
The order of the Receive Request consumptions in a Receive Queue is by
the order that they were posted to it.
When you have a SRQ, you cannot predict which Receive Request will be
consumed by which QP, so all the Receive Requests in that SRQ should be
able to contain the incoming message (in terms of length).
--- https://www.rdmamojo.com/2013/02/02/ibv_post_recv/ (in a comment)


 - read/write, which can be addressed: e.g. the remote end can specify a
cookie along with an address/size and directly operate on remote memory
(hence the "remote direct memory access" name). There is also some cool
stuff that can be done, like atomic compare-and-swap or arithmetic
operations on remote memory, which doesn't really concern us.
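The first-come-first-served behavior described above for the recv/send pool can be illustrated with a toy model (plain C, no libibverbs): posted buffers are consumed strictly in posting order, so an incoming message's size has no influence on which buffer it lands in:

```c
#include <assert.h>
#include <stddef.h>

#define POOL 4

/* Toy model of an RDMA receive queue: buffers are consumed strictly in
 * the order they were posted, regardless of message size. This is a
 * simulation of the semantics, not real verbs code. */
struct recv_queue {
	size_t buf_size[POOL];	/* sizes of posted buffers, FIFO order */
	int head, tail;
};

static void post_recv(struct recv_queue *q, size_t size)
{
	q->buf_size[q->tail++ % POOL] = size;
}

/* Returns the size of the buffer the incoming message lands in, or 0
 * if the message does not fit there (the failure mode here is
 * illustrative; real hardware may truncate or raise an error). */
static size_t deliver(struct recv_queue *q, size_t msg_len)
{
	size_t buf = q->buf_size[q->head++ % POOL];

	return msg_len <= buf ? buf : 0;
}
```

Posting one small and one large buffer and then receiving a large reply shows the problem: the reply is matched against the first-posted small buffer even though a large one is available, which is exactly why per-message buffer shrinking cannot work on this transport as-is.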

Using reads/writes like NFS over RDMA does would resolve the problem and
allow what they call "direct data placement": reading or writing
directly from the page cache or a user buffer as a real zero-copy
operation. But it requires the cookie/address to be sent and the client
to act on it, so it's a real transport-specific protocol change. Given
the low number of users, though, I think that's something that could be
considered if someone wants to work on it.

Until then we'll be safer with that bit disabled...
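The direct-data-placement idea can be sketched as a plain-C toy (the `rdma_cookie` struct and function names here are made up for illustration; real RDMA operates on registered memory regions via rkeys, not raw pointers):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of "direct data placement": the client advertises a
 * destination buffer via a cookie, and the server writes reply data
 * straight into it, with no intermediate receive buffer or copy. */
struct rdma_cookie {
	unsigned char *addr;	/* client-side destination buffer */
	size_t len;		/* how much the server may write */
};

/* Server side: an "RDMA write" directly into the client's buffer.
 * Returns the number of bytes placed. */
static size_t server_rdma_write(const struct rdma_cookie *ck,
				const unsigned char *data, size_t n)
{
	size_t w = n < ck->len ? n : ck->len;

	memcpy(ck->addr, data, w);
	return w;
}
```

The point being modeled: once the server addresses the client's buffer directly, the buffer-matching problem of the shared receive pool disappears, at the cost of a transport-specific protocol extension to carry the cookie.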

> Which makes me wonder: what is the exact hardware, hypervisor, and OS that
> supports 9p & RDMA?

I've used it with Mellanox InfiniBand cards in the past. These support
SR-IOV virtual functions, so they are quite practical for VMs: you could
do the work with a single machine and no cable.

I'm pretty sure it'd work with any recent server hardware that supports
RoCE (I -think- it's getting more common?). As for emulation, ~10 years
ago I got it to run with softiwarp, which has since been merged into the
kernel (siw), so that might be the easiest way to run it now.

Server-side, both diod and nfs-ganesha support 9p over RDMA, I haven't
used diod recently but ganesha ought to work.


> In the long term I can imagine adding RDMA transport support on the QEMU 9p side.

What would you expect it to be used for?

> There is already RDMA code in QEMU; however, so far I think it is only used
> for migration.

Yes, looking at it a bit, there's live migration over RDMA (I tested it
at my previous job), some handling for gluster+rdma, and a
paravirtualized RDMA device (pvrdma).
The docs for it say it works with soft-RoCE, so it would also probably
work for tests (I'm not sure what difference there is between rxe and
siw), but at this point you've just set up virtualized RDMA on the host
anyway...

I'll try to get something set up for tests on my end as well; it's
definitely something I had on my todo...
Christian Schoenebeck April 3, 2022, 2 p.m. UTC | #4
On Sonntag, 3. April 2022 14:37:55 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Apr 03, 2022 at 01:29:53PM +0200:
> > So maybe I should just exclude the 9p RDMA transport from this 9p message
> > size reduction change in v5 until somebody has had a chance to test this
> > change with RDMA.
> 
> Yes, I'm pretty certain it won't work so we'll want to exclude it unless
> we can extend the RDMA protocol to address buffers.

OK, agreed. It only needs a minor adjustment to patch 12 to exclude the
RDMA transport (+2 lines or so), so no big deal.
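The exclusion could look something like this sketch (plain C, with a hypothetical per-transport `pooled_rbuffers` flag; the actual field name and where the check lives are up to the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport descriptor: a transport whose receive buffers
 * come from a shared pool (like RDMA) cannot use per-message sizing.
 * The names here are illustrative, not an existing kernel API. */
struct transport {
	bool pooled_rbuffers;
};

static size_t rsize_for(const struct transport *t, size_t msize,
			size_t estimated)
{
	/* Pooled-receive transports cannot predict which posted buffer
	 * a reply lands in, so keep allocating full-msize buffers. */
	if (t->pooled_rbuffers)
		return msize;
	return estimated < msize ? estimated : msize;
}
```

With that gate in place, virtio/tcp/xen get the small buffers while RDMA keeps its old full-size behavior until someone can validate a real solution.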

> > In the long term I can imagine adding RDMA transport support on the QEMU
> > 9p side.
> What would you expect it to be used for?

Several potential use cases come to mind, e.g.:

- Separating storage hardware from host hardware. With virtio we are
  constrained to the same machine.

- Maybe also a candidate to achieve what the 9p 'synth' driver in QEMU tried
  to achieve? That 'synth' driver runs in a separate process from the
  QEMU process, with the goal of increasing safety. However, it is currently
  more or less abandoned, as it is extremely slow: 9p requests have to be
  dispatched like:

   guest -> QEMU (9p server) -> synth daemon -> QEMU (9p server) -> guest

  Maybe we could get rid of those costly extra hops with RDMA; not sure though.

- Maybe also an alternative to virtio on the same machine: there are some
  shortcomings in virtio that are tedious to address (see e.g. the current
  struggle with purely formal negotiation issues just to relax the virtio
  spec's "Queue Size" requirements so that we could achieve higher message
  sizes). I'm also not a big fan of virtio's assumption that the guest
  should guess the host's response size in advance.

- Maybe as a transport for macOS guest support in the future? Upcoming QEMU 7.0
  adds support for 9p on macOS hosts, which revives the plan to add 9p support
  to macOS guests as well. The question remains which transport to use for
  macOS guests then.

However, I don't know the details of RDMA yet, and as you already
outlined, it probably has some shortcomings that would need to be addressed
with protocol changes as well.

Best regards,
Christian Schoenebeck

Patch

diff --git a/net/9p/client.c b/net/9p/client.c
index 56be1658870d..773915c95219 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -255,19 +255,35 @@  static struct kmem_cache *p9_req_cache;
  * p9_tag_alloc - Allocate a new request.
  * @c: Client session.
  * @type: Transaction type.
- * @t_size: Buffer size for holding this request.
- * @r_size: Buffer size for holding server's reply on this request.
+ * @t_size: Buffer size for holding this request
+ * (automatic calculation by format template if 0).
+ * @r_size: Buffer size for holding server's reply on this request
+ * (automatic calculation by format template if 0).
+ * @fmt: Format template for assembling 9p request message
+ * (see p9pdu_vwritef).
+ * @ap: Variable arguments to be fed to passed format template
+ * (see p9pdu_vwritef).
  *
  * Context: Process context.
  * Return: Pointer to new request.
  */
 static struct p9_req_t *
-p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size)
+p9_tag_alloc(struct p9_client *c, int8_t type, uint t_size, uint r_size,
+	      const char *fmt, va_list ap)
 {
 	struct p9_req_t *req = kmem_cache_alloc(p9_req_cache, GFP_NOFS);
-	int alloc_tsize = min(c->msize, t_size);
-	int alloc_rsize = min(c->msize, r_size);
+	int alloc_tsize;
+	int alloc_rsize;
 	int tag;
+	va_list apc;
+
+	va_copy(apc, ap);
+	alloc_tsize = min_t(size_t, c->msize,
+			    t_size ?: p9_msg_buf_size(c, type, fmt, apc));
+	va_end(apc);
+
+	alloc_rsize = min_t(size_t, c->msize,
+			    r_size ?: p9_msg_buf_size(c, type + 1, fmt, ap));
 
 	if (!req)
 		return ERR_PTR(-ENOMEM);
@@ -685,6 +701,7 @@  static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
 {
 	int err;
 	struct p9_req_t *req;
+	va_list apc;
 
 	p9_debug(P9_DEBUG_MUX, "client %p op %d\n", c, type);
 
@@ -696,7 +713,9 @@  static struct p9_req_t *p9_client_prepare_req(struct p9_client *c,
 	if (c->status == BeginDisconnect && type != P9_TCLUNK)
 		return ERR_PTR(-EIO);
 
-	req = p9_tag_alloc(c, type, t_size, r_size);
+	va_copy(apc, ap);
+	req = p9_tag_alloc(c, type, t_size, r_size, fmt, apc);
+	va_end(apc);
 	if (IS_ERR(req))
 		return req;
 
@@ -733,7 +752,7 @@  p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
 	struct p9_req_t *req;
 
 	va_start(ap, fmt);
-	req = p9_client_prepare_req(c, type, c->msize, c->msize, fmt, ap);
+	req = p9_client_prepare_req(c, type, 0, 0, fmt, ap);
 	va_end(ap);
 	if (IS_ERR(req))
 		return req;