
[v3,26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it

Message ID 20210428110100.27757-27-dgilbert@redhat.com (mailing list archive)
State New, archived
Series virtiofs dax patches

Commit Message

Dr. David Alan Gilbert April 28, 2021, 11:01 a.m. UTC
From: Vivek Goyal <vgoyal@redhat.com>

If the qemu guest asked to drop CAP_FSETID upon write, send that info
to qemu in the SLAVE_FS_IO message so that qemu can drop the capability
before the WRITE. This is to make sure that any setuid bit is killed
on the fd (if there is one set).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 tools/virtiofsd/buffer.c         | 10 ++++++----
 tools/virtiofsd/fuse_common.h    |  6 +++++-
 tools/virtiofsd/fuse_lowlevel.h  |  6 +++++-
 tools/virtiofsd/fuse_virtio.c    |  5 ++++-
 tools/virtiofsd/passthrough_ll.c |  2 +-
 5 files changed, 21 insertions(+), 8 deletions(-)
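
For context on the mechanism the commit message refers to: dropping CAP_FSETID from the effective set before the write makes the kernel clear any setuid/setgid bits on the target file. Below is a minimal sketch of that drop/write/restore pattern, assuming libcap-ng; it is illustrative only and is not the code added by this series.

/*
 * Illustrative sketch only (not code from this series): temporarily drop
 * CAP_FSETID from the effective set around a write so that the kernel
 * clears any setuid/setgid bits on the target file, then restore it.
 * Uses libcap-ng; error handling is minimal.
 */
#include <cap-ng.h>
#include <linux/capability.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t write_with_fsetid_dropped(int fd, const void *buf,
                                         size_t len, off_t off)
{
    capng_get_caps_process();                    /* snapshot current caps */
    capng_update(CAPNG_DROP, CAPNG_EFFECTIVE, CAP_FSETID);
    if (capng_apply(CAPNG_SELECT_CAPS) < 0) {
        fprintf(stderr, "failed to drop CAP_FSETID\n");
        return -1;
    }

    ssize_t ret = pwrite(fd, buf, len, off);     /* kernel kills setuid/setgid */

    /* CAP_FSETID is still in the permitted set, so it can be regained. */
    capng_get_caps_process();
    capng_update(CAPNG_ADD, CAPNG_EFFECTIVE, CAP_FSETID);
    capng_apply(CAPNG_SELECT_CAPS);
    return ret;
}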

Comments

Stefan Hajnoczi May 6, 2021, 3:37 p.m. UTC | #1
On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: Vivek Goyal <vgoyal@redhat.com>
> 
> If qemu guest asked to drop CAP_FSETID upon write, send that info
> to qemu in SLAVE_FS_IO message so that qemu can drop capability
> before WRITE. This is to make sure that any setuid bit is killed
> on fd (if there is one set).
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
running with CAP_FSETID because QEMU is untrusted. CAP_FSETID would allow
QEMU to create setgid files, thereby potentially allowing an attacker to
gain any GID.

I think it's better not to implement QEMU FSETID functionality at all
and to handle it another way. In the worst case I/O requests should just
fail, it seems like a rare case anyway: I/O to a setuid/setgid file with
a memory buffer that is not mapped in virtiofsd.

Stefan
Vivek Goyal May 6, 2021, 4:02 p.m. UTC | #2
On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: Vivek Goyal <vgoyal@redhat.com>
> > 
> > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > before WRITE. This is to make sure that any setuid bit is killed
> > on fd (if there is one set).
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> to create setgid files, thereby potentially allowing an attacker to gain
> any GID.

Sure, it's not recommended to run QEMU as root, but we don't block that
either, and I do regularly test with qemu running as root.

> 
> I think it's better not to implement QEMU FSETID functionality at all
> and to handle it another way.

One way could be for virtiofsd to try to clear the setuid bit after the
I/O has finished. But that would be a non-atomic operation, and it is
fraught with peril because it requires virtiofsd to know exactly what the
kernel would have done if the write had been performed with CAP_FSETID
dropped.
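
To make the race concrete, here is a sketch of what such a post-I/O clean-up would look like; the helper is hypothetical and not part of this series.

/*
 * Hypothetical post-write clean-up (not part of this series): clear the
 * setuid/setgid bits after the I/O has completed.  The fstat()/fchmod()
 * pair is not atomic with the write, and it also has to second-guess the
 * kernel's own rules (e.g. a setgid bit without group-execute is left
 * alone because it means mandatory locking) - exactly the peril described
 * above.
 */
#include <errno.h>
#include <sys/stat.h>

static int clear_suid_sgid_after_write(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1) {
        return -errno;
    }
    if (st.st_mode & (S_ISUID | S_ISGID)) {
        mode_t new_mode = (st.st_mode & 07777) & ~(S_ISUID | S_ISGID);
        /* Window here: the file may have been chmod'ed or written again. */
        if (fchmod(fd, new_mode) == -1) {
            return -errno;
        }
    }
    return 0;
}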

> In the worst case I/O requests should just
> fail, it seems like a rare case anyway:

Is there a way for virtiofsd to know whether qemu is running with
CAP_FSETID or not? If there is, it might be reasonable to error out. If
we can't know, then we can't fail all the operations.

> I/O to a setuid/setgid file with
> a memory buffer that is not mapped in virtiofsd.

With DAX it is easily triggerable. A user only has to append to a setuid
file in virtiofs and this path will be triggered.

I am fine with not supporting this patch, but we will also need a
reasonable alternative solution.

Vivek
Stefan Hajnoczi May 10, 2021, 9:05 a.m. UTC | #3
On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: Vivek Goyal <vgoyal@redhat.com>
> > > 
> > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > before WRITE. This is to make sure that any setuid bit is killed
> > > on fd (if there is one set).
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > 
> > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > to create setgid files, thereby potentially allowing an attacker to gain
> > any GID.
> 
> Sure, its not recommended to run QEMU as root, but we don't block that
> either and I do regularly test with qemu running as root.
> 
> > 
> > I think it's better not to implement QEMU FSETID functionality at all
> > and to handle it another way.
> 
> One way could be that virtiofsd tries to clear setuid bit after I/O
> has finished. But that will be non-atomic operation and it is filled
> with perils as it requires virtiofsd to know what all kernel will
> do if this write has been done with CAP_FSETID dropped.
> 
> > In the worst case I/O requests should just
> > fail, it seems like a rare case anyway:
> 
> Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> or not. If there is one, it might be reasonable to error out. If we
> don't know, then we can't fail all the operations.
> 
> > I/O to a setuid/setgid file with
> > a memory buffer that is not mapped in virtiofsd.
> 
> With DAX it is easily triggerable. User has to append to a setuid file
> in virtiofs and this path will trigger.
> 
> I am fine with not supporting this patch but will also need a reaosonable
> alternative solution.

One way to avoid this problem is by introducing DMA read/write functions
into the vhost-user protocol that can be used by all device types, not
just virtio-fs.

Today virtio-fs uses the IO slave request when it cannot access a region
of guest memory. It sends the file descriptor to QEMU and QEMU performs
the pread(2)/pwrite(2) on behalf of virtiofsd.

I mentioned in the past that this solution is over-specialized. It
doesn't solve the larger problem that vhost-user processes do not have
full access to the guest memory space (e.g. DAX window).

Instead of sending file I/O requests over to QEMU, the vhost-user
protocol should offer DMA read/write requests so any vhost-user process
can access the guest memory space where vhost's shared memory mechanism
is insufficient.

Here is how it would work:

1. Drop the IO slave request, replace it with DMA read/write slave
   requests.

   Note that these new requests can also be used in environments where
   maximum vIOMMU isolation is needed for security reasons and sharing
   all of guest RAM with the vhost-user process is considered
   unacceptable.

2. When virtqueue buffer mapping fails, send DMA read/write slave
   requests to transfer the data from/to QEMU. virtiofsd calls
   pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.
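
A rough sketch of the slave-side write path described in step 2 follows; all names are hypothetical, and dma_read_from_master() stands in for the not-yet-specified DMA read slave request.

/*
 * Sketch of the proposed fallback, with hypothetical names: if the guest
 * buffer is not mapped into the slave, pull the bytes from the master via
 * a DMA read slave request into a bounce buffer, then write them out with
 * the slave's own Linux capabilities.
 */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Stand-in for the proposed slave request; not a real API. */
int dma_read_from_master(uint64_t guest_addr, void *dst, size_t len);

static ssize_t write_guest_buf(int fd, off_t off, void *mapped /* or NULL */,
                               uint64_t guest_addr, size_t len)
{
    if (mapped) {
        /* Fast path: zero-copy from shared guest RAM. */
        return pwrite(fd, mapped, len, off);
    }

    /* Slow path: bounce the data through the master. */
    void *bounce = malloc(len);
    if (!bounce) {
        return -1;
    }
    ssize_t ret = -1;
    if (dma_read_from_master(guest_addr, bounce, len) == 0) {
        ret = pwrite(fd, bounce, len, off);  /* runs with virtiofsd's caps */
    }
    free(bounce);
    return ret;
}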

Stefan
Vivek Goyal May 10, 2021, 3:23 p.m. UTC | #4
On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > 
> > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > on fd (if there is one set).
> > > > 
> > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > 
> > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > to create setgid files, thereby potentially allowing an attacker to gain
> > > any GID.
> > 
> > Sure, its not recommended to run QEMU as root, but we don't block that
> > either and I do regularly test with qemu running as root.
> > 
> > > 
> > > I think it's better not to implement QEMU FSETID functionality at all
> > > and to handle it another way.
> > 
> > One way could be that virtiofsd tries to clear setuid bit after I/O
> > has finished. But that will be non-atomic operation and it is filled
> > with perils as it requires virtiofsd to know what all kernel will
> > do if this write has been done with CAP_FSETID dropped.
> > 
> > > In the worst case I/O requests should just
> > > fail, it seems like a rare case anyway:
> > 
> > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > or not. If there is one, it might be reasonable to error out. If we
> > don't know, then we can't fail all the operations.
> > 
> > > I/O to a setuid/setgid file with
> > > a memory buffer that is not mapped in virtiofsd.
> > 
> > With DAX it is easily triggerable. User has to append to a setuid file
> > in virtiofs and this path will trigger.
> > 
> > I am fine with not supporting this patch but will also need a reaosonable
> > alternative solution.
> 
> One way to avoid this problem is by introducing DMA read/write functions
> into the vhost-user protocol that can be used by all device types, not
> just virtio-fs.
> 
> Today virtio-fs uses the IO slave request when it cannot access a region
> of guest memory. It sends the file descriptor to QEMU and QEMU performs
> the pread(2)/pwrite(2) on behalf of virtiofsd.
> 
> I mentioned in the past that this solution is over-specialized. It
> doesn't solve the larger problem that vhost-user processes do not have
> full access to the guest memory space (e.g. DAX window).
> 
> Instead of sending file I/O requests over to QEMU, the vhost-user
> protocol should offer DMA read/write requests so any vhost-user process
> can access the guest memory space where vhost's shared memory mechanism
> is insufficient.
> 
> Here is how it would work:
> 
> 1. Drop the IO slave request, replace it with DMA read/write slave
>    requests.
> 
>    Note that these new requests can also be used in environments where
>    maximum vIOMMU isolation is needed for security reasons and sharing
>    all of guest RAM with the vhost-user process is considered
>    unacceptable.
> 
> 2. When virtqueue buffer mapping fails, send DMA read/write slave
>    requests to transfer the data from/to QEMU. virtiofsd calls
>    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.

Can you elaborate a bit more on how these new DMA read/write vhost-user
commands could be implemented? I am assuming it's not real DMA, just a
sort of emulation of DMA. Effectively we have two processes, and one
process needs to read/write to/from the address space of the other
process.

We were also wondering if we could make use of the process_vm_readv()
and process_vm_writev() syscalls to achieve this. But that at least
requires virtiofsd to be more privileged than qemu, and virtiofsd also
needs to know where the DAX mapping window is. We briefly discussed
this here.

https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/
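
For reference, this is what the syscall pair being discussed looks like from the caller's side; it needs ptrace-level permission over the target process (e.g. CAP_SYS_PTRACE) and the remote virtual address, which is the privilege and layout knowledge mentioned above. The helper below is a sketch, not code from the series.

/*
 * Minimal sketch of reading from another process's address space with
 * process_vm_readv(2).  The caller needs ptrace-level permission over
 * qemu_pid and must already know the remote virtual address - the DAX
 * window layout information that virtiofsd does not have.
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t read_remote(pid_t qemu_pid, void *remote_addr,
                           void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* Returns the number of bytes copied, or -1 with errno set. */
    return process_vm_readv(qemu_pid, &local, 1, &remote, 1, 0);
}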

Vivek
Stefan Hajnoczi May 10, 2021, 3:32 p.m. UTC | #5
On Mon, May 10, 2021 at 11:23:24AM -0400, Vivek Goyal wrote:
> On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> > On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > > 
> > > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > > on fd (if there is one set).
> > > > > 
> > > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > > 
> > > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > > to create setgid files, thereby potentially allowing an attacker to gain
> > > > any GID.
> > > 
> > > Sure, its not recommended to run QEMU as root, but we don't block that
> > > either and I do regularly test with qemu running as root.
> > > 
> > > > 
> > > > I think it's better not to implement QEMU FSETID functionality at all
> > > > and to handle it another way.
> > > 
> > > One way could be that virtiofsd tries to clear setuid bit after I/O
> > > has finished. But that will be non-atomic operation and it is filled
> > > with perils as it requires virtiofsd to know what all kernel will
> > > do if this write has been done with CAP_FSETID dropped.
> > > 
> > > > In the worst case I/O requests should just
> > > > fail, it seems like a rare case anyway:
> > > 
> > > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > > or not. If there is one, it might be reasonable to error out. If we
> > > don't know, then we can't fail all the operations.
> > > 
> > > > I/O to a setuid/setgid file with
> > > > a memory buffer that is not mapped in virtiofsd.
> > > 
> > > With DAX it is easily triggerable. User has to append to a setuid file
> > > in virtiofs and this path will trigger.
> > > 
> > > I am fine with not supporting this patch but will also need a reaosonable
> > > alternative solution.
> > 
> > One way to avoid this problem is by introducing DMA read/write functions
> > into the vhost-user protocol that can be used by all device types, not
> > just virtio-fs.
> > 
> > Today virtio-fs uses the IO slave request when it cannot access a region
> > of guest memory. It sends the file descriptor to QEMU and QEMU performs
> > the pread(2)/pwrite(2) on behalf of virtiofsd.
> > 
> > I mentioned in the past that this solution is over-specialized. It
> > doesn't solve the larger problem that vhost-user processes do not have
> > full access to the guest memory space (e.g. DAX window).
> > 
> > Instead of sending file I/O requests over to QEMU, the vhost-user
> > protocol should offer DMA read/write requests so any vhost-user process
> > can access the guest memory space where vhost's shared memory mechanism
> > is insufficient.
> > 
> > Here is how it would work:
> > 
> > 1. Drop the IO slave request, replace it with DMA read/write slave
> >    requests.
> > 
> >    Note that these new requests can also be used in environments where
> >    maximum vIOMMU isolation is needed for security reasons and sharing
> >    all of guest RAM with the vhost-user process is considered
> >    unacceptable.
> > 
> > 2. When virtqueue buffer mapping fails, send DMA read/write slave
> >    requests to transfer the data from/to QEMU. virtiofsd calls
> >    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.
> 
> Can you elaborate a bit more how will this new DMA read/write vhost-user
> commands can be implemented. I am assuming its not a real DMA and just
> sort of emulation of DMA. Effectively we have two processes and one
> process needs to read/write to/from address space of other process.
> 
> We were also wondering if we can make use of process_vm_readv()
> and process_vm_write() syscalls to achieve this. But this atleast
> requires virtiofsd to be more priviliged than qemu and also virtiofsd
> needs to know where DAX mapping window is. We briefly discussed this here.
> 
> https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/

I wasn't thinking of directly allowing QEMU virtual memory access via
process_vm_readv/writev(). That would be more efficient but requires
privileges and also exposes internals of QEMU's virtual memory layout
and vIOMMU translation to the vhost-user process.

Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
containing the address (a device IOVA, it could just be a guest physical
memory address in most cases) and the length. The WRITE message would
also contain the data that the vhost-user device wishes to write. The
READ message reply would contain the data that the device read from
QEMU.

QEMU would implement this using QEMU's address_space_read/write() APIs.

So basically just a new vhost-user protocol message to do a memcpy(),
but with guest addresses and vIOMMU support :).

The vhost-user device will need to do bounce buffering so using these
new messages is slower than zero-copy I/O to shared guest RAM.
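
On the QEMU side the handler for such a message could be little more than a wrapper around those address-space APIs. Below is a rough sketch; it assumes the QEMU tree for the includes, the function name is hypothetical, and it ignores the vIOMMU/IOVA translation a real implementation would need.

/*
 * Rough sketch of the master servicing a DMA read/write slave request with
 * address_space_read()/address_space_write().  Illustrative only: it
 * assumes QEMU's internal headers and skips vIOMMU/IOVA translation.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"

static int handle_slave_dma(AddressSpace *as, uint64_t addr,
                            void *buf, uint64_t len, bool is_write)
{
    MemTxResult res;

    if (is_write) {
        /* Slave -> guest memory: write the bytes the slave sent. */
        res = address_space_write(as, addr, MEMTXATTRS_UNSPECIFIED, buf, len);
    } else {
        /* Guest memory -> slave: fill the buffer for the reply. */
        res = address_space_read(as, addr, MEMTXATTRS_UNSPECIFIED, buf, len);
    }
    return res == MEMTX_OK ? 0 : -1;
}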

Stefan
Dr. David Alan Gilbert May 27, 2021, 7:09 p.m. UTC | #6
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Mon, May 10, 2021 at 11:23:24AM -0400, Vivek Goyal wrote:
> > On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> > > On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > > > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > > > 
> > > > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > > > on fd (if there is one set).
> > > > > > 
> > > > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > > > 
> > > > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > > > to create setgid files, thereby potentially allowing an attacker to gain
> > > > > any GID.
> > > > 
> > > > Sure, its not recommended to run QEMU as root, but we don't block that
> > > > either and I do regularly test with qemu running as root.
> > > > 
> > > > > 
> > > > > I think it's better not to implement QEMU FSETID functionality at all
> > > > > and to handle it another way.
> > > > 
> > > > One way could be that virtiofsd tries to clear setuid bit after I/O
> > > > has finished. But that will be non-atomic operation and it is filled
> > > > with perils as it requires virtiofsd to know what all kernel will
> > > > do if this write has been done with CAP_FSETID dropped.
> > > > 
> > > > > In the worst case I/O requests should just
> > > > > fail, it seems like a rare case anyway:
> > > > 
> > > > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > > > or not. If there is one, it might be reasonable to error out. If we
> > > > don't know, then we can't fail all the operations.
> > > > 
> > > > > I/O to a setuid/setgid file with
> > > > > a memory buffer that is not mapped in virtiofsd.
> > > > 
> > > > With DAX it is easily triggerable. User has to append to a setuid file
> > > > in virtiofs and this path will trigger.
> > > > 
> > > > I am fine with not supporting this patch but will also need a reaosonable
> > > > alternative solution.
> > > 
> > > One way to avoid this problem is by introducing DMA read/write functions
> > > into the vhost-user protocol that can be used by all device types, not
> > > just virtio-fs.
> > > 
> > > Today virtio-fs uses the IO slave request when it cannot access a region
> > > of guest memory. It sends the file descriptor to QEMU and QEMU performs
> > > the pread(2)/pwrite(2) on behalf of virtiofsd.
> > > 
> > > I mentioned in the past that this solution is over-specialized. It
> > > doesn't solve the larger problem that vhost-user processes do not have
> > > full access to the guest memory space (e.g. DAX window).
> > > 
> > > Instead of sending file I/O requests over to QEMU, the vhost-user
> > > protocol should offer DMA read/write requests so any vhost-user process
> > > can access the guest memory space where vhost's shared memory mechanism
> > > is insufficient.
> > > 
> > > Here is how it would work:
> > > 
> > > 1. Drop the IO slave request, replace it with DMA read/write slave
> > >    requests.
> > > 
> > >    Note that these new requests can also be used in environments where
> > >    maximum vIOMMU isolation is needed for security reasons and sharing
> > >    all of guest RAM with the vhost-user process is considered
> > >    unacceptable.
> > > 
> > > 2. When virtqueue buffer mapping fails, send DMA read/write slave
> > >    requests to transfer the data from/to QEMU. virtiofsd calls
> > >    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.
> > 
> > Can you elaborate a bit more how will this new DMA read/write vhost-user
> > commands can be implemented. I am assuming its not a real DMA and just
> > sort of emulation of DMA. Effectively we have two processes and one
> > process needs to read/write to/from address space of other process.
> > 
> > We were also wondering if we can make use of process_vm_readv()
> > and process_vm_write() syscalls to achieve this. But this atleast
> > requires virtiofsd to be more priviliged than qemu and also virtiofsd
> > needs to know where DAX mapping window is. We briefly discussed this here.
> > 
> > https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/
> 
> I wasn't thinking of directly allowing QEMU virtual memory access via
> process_vm_readv/writev(). That would be more efficient but requires
> privileges and also exposes internals of QEMU's virtual memory layout
> and vIOMMU translation to the vhost-user process.
> 
> Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> containing the address (a device IOVA, it could just be a guest physical
> memory address in most cases) and the length. The WRITE message would
> also contain the data that the vhost-user device wishes to write. The
> READ message reply would contain the data that the device read from
> QEMU.
> 
> QEMU would implement this using QEMU's address_space_read/write() APIs.
> 
> So basically just a new vhost-user protocol message to do a memcpy(),
> but with guest addresses and vIOMMU support :).

This doesn't actually feel that hard - ignoring vIOMMU for a minute,
which I know very little about - I'd have to think about where the data
actually flows, probably over the slave fd.

> The vhost-user device will need to do bounce buffering so using these
> new messages is slower than zero-copy I/O to shared guest RAM.

I guess the theory is it's only in the weird corner cases anyway.

Dave

> Stefan
Dr. David Alan Gilbert June 10, 2021, 3:29 p.m. UTC | #7
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:

<snip>

> > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > containing the address (a device IOVA, it could just be a guest physical
> > memory address in most cases) and the length. The WRITE message would
> > also contain the data that the vhost-user device wishes to write. The
> > READ message reply would contain the data that the device read from
> > QEMU.
> > 
> > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > 
> > So basically just a new vhost-user protocol message to do a memcpy(),
> > but with guest addresses and vIOMMU support :).
> 
> This doesn't actually feel that hard - ignoring vIOMMU for a minute
> which I know very little about - I'd have to think where the data
> actually flows, probably the slave fd.
> 
> > The vhost-user device will need to do bounce buffering so using these
> > new messages is slower than zero-copy I/O to shared guest RAM.
> 
> I guess the theory is it's only in the weird corner cases anyway.

The direction I'm going in is something like the following:
the idea is that the master will have to handle the requests on a
separate thread, to avoid any problems with side effects from the memory
accesses; the slave will then have to park the requests somewhere and
handle them later.


From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Thu, 10 Jun 2021 15:34:04 +0100
Subject: [PATCH] WIP: vhost-user: DMA type interface

A DMA type interface where the slave can ask for a stream of bytes
to be read from/written to the guest's memory by the master.
The interface is asynchronous, since a request may have side effects
inside the guest.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
 hw/virtio/vhost-user.c                    |  4 +++
 subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 9ebd05e2bf..b9b5322147 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1347,6 +1347,15 @@ Master message types
   query the backend for its device status as defined in the Virtio
   specification.
 
+``VHOST_USER_MEM_DATA``
+  :id: 41
+  :equivalent ioctl: N/A
+  :slave payload: N/A
+  :master payload: ``struct VhostUserMemReply``
+
+  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
+  message.  Where the request was for the master to read data, this
+  message will be followed by the data that was read.
 
 Slave message types
 -------------------
@@ -1469,6 +1478,30 @@ Slave message types
   The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
   write to the file from RAM.
 
+``VHOST_USER_SLAVE_MEM_ACCESS``
+  :id: 9
+  :equivalent ioctl: N/A
+  :slave payload: ``struct VhostUserMemAccess``
+  :master payload: N/A
+
+  Requests that the master perform a range of memory accesses on behalf
+  of the slave that the slave can't perform itself.
+
+  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
+  field for the slave to write data into the RAM of the master.   In this
+  case the data to write follows the ``VhostUserMemAccess`` on the fd.
+  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
+  field for the slave to read data from the RAM of the master.
+
+  When the master has completed the access it replies on the main fd with
+  a ``VHOST_USER_MEM_DATA`` message.
+
+  The master is allowed to complete part of the request and reply stating
+  the amount completed, leaving it to the slave to resend further components.
+  This may happen to limit memory allocations in the master or to simplify
+  the implementation.
+
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 39a0e55cca..a3fefc4c1d 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_STATUS = 39,
+    VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_MEM_DATA = 41,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_FS_IO = 8,
+    VHOST_USER_SLAVE_MEM_ACCESS = 9,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index eee611a2f6..b5444f4f6f 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_STATUS = 39,
+    VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_MEM_DATA = 41,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_FS_IO = 8,
+    VHOST_USER_SLAVE_MEM_ACCESS = 9,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
     uint16_t queue_size;
 } VhostUserInflight;
 
+/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
+#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
+#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
+typedef struct VhostUserMemAccess {
+    uint32_t id; /* Included in the reply */
+    uint32_t flags;
+    uint64_t addr; /* In the bus address of the device */
+    uint64_t len;  /* In bytes */
+} VhostUserMemAccess;
+
+typedef struct VhostUserMemReply {
+    uint32_t id; /* From the request */
+    uint32_t flags;
+    uint32_t err; /* 0 on success */
+    uint32_t align;
+    uint64_t len;
+} VhostUserMemReply;
+
 #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
 # define VU_PACKED __attribute__((gcc_struct, packed))
 #else
@@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
         VhostUserVringArea area;
         VhostUserInflight inflight;
         VhostUserFSSlaveMsgMax fs_max;
+        VhostUserMemAccess memaccess;
+        VhostUserMemReply  memreply;
     } payload;
 
     int fds[VHOST_MEMORY_BASELINE_NREGIONS];
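
For illustration, a slave using the structures added above would fill a VhostUserMemAccess along these lines and later match the asynchronous VHOST_USER_MEM_DATA reply by id; the helper name and the send/receive plumbing are hypothetical.

/*
 * Sketch of a slave building the VhostUserMemAccess request defined in the
 * patch above, asking the master to copy 'len' bytes starting at device
 * address 'addr' back to the slave.  Sending it on the slave channel and
 * matching the VHOST_USER_MEM_DATA reply by 'id' are omitted.
 */
#include <stdint.h>
#include <string.h>

#define VHOST_USER_MEM_FLAG_TO_MASTER   (1u << 0)
#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)

typedef struct VhostUserMemAccess {
    uint32_t id;    /* Included in the reply */
    uint32_t flags;
    uint64_t addr;  /* In the bus address of the device */
    uint64_t len;   /* In bytes */
} VhostUserMemAccess;

static void prepare_mem_read(VhostUserMemAccess *req, uint32_t id,
                             uint64_t addr, uint64_t len)
{
    memset(req, 0, sizeof(*req));
    req->id    = id;                              /* echoed in the reply */
    req->flags = VHOST_USER_MEM_FLAG_FROM_MASTER; /* data flows master -> slave */
    req->addr  = addr;
    req->len   = len;
}
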
Stefan Hajnoczi June 10, 2021, 4:23 p.m. UTC | #8
On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> 
> <snip>
> 
> > > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > > containing the address (a device IOVA, it could just be a guest physical
> > > memory address in most cases) and the length. The WRITE message would
> > > also contain the data that the vhost-user device wishes to write. The
> > > READ message reply would contain the data that the device read from
> > > QEMU.
> > > 
> > > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > > 
> > > So basically just a new vhost-user protocol message to do a memcpy(),
> > > but with guest addresses and vIOMMU support :).
> > 
> > This doesn't actually feel that hard - ignoring vIOMMU for a minute
> > which I know very little about - I'd have to think where the data
> > actually flows, probably the slave fd.
> > 
> > > The vhost-user device will need to do bounce buffering so using these
> > > new messages is slower than zero-copy I/O to shared guest RAM.
> > 
> > I guess the theory is it's only in the weird corner cases anyway.

The feature is also useful if DMA isolation is desirable (i.e.
security/reliability are more important than performance). Once this new
vhost-user protocol feature is available it will be possible to run
vhost-user devices without shared memory or with limited shared memory
(e.g. just the vring).

> The direction I'm going is something like the following;
> the idea is that the master will have to handle the requests on a
> separate thread, to avoid any problems with side effects from the memory
> accesses; the slave will then have to parkt he requests somewhere and
> handle them later.
> 
> 
> From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Date: Thu, 10 Jun 2021 15:34:04 +0100
> Subject: [PATCH] WIP: vhost-user: DMA type interface
> 
> A DMA type interface where the slave can ask for a stream of bytes
> to be read/written to the guests memory by the master.
> The interface is asynchronous, since a request may have side effects
> inside the guest.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
>  hw/virtio/vhost-user.c                    |  4 +++
>  subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
>  3 files changed, 61 insertions(+)

Use of the word "RAM" in this patch is a little unclear since we need
these new messages precisely when it's not ordinary guest RAM :-). Maybe
referring to the address space is more general.

> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 9ebd05e2bf..b9b5322147 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -1347,6 +1347,15 @@ Master message types
>    query the backend for its device status as defined in the Virtio
>    specification.
>  
> +``VHOST_USER_MEM_DATA``
> +  :id: 41
> +  :equivalent ioctl: N/A
> +  :slave payload: N/A
> +  :master payload: ``struct VhostUserMemReply``
> +
> +  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
> +  message.  Where the request was for the master to read data, this
> +  message will be followed by the data that was read.

Please explain why this message is asynchronous. Implementors will need
to understand the gotchas around deadlocks, etc.

>  
>  Slave message types
>  -------------------
> @@ -1469,6 +1478,30 @@ Slave message types
>    The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
>    write to the file from RAM.
>  
> +``VHOST_USER_SLAVE_MEM_ACCESS``
> +  :id: 9
> +  :equivalent ioctl: N/A
> +  :slave payload: ``struct VhostUserMemAccess``
> +  :master payload: N/A
> +
> +  Requests that the master perform a range of memory accesses on behalf
> +  of the slave that the slave can't perform itself.
> +
> +  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
> +  field for the slave to write data into the RAM of the master.   In this
> +  case the data to write follows the ``VhostUserMemAccess`` on the fd.
> +  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
> +  field for the slave to read data from the RAM of the master.
> +
> +  When the master has completed the access it replies on the main fd with
> +  a ``VHOST_USER_MEM_DATA`` message.
> +
> +  The master is allowed to complete part of the request and reply stating
> +  the amount completed, leaving it to the slave to resend further components.
> +  This may happen to limit memory allocations in the master or to simplify
> +  the implementation.
> +
> +
>  .. _reply_ack:
>  
>  VHOST_USER_PROTOCOL_F_REPLY_ACK
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 39a0e55cca..a3fefc4c1d 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
>      VHOST_USER_ADD_MEM_REG = 37,
>      VHOST_USER_REM_MEM_REG = 38,
> +    VHOST_USER_SET_STATUS = 39,
> +    VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_MEM_DATA = 41,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
>      VHOST_USER_SLAVE_FS_MAP = 6,
>      VHOST_USER_SLAVE_FS_UNMAP = 7,
>      VHOST_USER_SLAVE_FS_IO = 8,
> +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
>      VHOST_USER_SLAVE_MAX
>  }  VhostUserSlaveRequest;
>  
> diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
> index eee611a2f6..b5444f4f6f 100644
> --- a/subprojects/libvhost-user/libvhost-user.h
> +++ b/subprojects/libvhost-user/libvhost-user.h
> @@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
>      VHOST_USER_ADD_MEM_REG = 37,
>      VHOST_USER_REM_MEM_REG = 38,
> +    VHOST_USER_SET_STATUS = 39,
> +    VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_MEM_DATA = 41,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
>      VHOST_USER_SLAVE_FS_MAP = 6,
>      VHOST_USER_SLAVE_FS_UNMAP = 7,
>      VHOST_USER_SLAVE_FS_IO = 8,
> +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
>      VHOST_USER_SLAVE_MAX
>  }  VhostUserSlaveRequest;
>  
> @@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
>      uint16_t queue_size;
>  } VhostUserInflight;
>  
> +/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
> +#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
> +#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
> +typedef struct VhostUserMemAccess {
> +    uint32_t id; /* Included in the reply */
> +    uint32_t flags;

Is VHOST_USER_MEM_FLAG_TO_MASTER | VHOST_USER_MEM_FLAG_FROM_MASTER
valid?

> +    uint64_t addr; /* In the bus address of the device */

Please check the spec for preferred terminology. "bus address" isn't
used in the spec, so there's probably another term for it.

> +    uint64_t len;  /* In bytes */
> +} VhostUserMemAccess;
> +
> +typedef struct VhostUserMemReply {
> +    uint32_t id; /* From the request */
> +    uint32_t flags;

Are any flags defined?

> +    uint32_t err; /* 0 on success */
> +    uint32_t align;

Is this a reserved padding field? "align" is confusing because it could
refer to some kind of memory alignment value. "reserved" or "padding" is
clearer.

> +    uint64_t len;
> +} VhostUserMemReply;
> +
>  #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
>  # define VU_PACKED __attribute__((gcc_struct, packed))
>  #else
> @@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
>          VhostUserFSSlaveMsgMax fs_max;
> +        VhostUserMemAccess memaccess;
> +        VhostUserMemReply  memreply;
>      } payload;
>  
>      int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> -- 
> 2.31.1
> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Dr. David Alan Gilbert June 16, 2021, 12:36 p.m. UTC | #9
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > 
> > <snip>
> > 
> > > > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > > > containing the address (a device IOVA, it could just be a guest physical
> > > > memory address in most cases) and the length. The WRITE message would
> > > > also contain the data that the vhost-user device wishes to write. The
> > > > READ message reply would contain the data that the device read from
> > > > QEMU.
> > > > 
> > > > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > > > 
> > > > So basically just a new vhost-user protocol message to do a memcpy(),
> > > > but with guest addresses and vIOMMU support :).
> > > 
> > > This doesn't actually feel that hard - ignoring vIOMMU for a minute
> > > which I know very little about - I'd have to think where the data
> > > actually flows, probably the slave fd.
> > > 
> > > > The vhost-user device will need to do bounce buffering so using these
> > > > new messages is slower than zero-copy I/O to shared guest RAM.
> > > 
> > > I guess the theory is it's only in the weird corner cases anyway.
> 
> The feature is also useful if DMA isolation is desirable (i.e.
> security/reliability are more important than performance). Once this new
> vhost-user protocol feature is available it will be possible to run
> vhost-user devices without shared memory or with limited shared memory
> (e.g. just the vring).

I don't see it ever being efficient, so that case is going to be pretty
limited.

> > The direction I'm going is something like the following;
> > the idea is that the master will have to handle the requests on a
> > separate thread, to avoid any problems with side effects from the memory
> > accesses; the slave will then have to parkt he requests somewhere and
> > handle them later.
> > 
> > 
> > From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > Date: Thu, 10 Jun 2021 15:34:04 +0100
> > Subject: [PATCH] WIP: vhost-user: DMA type interface
> > 
> > A DMA type interface where the slave can ask for a stream of bytes
> > to be read/written to the guests memory by the master.
> > The interface is asynchronous, since a request may have side effects
> > inside the guest.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
> >  hw/virtio/vhost-user.c                    |  4 +++
> >  subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
> >  3 files changed, 61 insertions(+)
> 
> Use of the word "RAM" in this patch is a little unclear since we need
> these new messages precisely when it's not ordinary guest RAM :-). Maybe
> referring to the address space is more general.

Yeh, I'll try and spot those.

> > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > index 9ebd05e2bf..b9b5322147 100644
> > --- a/docs/interop/vhost-user.rst
> > +++ b/docs/interop/vhost-user.rst
> > @@ -1347,6 +1347,15 @@ Master message types
> >    query the backend for its device status as defined in the Virtio
> >    specification.
> >  
> > +``VHOST_USER_MEM_DATA``
> > +  :id: 41
> > +  :equivalent ioctl: N/A
> > +  :slave payload: N/A
> > +  :master payload: ``struct VhostUserMemReply``
> > +
> > +  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
> > +  message.  Where the request was for the master to read data, this
> > +  message will be followed by the data that was read.
> 
> Please explain why this message is asynchronous. Implementors will need
> to understand the gotchas around deadlocks, etc.

I've added:
  Making this a separate asynchronous response message (rather than just a reply
  to the ``VHOST_USER_SLAVE_MEM_ACCESS``) makes it much easier for the master
  to deal with any side effects the access may have, and in particular to avoid
  deadlocks that could arise if an access triggers another vhost-user message.

> >  
> >  Slave message types
> >  -------------------
> > @@ -1469,6 +1478,30 @@ Slave message types
> >    The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
> >    write to the file from RAM.
> >  
> > +``VHOST_USER_SLAVE_MEM_ACCESS``
> > +  :id: 9
> > +  :equivalent ioctl: N/A
> > +  :slave payload: ``struct VhostUserMemAccess``
> > +  :master payload: N/A
> > +
> > +  Requests that the master perform a range of memory accesses on behalf
> > +  of the slave that the slave can't perform itself.
> > +
> > +  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
> > +  field for the slave to write data into the RAM of the master.   In this
> > +  case the data to write follows the ``VhostUserMemAccess`` on the fd.
> > +  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
> > +  field for the slave to read data from the RAM of the master.
> > +
> > +  When the master has completed the access it replies on the main fd with
> > +  a ``VHOST_USER_MEM_DATA`` message.
> > +
> > +  The master is allowed to complete part of the request and reply stating
> > +  the amount completed, leaving it to the slave to resend further components.
> > +  This may happen to limit memory allocations in the master or to simplify
> > +  the implementation.
> > +
> > +
> >  .. _reply_ack:
> >  
> >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 39a0e55cca..a3fefc4c1d 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> >      VHOST_USER_ADD_MEM_REG = 37,
> >      VHOST_USER_REM_MEM_REG = 38,
> > +    VHOST_USER_SET_STATUS = 39,
> > +    VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_MEM_DATA = 41,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > @@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
> >      VHOST_USER_SLAVE_FS_MAP = 6,
> >      VHOST_USER_SLAVE_FS_UNMAP = 7,
> >      VHOST_USER_SLAVE_FS_IO = 8,
> > +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
> >      VHOST_USER_SLAVE_MAX
> >  }  VhostUserSlaveRequest;
> >  
> > diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
> > index eee611a2f6..b5444f4f6f 100644
> > --- a/subprojects/libvhost-user/libvhost-user.h
> > +++ b/subprojects/libvhost-user/libvhost-user.h
> > @@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> >      VHOST_USER_ADD_MEM_REG = 37,
> >      VHOST_USER_REM_MEM_REG = 38,
> > +    VHOST_USER_SET_STATUS = 39,
> > +    VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_MEM_DATA = 41,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > @@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
> >      VHOST_USER_SLAVE_FS_MAP = 6,
> >      VHOST_USER_SLAVE_FS_UNMAP = 7,
> >      VHOST_USER_SLAVE_FS_IO = 8,
> > +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
> >      VHOST_USER_SLAVE_MAX
> >  }  VhostUserSlaveRequest;
> >  
> > @@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
> >      uint16_t queue_size;
> >  } VhostUserInflight;
> >  
> > +/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
> > +#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
> > +#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
> > +typedef struct VhostUserMemAccess {
> > +    uint32_t id; /* Included in the reply */
> > +    uint32_t flags;
> 
> Is VHOST_USER_MEM_FLAG_TO_MASTER | VHOST_USER_MEM_FLAG_FROM_MASTER
> valid?

No; I've changed the docs to state:
  One (and only one) of the ``VHOST_USER_MEM_FLAG_TO_MASTER`` and
  ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flags must be set in the ``flags`` field.

> > +    uint64_t addr; /* In the bus address of the device */
> 
> Please check the spec for preferred terminology. "bus address" isn't
> used in the spec, so there's probably another term for it.

I'm not seeing anything useful in the virtio spec; it mostly talks about
guest physical addresses; it does say 'bus addresses' in the definition
of 'VIRTIO_F_ACCESS_PLATFORM' .

> > +    uint64_t len;  /* In bytes */
> > +} VhostUserMemAccess;
> > +
> > +typedef struct VhostUserMemReply {
> > +    uint32_t id; /* From the request */
> > +    uint32_t flags;
> 
> Are any flags defined?

Currently they're a copy of the TO/FROM _MASTER flags that were in the
request, which is useful to the device because it makes it easy to know
whether there's data following on the stream.

> > +    uint32_t err; /* 0 on success */
> > +    uint32_t align;
> 
> Is this a reserved padding field? "align" is confusing because it could
> refer to some kind of memory alignment value. "reserved" or "padding" is
> clearer.

Changed to 'padding'

Dave

> > +    uint64_t len;
> > +} VhostUserMemReply;
> > +
> >  #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
> >  # define VU_PACKED __attribute__((gcc_struct, packed))
> >  #else
> > @@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
> >          VhostUserVringArea area;
> >          VhostUserInflight inflight;
> >          VhostUserFSSlaveMsgMax fs_max;
> > +        VhostUserMemAccess memaccess;
> > +        VhostUserMemReply  memreply;
> >      } payload;
> >  
> >      int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> > -- 
> > 2.31.1
> > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
Stefan Hajnoczi June 16, 2021, 3:29 p.m. UTC | #10
On Wed, Jun 16, 2021 at 01:36:10PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > +    uint64_t addr; /* In the bus address of the device */
> > 
> > Please check the spec for preferred terminology. "bus address" isn't
> > used in the spec, so there's probably another term for it.
> 
> I'm not seeing anything useful in the virtio spec; it mostly talks about
> guest physical addresses; it does say 'bus addresses' in the definition
> of 'VIRTIO_F_ACCESS_PLATFORM' .

I meant the docs/interop/vhost-user.rst spec.

Stefan
Dr. David Alan Gilbert June 16, 2021, 6:35 p.m. UTC | #11
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Jun 16, 2021 at 01:36:10PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > +    uint64_t addr; /* In the bus address of the device */
> > > 
> > > Please check the spec for preferred terminology. "bus address" isn't
> > > used in the spec, so there's probably another term for it.
> > 
> > I'm not seeing anything useful in the virtio spec; it mostly talks about
> > guest physical addresses; it does say 'bus addresses' in the definition
> > of 'VIRTIO_F_ACCESS_PLATFORM' .
> 
> I meant the docs/interop/vhost-user.rst spec.

I think they use the phrase 'guest address' so I've changed that to:

    uint64_t guest_addr; 

   Elsewhere in the vhost-user.rst it says:
   
   When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:
    
   * Guest addresses map to the vhost memory region containing that guest
     address.
    
   When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:
    
   * Guest addresses are also called I/O virtual addresses (IOVAs).  They are
     translated to user addresses via the IOTLB.
   
> Stefan

Patch

diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
index 8135d52d2a..b4cda7db9a 100644
--- a/tools/virtiofsd/buffer.c
+++ b/tools/virtiofsd/buffer.c
@@ -203,7 +203,7 @@  static ssize_t fuse_buf_fd_to_fd(const struct fuse_buf *dst, size_t dst_off,
 static ssize_t fuse_buf_copy_one(fuse_req_t req,
                                  const struct fuse_buf *dst, size_t dst_off,
                                  const struct fuse_buf *src, size_t src_off,
-                                 size_t len)
+                                 size_t len, bool dropped_cap_fsetid)
 {
     int src_is_fd = src->flags & FUSE_BUF_IS_FD;
     int dst_is_fd = dst->flags & FUSE_BUF_IS_FD;
@@ -211,7 +211,8 @@  static ssize_t fuse_buf_copy_one(fuse_req_t req,
     int dst_is_phys = src->flags & FUSE_BUF_PHYS_ADDR;
 
     if (src_is_phys && !src_is_fd && dst_is_fd) {
-        return fuse_virtio_write(req, dst, dst_off, src, src_off, len);
+        return fuse_virtio_write(req, dst, dst_off, src, src_off, len,
+                                 dropped_cap_fsetid);
     }
     assert(!src_is_phys && !dst_is_phys);
     if (!src_is_fd && !dst_is_fd) {
@@ -267,7 +268,7 @@  static int fuse_bufvec_advance(struct fuse_bufvec *bufv, size_t len)
 }
 
 ssize_t fuse_buf_copy(fuse_req_t req, struct fuse_bufvec *dstv,
-                      struct fuse_bufvec *srcv)
+                      struct fuse_bufvec *srcv, bool dropped_cap_fsetid)
 {
     size_t copied = 0, i;
 
@@ -309,7 +310,8 @@  ssize_t fuse_buf_copy(fuse_req_t req, struct fuse_bufvec *dstv,
         dst_len = dst->size - dstv->off;
         len = min_size(src_len, dst_len);
 
-        res = fuse_buf_copy_one(req, dst, dstv->off, src, srcv->off, len);
+        res = fuse_buf_copy_one(req, dst, dstv->off, src, srcv->off, len,
+                                dropped_cap_fsetid);
         if (res < 0) {
             if (!copied) {
                 return res;
diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index beed03aa93..8a75729be9 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -733,10 +733,14 @@  size_t fuse_buf_size(const struct fuse_bufvec *bufv);
  * @param req The request this copy is part of
  * @param dst destination buffer vector
  * @param src source buffer vector
+ * @param dropped_cap_fsetid Caller has dropped CAP_FSETID. If work is handed
+ *        over to a different thread/process, CAP_FSETID needs to be dropped
+ *        there as well.
  * @return actual number of bytes copied or -errno on error
  */
 ssize_t fuse_buf_copy(fuse_req_t req,
-                      struct fuse_bufvec *dst, struct fuse_bufvec *src);
+                      struct fuse_bufvec *dst, struct fuse_bufvec *src,
+                      bool dropped_cap_fsetid);
 
 /**
  * Memory buffer iterator
diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 24e580aafe..dfd7e1525c 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -2030,9 +2030,13 @@  int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
  * @param src The source (memory) buffer
  * @param src_off The GPA
  * @param len Length in bytes
+ * @param dropped_cap_fsetid Caller dropped CAP_FSETID. If it is being handed
+ *        over to different thread/process, CAP_FSETID needs to be dropped
+ *        before write.
  */
 ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
                           size_t dst_off, const struct fuse_buf *src,
-                          size_t src_off, size_t len);
+                          size_t src_off, size_t len,
+                          bool dropped_cap_fsetid);
 
 #endif /* FUSE_LOWLEVEL_H_ */
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index c6ea2bd2a1..9f3d38942a 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -1291,7 +1291,7 @@  int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
  */
 ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
                           size_t dst_off, const struct fuse_buf *src,
-                          size_t src_off, size_t len)
+                          size_t src_off, size_t len, bool dropped_cap_fsetid)
 {
     VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
                                          sizeof(VhostUserFSSlaveMsgEntry));
@@ -1311,6 +1311,9 @@  ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
     msg->entries[0].c_offset = (uintptr_t)src->mem + src_off;
     msg->entries[0].len = len;
     msg->entries[0].flags = VHOST_USER_FS_FLAG_MAP_W;
+    if (dropped_cap_fsetid) {
+        msg->flags |= VHOST_USER_FS_GENFLAG_DROP_FSETID;
+    }
 
     int64_t result = fuse_virtio_io(req->se, msg, dst->fd);
     fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index c5b8a1f5b1..b76d878509 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2301,7 +2301,7 @@  static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
         }
     }
 
-    res = fuse_buf_copy(req, &out_buf, in_buf);
+    res = fuse_buf_copy(req, &out_buf, in_buf, fi->kill_priv);
     if (res < 0) {
         fuse_reply_err(req, -res);
     } else {