
[V1,18/32] osdep: import MADV_DOEXEC

Message ID 1596122076-341293-19-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Series Live Update

Commit Message

Steven Sistare July 30, 2020, 3:14 p.m. UTC
Anonymous memory segments used by the guest are preserved across a re-exec
of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
in the Linux kernel. For the madvise patches, see:

https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Steven Sistare Aug. 17, 2020, 6:30 p.m. UTC | #1
On 7/30/2020 11:14 AM, Steve Sistare wrote:
> Anonymous memory segments used by the guest are preserved across a re-exec
> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> in the Linux kernel. For the madvise patches, see:
> 
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qemu/osdep.h | 7 +++++++
>  1 file changed, 7 insertions(+)

Hi Alex,
  The MADV_DOEXEC functionality, which is a prerequisite for the entire qemu 
live update series, is getting a chilly reception on lkml.  We could instead 
create guest memory using memfd_create and preserve the fd across exec.  However, 
the subsequent mmap(fd) will return a different VA than was used previously, 
which is a problem for memory that was registered with vfio, as the original VA 
is remembered in the kernel struct vfio_dma and used in various kernel functions, 
such as vfio_iommu_replay.
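
For concreteness, here is a minimal sketch of that memfd approach (the
GUEST_RAM_FD environment variable is just an illustrative way to hand the
fd number across exec):

    /* Before exec: back guest RAM with a memfd that stays open across exec. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void *create_guest_ram(size_t size)
    {
        int fd = memfd_create("guest-ram", 0);     /* note: no MFD_CLOEXEC */
        char buf[16];

        if (fd < 0 || ftruncate(fd, size) < 0)
            return MAP_FAILED;
        snprintf(buf, sizeof(buf), "%d", fd);
        setenv("GUEST_RAM_FD", buf, 1);            /* advertise fd to new image */
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    /* After exec: same pages, but mmap may (and usually will) pick a new VA. */
    void *reattach_guest_ram(size_t size)
    {
        int fd = atoi(getenv("GUEST_RAM_FD"));

        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }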

To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
vaddr with new_vaddr.  Flags cannot be changed.

memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
vfio on any form of shared memory (shm, dax, etc) could also be preserved across
exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
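
A sketch of what the proposed uapi might look like (the struct layout and
the ioctl number are hypothetical, for discussion only):

    /* Hypothetical addition to include/uapi/linux/vfio.h */
    struct vfio_iommu_type1_dma_remap {
        __u32 argsz;
        __u32 flags;        /* must be 0; flags of the mapping cannot change */
        __u64 new_vaddr;    /* new process VA backing the mapping */
        __u64 iova;         /* must exactly match an existing vfio_dma */
        __u64 size;         /* must exactly match an existing vfio_dma */
    };
    #define VFIO_IOMMU_REMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 22)  /* number TBD */

    /* Post-exec usage: mmap the preserved fd, then re-point the translation. */
    struct vfio_iommu_type1_dma_remap remap = {
        .argsz     = sizeof(remap),
        .new_vaddr = (__u64)(uintptr_t)new_va,
        .iova      = iova,
        .size      = size,
    };
    ioctl(container_fd, VFIO_IOMMU_REMAP_DMA, &remap);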

What do you think?

- Steve
Alex Williamson Aug. 17, 2020, 8:48 p.m. UTC | #2
On Mon, 17 Aug 2020 14:30:51 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 7/30/2020 11:14 AM, Steve Sistare wrote:
> > Anonymous memory segments used by the guest are preserved across a re-exec
> > of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > in the Linux kernel. For the madvise patches, see:
> > 
> > https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  include/qemu/osdep.h | 7 +++++++
> >  1 file changed, 7 insertions(+)  
> 
> Hi Alex,
>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> live update series, is getting a chilly reception on lkml.  We could instead 
> create guest memory using memfd_create and preserve the fd across exec.  However, 
> the subsequent mmap(fd) will return a different VA than was used previously, 
> which  is a problem for memory that was registered with vfio, as the original VA 
> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> such as vfio_iommu_replay.
> 
> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> vaddr with new_vaddr.  Flags cannot be changed.
> 
> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> 
> What do you think

Your new REMAP ioctl would have parameters identical to the MAP_DMA
ioctl, so I think we should just use one of the flag bits on the
existing MAP_DMA ioctl for this variant.
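
Something like the following, where the new bit is of course hypothetical
(bits 0 and 1 are already taken by the READ/WRITE flags):

    #define VFIO_DMA_MAP_FLAG_REMAP (1 << 2)   /* hypothetical new flag bit */

    /* Same struct vfio_iommu_type1_dma_map as a normal map, but the flag
     * asks the kernel to update vaddr on the existing (iova, size) entry. */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_REMAP,
        .vaddr = (__u64)(uintptr_t)new_va,
        .iova  = iova,      /* exact match with the original mapping */
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);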

Reading through the discussion on the kernel side there seems to be
some confusion around why vfio needs the vaddr beyond the user call to
MAP_DMA though.  Originally this was used to test for virtually
contiguous mappings for merging and splitting purposes.  This is
defunct in the v2 interface, however the vaddr is now used largely for
mdev devices.  If an mdev device is not backed by an IOMMU device and
does not share a container with an IOMMU device, then a user MAP_DMA
ioctl essentially just registers the translation within the vfio
container.  The mdev vendor driver can then later either request pages
to be pinned for device DMA or can perform copy_to/from_user() to
simulate DMA via the CPU.

Therefore I don't see that there's a simple re-architecture of the vfio
IOMMU backend that could drop vaddr use.  I'm a bit concerned this new
remap proposal also raises the question of how do we prevent userspace
remapping vaddrs racing with asynchronous kernel use of the previous
vaddrs.  Are we expecting guest drivers/agents to quiesce the device,
or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
The vfio migration interface we've developed does have a mechanism to
stop a device, would we need to use this here?  If we do have a
mechanism to quiesce the device, is the only reason we're not unmapping
everything and remapping it into the new address space the latency in
performing that operation?  Thanks,

Alex
Steven Sistare Aug. 17, 2020, 9:20 p.m. UTC | #3
On 8/17/2020 4:48 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 14:30:51 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>> in the Linux kernel. For the madvise patches, see:
>>>
>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  include/qemu/osdep.h | 7 +++++++
>>>  1 file changed, 7 insertions(+)  
>>
>> Hi Alex,
>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>> live update series, is getting a chilly reception on lkml.  We could instead 
>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>> the subsequent mmap(fd) will return a different VA than was used previously, 
>> which  is a problem for memory that was registered with vfio, as the original VA 
>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>> such as vfio_iommu_replay.
>>
>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>> vaddr with new_vaddr.  Flags cannot be changed.
>>
>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>
>> What do you think
> 
> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> ioctl, so I think we should just use one of the flag bits on the
> existing MAP_DMA ioctl for this variant.

Sounds good.

> Reading through the discussion on the kernel side there seems to be
> some confusion around why vfio needs the vaddr beyond the user call to
> MAP_DMA though.  Originally this was used to test for virtually
> contiguous mappings for merging and splitting purposes.  This is
> defunct in the v2 interface, however the vaddr is now used largely for
> mdev devices.  If an mdev device is not backed by an IOMMU device and
> does not share a container with an IOMMU device, then a user MAP_DMA
> ioctl essentially just registers the translation within the vfio
> container.  The mdev vendor driver can then later either request pages
> to be pinned for device DMA or can perform copy_to/from_user() to
> simulate DMA via the CPU.
> 
> Therefore I don't see that there's a simple re-architecture of the vfio
> IOMMU backend that could drop vaddr use.  

Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
same conclusion.

> I'm a bit concerned this new
> remap proposal also raises the question of how do we prevent userspace
> remapping vaddrs racing with asynchronous kernel use of the previous
> vaddrs.  

Agreed.  After a quick glance at the code, holding iommu->lock during 
remap might be sufficient, but it needs more study.

> Are we expecting guest drivers/agents to quiesce the device,
> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?

No.  We want to support any guest, and the guest is not aware that qemu
live update is occurring.

> The vfio migration interface we've developed does have a mechanism to
> stop a device, would we need to use this here?  If we do have a
> mechanism to quiesce the device, is the only reason we're not unmapping
> everything and remapping it into the new address space the latency in
> performing that operation?  Thanks,

Same answer - we don't require that the guest has vfio migration support.

- Steve
Alex Williamson Aug. 17, 2020, 9:44 p.m. UTC | #4
On Mon, 17 Aug 2020 17:20:57 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/17/2020 4:48 PM, Alex Williamson wrote:
> > On Mon, 17 Aug 2020 14:30:51 -0400
> > Steven Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> On 7/30/2020 11:14 AM, Steve Sistare wrote:  
> >>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>> in the Linux kernel. For the madvise patches, see:
> >>>
> >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>>  include/qemu/osdep.h | 7 +++++++
> >>>  1 file changed, 7 insertions(+)    
> >>
> >> Hi Alex,
> >>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >> live update series, is getting a chilly reception on lkml.  We could instead 
> >> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >> the subsequent mmap(fd) will return a different VA than was used previously, 
> >> which  is a problem for memory that was registered with vfio, as the original VA 
> >> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >> such as vfio_iommu_replay.
> >>
> >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >> vaddr with new_vaddr.  Flags cannot be changed.
> >>
> >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>
> >> What do you think  
> > 
> > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > ioctl, so I think we should just use one of the flag bits on the
> > existing MAP_DMA ioctl for this variant.  
> 
> Sounds good.
> 
> > Reading through the discussion on the kernel side there seems to be
> > some confusion around why vfio needs the vaddr beyond the user call to
> > MAP_DMA though.  Originally this was used to test for virtually
> > contiguous mappings for merging and splitting purposes.  This is
> > defunct in the v2 interface, however the vaddr is now used largely for
> > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > does not share a container with an IOMMU device, then a user MAP_DMA
> > ioctl essentially just registers the translation within the vfio
> > container.  The mdev vendor driver can then later either request pages
> > to be pinned for device DMA or can perform copy_to/from_user() to
> > simulate DMA via the CPU.
> > 
> > Therefore I don't see that there's a simple re-architecture of the vfio
> > IOMMU backend that could drop vaddr use.    
> 
> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> same conclusion.
> 
> > I'm a bit concerned this new
> > remap proposal also raises the question of how do we prevent userspace
> > remapping vaddrs racing with asynchronous kernel use of the previous
> > vaddrs.    
> 
> Agreed.  After a quick glance at the code, holding iommu->lock during 
> remap might be sufficient, but it needs more study.

Unless you're suggesting an extended hold of the lock across the entire
re-exec of QEMU, that's only going to prevent a race between a remap
and a vendor driver pin or access; the time between the previous vaddr
becoming invalid and the remap is unprotected.

> > Are we expecting guest drivers/agents to quiesce the device,
> > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?  
> 
> No.  We want to support any guest, and the guest is not aware that qemu
> live update is occurring.
> 
> > The vfio migration interface we've developed does have a mechanism to
> > stop a device, would we need to use this here?  If we do have a
> > mechanism to quiesce the device, is the only reason we're not unmapping
> > everything and remapping it into the new address space the latency in
> > performing that operation?  Thanks,  
> 
> Same answer - we don't require that the guest has vfio migration support.

QEMU toggling the runstate of the device via the vfio migration
interface could be done transparently to the guest, but if your
intention is to support any device (where none currently support the
migration interface) perhaps it's a moot point.  It seems like this
scheme only works with IOMMU backed devices where the device can
continue to operate against pinned pages, anything that might need to
dynamically pin pages against the process vaddr as it's running async
to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Alex
Alex Williamson Aug. 18, 2020, 2:42 a.m. UTC | #5
On Mon, 17 Aug 2020 15:44:03 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 17 Aug 2020 17:20:57 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 8/17/2020 4:48 PM, Alex Williamson wrote:  
> > > On Mon, 17 Aug 2020 14:30:51 -0400
> > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
> > >>> Anonymous memory segments used by the guest are preserved across a re-exec
> > >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > >>> in the Linux kernel. For the madvise patches, see:
> > >>>
> > >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>
> > >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >>> ---
> > >>>  include/qemu/osdep.h | 7 +++++++
> > >>>  1 file changed, 7 insertions(+)      
> > >>
> > >> Hi Alex,
> > >>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> > >> live update series, is getting a chilly reception on lkml.  We could instead 
> > >> create guest memory using memfd_create and preserve the fd across exec.  However, 
> > >> the subsequent mmap(fd) will return a different VA than was used previously, 
> > >> which  is a problem for memory that was registered with vfio, as the original VA 
> > >> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> > >> such as vfio_iommu_replay.
> > >>
> > >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> > >> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> > >> vaddr with new_vaddr.  Flags cannot be changed.
> > >>
> > >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> > >> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> > >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> > >>
> > >> What do you think    
> > > 
> > > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > > ioctl, so I think we should just use one of the flag bits on the
> > > existing MAP_DMA ioctl for this variant.    
> > 
> > Sounds good.
> >   
> > > Reading through the discussion on the kernel side there seems to be
> > > some confusion around why vfio needs the vaddr beyond the user call to
> > > MAP_DMA though.  Originally this was used to test for virtually
> > > contiguous mappings for merging and splitting purposes.  This is
> > > defunct in the v2 interface, however the vaddr is now used largely for
> > > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > > does not share a container with an IOMMU device, then a user MAP_DMA
> > > ioctl essentially just registers the translation within the vfio
> > > container.  The mdev vendor driver can then later either request pages
> > > to be pinned for device DMA or can perform copy_to/from_user() to
> > > simulate DMA via the CPU.
> > > 
> > > Therefore I don't see that there's a simple re-architecture of the vfio
> > > IOMMU backend that could drop vaddr use.      
> > 
> > Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> > same conclusion.
> >   
> > > I'm a bit concerned this new
> > > remap proposal also raises the question of how do we prevent userspace
> > > remapping vaddrs racing with asynchronous kernel use of the previous
> > > vaddrs.      
> > 
> > Agreed.  After a quick glance at the code, holding iommu->lock during 
> > remap might be sufficient, but it needs more study.  
> 
> Unless you're suggesting an extended hold of the lock across the entire
> re-exec of QEMU, that's only going to prevent a race between a remap
> and a vendor driver pin or access, the time between the previous vaddr
> becoming invalid and the remap is unprotected.
> 
> > > Are we expecting guest drivers/agents to quiesce the device,
> > > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
> > 
> > No.  We want to support any guest, and the guest is not aware that qemu
> > live update is occurring.
> >   
> > > The vfio migration interface we've developed does have a mechanism to
> > > stop a device, would we need to use this here?  If we do have a
> > > mechanism to quiesce the device, is the only reason we're not unmapping
> > > everything and remapping it into the new address space the latency in
> > > performing that operation?  Thanks,    
> > 
> > Same answer - we don't require that the guest has vfio migration support.  
> 
> QEMU toggling the runstate of the device via the vfio migration
> interface could be done transparently to the guest, but if your
> intention is to support any device (where none currently support the
> migration interface) perhaps it's a moot point.  It seems like this
> scheme only works with IOMMU backed devices where the device can
> continue to operate against pinned pages, anything that might need to
> dynamically pin pages against the process vaddr as it's running async
> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Hmm, even if a device is running against pinned memory, how do we
handle device interrupts that occur during QEMU's downtime?  I see that
we reconfigure interrupts, but does QEMU need to drain the eventfd and
manually inject those missed interrupts or will setting up the irqfds
trigger a poll?  Thanks,

Alex
Steven Sistare Aug. 19, 2020, 9:52 p.m. UTC | #6
On 8/17/2020 10:42 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 15:44:03 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Mon, 17 Aug 2020 17:20:57 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:  
>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>     
>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>  1 file changed, 7 insertions(+)      
>>>>>
>>>>> Hi Alex,
>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>> such as vfio_iommu_replay.
>>>>>
>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>
>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>
>>>>> What do you think    
>>>>
>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>> ioctl, so I think we should just use one of the flag bits on the
>>>> existing MAP_DMA ioctl for this variant.    
>>>
>>> Sounds good.
>>>   
>>>> Reading through the discussion on the kernel side there seems to be
>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>> contiguous mappings for merging and splitting purposes.  This is
>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>> ioctl essentially just registers the translation within the vfio
>>>> container.  The mdev vendor driver can then later either request pages
>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>> simulate DMA via the CPU.
>>>>
>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>> IOMMU backend that could drop vaddr use.      
>>>
>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>> same conclusion.
>>>   
>>>> I'm a bit concerned this new
>>>> remap proposal also raises the question of how do we prevent userspace
>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>> vaddrs.      
>>>
>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>> remap might be sufficient, but it needs more study.  
>>
>> Unless you're suggesting an extended hold of the lock across the entire
>> re-exec of QEMU, that's only going to prevent a race between a remap
>> and a vendor driver pin or access, the time between the previous vaddr
>> becoming invalid and the remap is unprotected.

OK.  What if we exclude mediated devices?  It appears they are the only
ones where the kernel may asynchronously use the vaddr, via call chains ending in 
vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.

The other functions that use dma->vaddr are
    vfio_dma_do_map 
    vfio_pin_map_dma 
    vfio_iommu_replay 
    vfio_pin_pages_remote
and they are all initiated via userland ioctl (if I have traced all the code 
paths correctly).  Thus iommu->lock would protect them.

We would block live update in qemu if the config includes a mediated device.

VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.

>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
>>>
>>> No.  We want to support any guest, and the guest is not aware that qemu
>>> live update is occurring.
>>>   
>>>> The vfio migration interface we've developed does have a mechanism to
>>>> stop a device, would we need to use this here?  If we do have a
>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>> everything and remapping it into the new address space the latency in
>>>> performing that operation?  Thanks,    
>>>
>>> Same answer - we don't require that the guest has vfio migration support.  
>>
>> QEMU toggling the runstate of the device via the vfio migration
>> interface could be done transparently to the guest, but if your
>> intention is to support any device (where none currently support the
>> migration interface) perhaps it's a moot point.  

That sounds useful when devices support it.  Can you give me some function names
or references so I can study this qemu-based "vfio migration interface"?

>> It seems like this
>> scheme only works with IOMMU backed devices where the device can
>> continue to operate against pinned pages, anything that might need to
>> dynamically pin pages against the process vaddr as it's running async
>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Yes, true of this remap proposal.

I wanted to unconditionally support all devices, which is why I think that
MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
lkml discussion.

> Hmm, even if a device is running against pinned memory, how do we
> handle device interrupts that occur during QEMU's downtime?  I see that
> we reconfigure interrupts, but does QEMU need to drain the eventfd and
> manually inject those missed interrupts or will setting up the irqfds
> trigger a poll?  Thanks,

My existing code is apparently deficient in this area; I close the pre-exec eventfd,
and post-exec I create a new eventfd and attach it to the vfio device.  Jason and I
are discussing alternatives in this thread of the series:
  https://lore.kernel.org/qemu-devel/0da862c8-74bc-bf06-a436-4ebfcb9dd8d4@oracle.com/

I am hoping that unconditionally injecting a (potentially spurious) interrupt on a 
new eventfd after exec will solve the problem.
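
To make that concrete, the post-exec reattach I have in mind looks roughly
like this (whether the final spurious kick is safe is exactly the open
question):

    /* Sketch: recreate the MSI-X trigger eventfd after exec and kick it
     * once, in case the device interrupted while we were disconnected. */
    #include <linux/vfio.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    int reattach_irq(int device_fd, int vector)
    {
        char buf[sizeof(struct vfio_irq_set) + sizeof(int32_t)];
        struct vfio_irq_set *set = (struct vfio_irq_set *)buf;
        int32_t efd = eventfd(0, 0);
        uint64_t one = 1;

        set->argsz = sizeof(buf);
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = vector;
        set->count = 1;
        memcpy(set->data, &efd, sizeof(int32_t));
        if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set) < 0)
            return -1;

        write(efd, &one, sizeof(one));   /* potentially spurious injection */
        return efd;
    }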

BTW, thanks for discussing these issues.  I appreciate it.

- Steve
Alex Williamson Aug. 24, 2020, 10:30 p.m. UTC | #7
On Wed, 19 Aug 2020 17:52:26 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/17/2020 10:42 PM, Alex Williamson wrote:
> > On Mon, 17 Aug 2020 15:44:03 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Mon, 17 Aug 2020 17:20:57 -0400
> >> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>  
> >>> On 8/17/2020 4:48 PM, Alex Williamson wrote:    
> >>>> On Mon, 17 Aug 2020 14:30:51 -0400
> >>>> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>>>       
> >>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:      
> >>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>>>>> in the Linux kernel. For the madvise patches, see:
> >>>>>>
> >>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>>>
> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>> ---
> >>>>>>  include/qemu/osdep.h | 7 +++++++
> >>>>>>  1 file changed, 7 insertions(+)        
> >>>>>
> >>>>> Hi Alex,
> >>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >>>>> live update series, is getting a chilly reception on lkml.  We could instead 
> >>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
> >>>>> which  is a problem for memory that was registered with vfio, as the original VA 
> >>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >>>>> such as vfio_iommu_replay.
> >>>>>
> >>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >>>>> vaddr with new_vaddr.  Flags cannot be changed.
> >>>>>
> >>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>>>>
> >>>>> What do you think      
> >>>>
> >>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> >>>> ioctl, so I think we should just use one of the flag bits on the
> >>>> existing MAP_DMA ioctl for this variant.      
> >>>
> >>> Sounds good.
> >>>     
> >>>> Reading through the discussion on the kernel side there seems to be
> >>>> some confusion around why vfio needs the vaddr beyond the user call to
> >>>> MAP_DMA though.  Originally this was used to test for virtually
> >>>> contiguous mappings for merging and splitting purposes.  This is
> >>>> defunct in the v2 interface, however the vaddr is now used largely for
> >>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
> >>>> does not share a container with an IOMMU device, then a user MAP_DMA
> >>>> ioctl essentially just registers the translation within the vfio
> >>>> container.  The mdev vendor driver can then later either request pages
> >>>> to be pinned for device DMA or can perform copy_to/from_user() to
> >>>> simulate DMA via the CPU.
> >>>>
> >>>> Therefore I don't see that there's a simple re-architecture of the vfio
> >>>> IOMMU backend that could drop vaddr use.        
> >>>
> >>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> >>> same conclusion.
> >>>     
> >>>> I'm a bit concerned this new
> >>>> remap proposal also raises the question of how do we prevent userspace
> >>>> remapping vaddrs racing with asynchronous kernel use of the previous
> >>>> vaddrs.        
> >>>
> >>> Agreed.  After a quick glance at the code, holding iommu->lock during 
> >>> remap might be sufficient, but it needs more study.    
> >>
> >> Unless you're suggesting an extended hold of the lock across the entire
> >> re-exec of QEMU, that's only going to prevent a race between a remap
> >> and a vendor driver pin or access, the time between the previous vaddr
> >> becoming invalid and the remap is unprotected.  
> 
> OK.  What if we exclude mediated devices?  Its appears they are the only
> ones where the kernel may async'ly use the vaddr, via call chains ending in 
> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
> 
> The other functions that use dma->vaddr are
>     vfio_dma_do_map 
>     vfio_pin_map_dma 
>     vfio_iommu_replay 
>     vfio_pin_pages_remote
> and they are all initiated via userland ioctl (if I have traced all the code 
> paths correctly).  Thus iommu->lock would protect them.
> 
> We would block live update in qemu if the config includes a mediated device.
> 
> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.

That's not a solution I'd really be in favor of.  We're eliminating an
entire class of devices because they _might_ make use of these
interfaces, but anyone can add a vfio bus driver, even exposing the
same device API, and maybe make use of some of these interfaces in that
driver.  Maybe we'd even have reason to do it in vfio-pci if we had
reason to virtualize some aspect of a device.  I think we're setting
ourselves up for a very complicated support scenario if we just
arbitrarily decide to deny drivers using certain interfaces.


> >>>> Are we expecting guest drivers/agents to quiesce the device,
> >>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?      
> >>>
> >>> No.  We want to support any guest, and the guest is not aware that qemu
> >>> live update is occurring.
> >>>     
> >>>> The vfio migration interface we've developed does have a mechanism to
> >>>> stop a device, would we need to use this here?  If we do have a
> >>>> mechanism to quiesce the device, is the only reason we're not unmapping
> >>>> everything and remapping it into the new address space the latency in
> >>>> performing that operation?  Thanks,      
> >>>
> >>> Same answer - we don't require that the guest has vfio migration support.    
> >>
> >> QEMU toggling the runstate of the device via the vfio migration
> >> interface could be done transparently to the guest, but if your
> >> intention is to support any device (where none currently support the
> >> migration interface) perhaps it's a moot point.    
> 
> That sounds useful when devices support.  Can you give me some function names
> or references so I can study this qemu-based "vfio migration interface".

The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
the QEMU support or implementation in an mdev vendor driver.
Essentially migration exposes a new region of the device which would be
implemented by the vendor driver.  A register within that region
manipulates the device state, so a device could be stopped by clearing
the 'run' bit in that register.
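
Roughly, stopping a device would then look like this (assuming the offset
of the migration region has already been located via the region info
capability chain):

    #include <linux/vfio.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Clear the 'run' bit in device_state to stop the device. */
    int vfio_stop_device(int device_fd, off_t mig_region_offset)
    {
        off_t pos = mig_region_offset +
                    offsetof(struct vfio_device_migration_info, device_state);
        __u32 state;

        if (pread(device_fd, &state, sizeof(state), pos) != sizeof(state))
            return -1;
        state &= ~VFIO_DEVICE_STATE_RUNNING;
        return pwrite(device_fd, &state, sizeof(state), pos) == sizeof(state)
               ? 0 : -1;
    }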


> >> It seems like this
> >> scheme only works with IOMMU backed devices where the device can
> >> continue to operate against pinned pages, anything that might need to
> >> dynamically pin pages against the process vaddr as it's running async
> >> to the QEMU re-exec needs to be blocked or stopped.  Thanks,  
> 
> Yes, true of this remap proposal.
> 
> I wanted to unconditionally support all devices, which is why I think that
> 
> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
> 
> lkml discussion.
> 
> > Hmm, even if a device is running against pinned memory, how do we
> > handle device interrupts that occur during QEMU's downtime?  I see that
> > we reconfigure interrupts, but does QEMU need to drain the eventfd and
> > manually inject those missed interrupts or will setting up the irqfds
> > trigger a poll?  Thanks,  
> 
> My existing code is apparently deficient in this area; I close the pre-exec eventfd,
> and post exec create a new eventfd and attach it to the vfio device.  Jason and I
> are discussing alternatives in this thread of the series:
>   https://lore.kernel.org/qemu-devel/0da862c8-74bc-bf06-a436-4ebfcb9dd8d4@oracle.com/
> 
> I am hoping that unconditionally injecting a (potentially spurious) interrupt on a 
> new eventfd after exec will solve the problem.

That's sloppy, but maybe sufficient, but I agree with Jason's concern
about treading carefully around anything that would cause the interrupt
state of the device to be modified, which certainly might not be
transparent to the device.  Then there's the issue of what would happen
if a fatal AER event occurred while the eventfd is disconnected.  We
wouldn't want to generate a spurious event on that channel.  The device
request eventfd will retry from the kernel side, so simply reconnecting
it should work.  Each type of virtual interrupt will need to have a
plan for what to do around this disconnected period, and like the error
reporting one, it might not be safe to lose the interrupt nor inject a
spurious interrupt.

Regarding emulated state in QEMU, yes QEMU does write all config to
the kernel, where some things might be emulated in the kernel, but
there are also things emulated in QEMU.  See for example
vdev->emulated_config_bits.  Writing everything to the kernel is just a
simplification because we know the kernel will drop writes that it
doesn't allow.  It's essentially a catch-all.  Thanks,

Alex
Steven Sistare Oct. 8, 2020, 4:32 p.m. UTC | #8
On 8/24/2020 6:30 PM, Alex Williamson wrote:
> On Wed, 19 Aug 2020 17:52:26 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>> On 8/17/2020 10:42 PM, Alex Williamson wrote:
>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:    
>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>       
>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:      
>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>  1 file changed, 7 insertions(+)        
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>> such as vfio_iommu_replay.
>>>>>>>
>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>
>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>
>>>>>>> What do you think      
>>>>>>
>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>> existing MAP_DMA ioctl for this variant.      
>>>>>
>>>>> Sounds good.
>>>>>     
>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>> simulate DMA via the CPU.
>>>>>>
>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>> IOMMU backend that could drop vaddr use.        
>>>>>
>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>> same conclusion.
>>>>>     
>>>>>> I'm a bit concerned this new
>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>> vaddrs.        
>>>>>
>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>> remap might be sufficient, but it needs more study.    
>>>>
>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>> becoming invalid and the remap is unprotected.  
>>
>> OK.  What if we exclude mediated devices?  Its appears they are the only
>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>
>> The other functions that use dma->vaddr are
>>     vfio_dma_do_map 
>>     vfio_pin_map_dma 
>>     vfio_iommu_replay 
>>     vfio_pin_pages_remote
>> and they are all initiated via userland ioctl (if I have traced all the code 
>> paths correctly).  Thus iommu->lock would protect them.
>>
>> We would block live update in qemu if the config includes a mediated device.
>>
>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.
> 
> That's not a solution I'd really be in favor of.  We're eliminating an
> entire class of devices because they _might_ make use of these
> interfaces, but anyone can add a vfio bus driver, even exposing the
> same device API, and maybe make use of some of these interfaces in that
> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
> reason to virtualize some aspect of a device.  I think we're setting
> ourselves up for a very complicated support scenario if we just
> arbitrarily decide to deny drivers using certain interfaces.
> 
>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?      
>>>>>
>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>> live update is occurring.
>>>>>     
>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>> everything and remapping it into the new address space the latency in
>>>>>> performing that operation?  Thanks,      
>>>>>
>>>>> Same answer - we don't require that the guest has vfio migration support.    
>>>>
>>>> QEMU toggling the runstate of the device via the vfio migration
>>>> interface could be done transparently to the guest, but if your
>>>> intention is to support any device (where none currently support the
>>>> migration interface) perhaps it's a moot point.    
>>
>> That sounds useful when devices support.  Can you give me some function names
>> or references so I can study this qemu-based "vfio migration interface".
> 
> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
> the QEMU support or implementation in an mdev vendor driver.
> Essentially migration exposes a new region of the device which would be
> implemented by the vendor driver.  A register within that region
> manipulates the device state, so a device could be stopped by clearing
> the 'run' bit in that register.
> 
>>>> It seems like this
>>>> scheme only works with IOMMU backed devices where the device can
>>>> continue to operate against pinned pages, anything that might need to
>>>> dynamically pin pages against the process vaddr as it's running async
>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,  
>>
>> Yes, true of this remap proposal.
>>
>> I wanted to unconditionally support all devices, which is why I think that
>>
>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>> lkml discussion.

Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
from mediated device drivers.

Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
VFIO_IOMMU_MAP_DMA.

VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Discard vaddr on the existing dma region defined by (iova, size), but keep the
  struct vfio_dma.  Subsequent translation requests are blocked.
  The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
  vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
  Called before exec.

VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Remap the region (iova, size) to vaddr, and resume translation requests.
  Called after exec.
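
The userspace sequence around exec would then be roughly as follows (flag
name and value hypothetical until the kernel side exists):

    #define VFIO_DMA_MAP_FLAG_REMAP (1 << 2)   /* proposed, not yet in uapi */

    /* Before exec: invalidate vaddr but keep the vfio_dma; translation
     * requests from mdev drivers block (or retry) until the remap below. */
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_MAP_FLAG_REMAP,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    /* ... exec(); guest RAM fd survives; mmap it at new_va ... */

    /* After exec: re-point vaddr and resume translation requests. */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_REMAP,
        .vaddr = (__u64)(uintptr_t)new_va,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);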

Unfortunately, remap as defined above has an undesirable side effect.  The mdev
driver may use kernel worker threads which serve requests from multiple clients
(e.g., the i915/gvt workload_thread).  A process that fails to call MAP_DMA with
REMAP, or is tardy doing so, will delay other processes that are stuck waiting in
vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
instance?), or in case you see a way to salvage it.

Here is a more robust implementation.  It only works for dma regions backed by
a file, such as memfd or shm.

VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Find the file and offset for iova, and save the struct file pointer in
  struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
  if file* is set, then call pagecache_get_page() to get the pfn, instead of
  get_user_pages.

VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Remap the region (iova, size) to vaddr and drop the file reference.
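
The kernel-side lookup would be something like this rough sketch (the
dma->file and dma->file_offset fields are the new state saved by the
UNMAP+REMAP call above):

    /* Translate iova to a page via the backing file while vaddr is invalid. */
    static struct page *vfio_dma_get_page(struct vfio_dma *dma,
                                          dma_addr_t iova)
    {
        pgoff_t pgoff = (dma->file_offset + (iova - dma->iova)) >> PAGE_SHIFT;

        /* stands in for get_user_pages while no valid vaddr exists */
        return pagecache_get_page(dma->file->f_mapping, pgoff,
                                  FGP_ACCESSED | FGP_CREAT, GFP_KERNEL);
    }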

This raises the question of whether we can always use pagecache_get_page, and
eliminate the dependency on vaddr.  The translation performance could be
different, though.

I have not implemented this yet.  Any thoughts before I do?

- Steve
Alex Williamson Oct. 15, 2020, 8:36 p.m. UTC | #9
On Thu, 8 Oct 2020 12:32:35 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/24/2020 6:30 PM, Alex Williamson wrote:
> > On Wed, 19 Aug 2020 17:52:26 -0400
> > Steven Sistare <steven.sistare@oracle.com> wrote:  
> >> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
> >>> On Mon, 17 Aug 2020 15:44:03 -0600
> >>> Alex Williamson <alex.williamson@redhat.com> wrote:  
> >>>> On Mon, 17 Aug 2020 17:20:57 -0400
> >>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
> >>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
> >>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
> >>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>>>>>         
> >>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
> >>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>>>>>>> in the Linux kernel. For the madvise patches, see:
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>>>>>
> >>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>>>> ---
> >>>>>>>>  include/qemu/osdep.h | 7 +++++++
> >>>>>>>>  1 file changed, 7 insertions(+)          
> >>>>>>>
> >>>>>>> Hi Alex,
> >>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
> >>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
> >>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
> >>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >>>>>>> such as vfio_iommu_replay.
> >>>>>>>
> >>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
> >>>>>>>
> >>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>>>>>>
> >>>>>>> What do you think        
> >>>>>>
> >>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> >>>>>> ioctl, so I think we should just use one of the flag bits on the
> >>>>>> existing MAP_DMA ioctl for this variant.        
> >>>>>
> >>>>> Sounds good.
> >>>>>       
> >>>>>> Reading through the discussion on the kernel side there seems to be
> >>>>>> some confusion around why vfio needs the vaddr beyond the user call to
> >>>>>> MAP_DMA though.  Originally this was used to test for virtually
> >>>>>> contiguous mappings for merging and splitting purposes.  This is
> >>>>>> defunct in the v2 interface, however the vaddr is now used largely for
> >>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
> >>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
> >>>>>> ioctl essentially just registers the translation within the vfio
> >>>>>> container.  The mdev vendor driver can then later either request pages
> >>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
> >>>>>> simulate DMA via the CPU.
> >>>>>>
> >>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
> >>>>>> IOMMU backend that could drop vaddr use.          
> >>>>>
> >>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> >>>>> same conclusion.
> >>>>>       
> >>>>>> I'm a bit concerned this new
> >>>>>> remap proposal also raises the question of how do we prevent userspace
> >>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
> >>>>>> vaddrs.          
> >>>>>
> >>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
> >>>>> remap might be sufficient, but it needs more study.      
> >>>>
> >>>> Unless you're suggesting an extended hold of the lock across the entire
> >>>> re-exec of QEMU, that's only going to prevent a race between a remap
> >>>> and a vendor driver pin or access, the time between the previous vaddr
> >>>> becoming invalid and the remap is unprotected.    
> >>
> >> OK.  What if we exclude mediated devices?  Its appears they are the only
> >> ones where the kernel may async'ly use the vaddr, via call chains ending in 
> >> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
> >>
> >> The other functions that use dma->vaddr are
> >>     vfio_dma_do_map 
> >>     vfio_pin_map_dma 
> >>     vfio_iommu_replay 
> >>     vfio_pin_pages_remote
> >> and they are all initiated via userland ioctl (if I have traced all the code 
> >> paths correctly).  Thus iommu->lock would protect them.
> >>
> >> We would block live update in qemu if the config includes a mediated device.
> >>
> >> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
> > 
> > That's not a solution I'd really be in favor of.  We're eliminating an
> > entire class of devices because they _might_ make use of these
> > interfaces, but anyone can add a vfio bus driver, even exposing the
> > same device API, and maybe make use of some of these interfaces in that
> > driver.  Maybe we'd even have reason to do it in vfio-pci if we had
> > reason to virtualize some aspect of a device.  I think we're setting
> > ourselves up for a very complicated support scenario if we just
> > arbitrarily decide to deny drivers using certain interfaces.
> >   
> >>>>>> Are we expecting guest drivers/agents to quiesce the device,
> >>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
> >>>>>
> >>>>> No.  We want to support any guest, and the guest is not aware that qemu
> >>>>> live update is occurring.
> >>>>>       
> >>>>>> The vfio migration interface we've developed does have a mechanism to
> >>>>>> stop a device, would we need to use this here?  If we do have a
> >>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
> >>>>>> everything and remapping it into the new address space the latency in
> >>>>>> performing that operation?  Thanks,        
> >>>>>
> >>>>> Same answer - we don't require that the guest has vfio migration support.      
> >>>>
> >>>> QEMU toggling the runstate of the device via the vfio migration
> >>>> interface could be done transparently to the guest, but if your
> >>>> intention is to support any device (where none currently support the
> >>>> migration interface) perhaps it's a moot point.      
> >>
> >> That sounds useful when devices support.  Can you give me some function names
> >> or references so I can study this qemu-based "vfio migration interface".  
> > 
> > The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
> > the QEMU support or implementation in an mdev vendor driver.
> > Essentially migration exposes a new region of the device which would be
> > implemented by the vendor driver.  A register within that region
> > manipulates the device state, so a device could be stopped by clearing
> > the 'run' bit in that register.
> >   
> >>>> It seems like this
> >>>> scheme only works with IOMMU backed devices where the device can
> >>>> continue to operate against pinned pages, anything that might need to
> >>>> dynamically pin pages against the process vaddr as it's running async
> >>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
> >>
> >> Yes, true of this remap proposal.
> >>
> >> I wanted to unconditionally support all devices, which is why I think that
> >>
> >> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
> >> lkml discussion.  
> 
> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
> from mediated device drivers.
> 
> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
> VFIO_IOMMU_MAP_DMA.
> 
> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>   struct vfio_dma.  Subsequent translation requests are blocked.
>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>   Called before exec.
> 
> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Remap the region (iova, size) to vaddr, and resume translation requests.
>   Called after exec.
> 
> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
> driver may use kernel worker threads which serve requests from multiple clients
> (eg i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
> or is tardy doing so, will delay other processes who are stuck waiting in
> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
> instance?), or in case you see a way to salvage it.

Right, that's my first thought when I hear that the pin and dma_rw paths
are blocked as well: we cannot rely on userspace to unblock anything.
A malicious user may hold out just to see how long until the host
becomes unusable.  Userspace determines how many groups share a
vfio_iommu.

> Here is a more robust implementation.  It only works for dma regions backed by
> a file, such as memfd or shm.
> 
> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Find the file and offset for iova, and save the struct file pointer in
>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>   get_user_pages.
> 
> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Remap the region (iova, size) to vaddr and drop the file reference.
> 
> This begs the question of whether we can always use pagecache_get_page, and
> eliminate the dependency on vaddr.  The translation performance could be
> different, though.
> 
> I have not implemented this yet.  Any thoughts before I do?

That's a pretty hefty usage restriction, but what I take from it is
that these are mechanisms which provide a fallback lookup path that can
service callers in the interim during the gap of the range being
remapped.  The callers are always providing an IOVA and wishing to do
something to the memory referenced by that IOVA, we just need a
translation mechanism.  The IOMMU itself is also such an alternative
lookup, via iommu_iova_to_phys(), but of course requiring an
IOMMU-backed device is just another usage restriction, potentially one
that's not even apparent to the user.
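
For an IOMMU-backed container, that fallback is essentially:

    /* Resolve an IOVA without any process vaddr, via the IOMMU page tables. */
    static struct page *vfio_iova_to_page(struct iommu_domain *domain,
                                          dma_addr_t iova)
    {
        phys_addr_t phys = iommu_iova_to_phys(domain, iova);

        return phys ? pfn_to_page(phys >> PAGE_SHIFT) : NULL;
    }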

Would a more general solution be to make sure there's always an
IOVA-to-phys lookup mechanism available, implementing one if none is
provided by the IOMMU or the memory backing interface?  We'd need to
adapt the dma_rw interface to work on either a VA or a PA; and pinning
pages on UNMAP+REMAP, plus stashing them in a translation structure,
plus dynamically adapting to changes (ex. the IOMMU-backed device being
removed, leaving a non-IOMMU-backed device in the vfio_iommu), all
sounds pretty complicated, especially as the vfio-iommu-type1 backend
becomes stretched to be more and more fragile.  Possibly it's still
feasible though.  Thanks,

Alex
Steven Sistare Oct. 19, 2020, 4:33 p.m. UTC | #10
On 10/15/2020 4:36 PM, Alex Williamson wrote:
> On Thu, 8 Oct 2020 12:32:35 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>> On 8/24/2020 6:30 PM, Alex Williamson wrote:
>>> On Wed, 19 Aug 2020 17:52:26 -0400
>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
>>>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:  
>>>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
>>>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>         
>>>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
>>>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>>>  1 file changed, 7 insertions(+)          
>>>>>>>>>
>>>>>>>>> Hi Alex,
>>>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>>>> such as vfio_iommu_replay.
>>>>>>>>>
>>>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>>>
>>>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>>>> existing MAP_DMA ioctl for this variant.        
>>>>>>>
>>>>>>> Sounds good.
>>>>>>>       
>>>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>>>> simulate DMA via the CPU.
>>>>>>>>
>>>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>>>> IOMMU backend that could drop vaddr use.          
>>>>>>>
>>>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>>>> same conclusion.
>>>>>>>       
>>>>>>>> I'm a bit concerned this new
>>>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>>>> vaddrs.          
>>>>>>>
>>>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>>>> remap might be sufficient, but it needs more study.      
>>>>>>
>>>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>>>> becoming invalid and the remap is unprotected.    
>>>>
>>>> OK.  What if we exclude mediated devices?  It appears they are the only
>>>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>>>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>>>
>>>> The other functions that use dma->vaddr are
>>>>     vfio_dma_do_map 
>>>>     vfio_pin_map_dma 
>>>>     vfio_iommu_replay 
>>>>     vfio_pin_pages_remote
>>>> and they are all initiated via userland ioctl (if I have traced all the code 
>>>> paths correctly).  Thus iommu->lock would protect them.
>>>>
>>>> We would block live update in qemu if the config includes a mediated device.
>>>>
>>>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
>>>
>>> That's not a solution I'd really be in favor of.  We're eliminating an
>>> entire class of devices because they _might_ make use of these
>>> interfaces, but anyone can add a vfio bus driver, even exposing the
>>> same device API, and maybe make use of some of these interfaces in that
>>> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
>>> reason to virtualize some aspect of a device.  I think we're setting
>>> ourselves up for a very complicated support scenario if we just
>>> arbitrarily decide to deny drivers using certain interfaces.
>>>   
>>>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
>>>>>>>
>>>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>>>> live update is occurring.
>>>>>>>       
>>>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>>>> everything and remapping it into the new address space the latency in
>>>>>>>> performing that operation?  Thanks,        
>>>>>>>
>>>>>>> Same answer - we don't require that the guest has vfio migration support.      
>>>>>>
>>>>>> QEMU toggling the runstate of the device via the vfio migration
>>>>>> interface could be done transparently to the guest, but if your
>>>>>> intention is to support any device (where none currently support the
>>>>>> migration interface) perhaps it's a moot point.      
>>>>
>>>> That sounds useful when devices support it.  Can you give me some function names
>>>> or references so I can study this qemu-based "vfio migration interface".  
>>>
>>> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
>>> the QEMU support or implementation in an mdev vendor driver.
>>> Essentially migration exposes a new region of the device which would be
>>> implemented by the vendor driver.  A register within that region
>>> manipulates the device state, so a device could be stopped by clearing
>>> the 'run' bit in that register.
>>>   
>>>>>> It seems like this
>>>>>> scheme only works with IOMMU backed devices where the device can
>>>>>> continue to operate against pinned pages, anything that might need to
>>>>>> dynamically pin pages against the process vaddr as it's running async
>>>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
>>>>
>>>> Yes, true of this remap proposal.
>>>>
>>>> I wanted to unconditionally support all devices, which is why I think that
>>>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>>>> lkml discussion.
>>
>> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
>> from mediated device drivers.
>>
>> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
>> VFIO_IOMMU_MAP_DMA.
>>
>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>>   struct vfio_dma.  Subsequent translation requests are blocked.
>>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>>   Called before exec.
>>
>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Remap the region (iova, size) to vaddr, and resume translation requests.
>>   Called after exec.
>>
>> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
>> driver may use kernel worker threads which serve requests from multiple clients
>> (e.g. i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
>> or is tardy doing so, will delay other processes that are stuck waiting in
>> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
>> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
>> instance?), or in case you see a way to salvage it.
> 
> Right, that's my first thought when I hear that the pin and dma_rw paths
> are blocked as well: we cannot rely on userspace to unblock anything.
> A malicious user may hold out just to see how long it takes for the host
> to become unusable.  Userspace determines how many groups share a
> vfio_iommu.
> 
>> Here is a more robust implementation.  It only works for dma regions backed by
>> a file, such as memfd or shm.
>>
>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Find the file and offset for iova, and save the struct file pointer in
>>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>>   get_user_pages.
>>
>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Remap the region (iova, size) to vaddr and drop the file reference.
>>
>> This raises the question of whether we can always use pagecache_get_page, and
>> eliminate the dependency on vaddr.  The translation performance could be
>> different, though.
>>
>> I have not implemented this yet.  Any thoughts before I do?
> 
> That's a pretty hefty usage restriction, but what I take from it is
> that these are mechanisms which provide a fallback lookup path that can
> service callers during the gap while the range is being remapped.  The
> callers always provide an IOVA and wish to do something to the memory
> referenced by that IOVA; we just need a translation mechanism.  The
> IOMMU itself is also such an alternative lookup, via
> iommu_iova_to_phys(), but of course requiring an IOMMU-backed device is
> just another usage restriction, potentially one that's not even
> apparent to the user.
> 
> Would a more general solution be to make sure there's always an
> IOVA-to-phys lookup mechanism available, implementing one if none is
> provided by the IOMMU or the memory backing interface?  We'd need to
> adapt the dma_rw interface to work on either a VA or a PA; and pinning
> pages on UNMAP+REMAP, plus stashing them in a translation structure,
> plus dynamically adapting to changes (ex. the IOMMU-backed device being
> removed, leaving a non-IOMMU-backed device in the vfio_iommu), all
> sounds pretty complicated, especially as the vfio-iommu-type1 backend
> becomes stretched to be more and more fragile.  Possibly it's still
> feasible though.  Thanks,

Requiring file-backed memory is not a restriction in practice, because there is
no way to preserve MAP_ANON memory across exec and map it into the new qemu process.
That is what MADV_DOEXEC would have provided.  Without it, one cannot do live
update with MAP_ANON memory.

For qemu, when allocating anonymous memory for guest memory regions, we would modify
the allocation functions to call memfd_create + mmap(fd) instead of mmap(MAP_ANON).
memfd_create creates an anonymous tmpfs file (in effect, an unlinked /dev/shm file),
so the memory is backed by a file and the VFIO UNMAP/REMAP proposal works.
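
A minimal sketch of that allocation path (illustrative; qemu's real allocators
have more plumbing, and error handling is trimmed):

    /* Sketch: guest RAM backed by an unlinked tmpfs file instead of MAP_ANON,
     * so the fd survives exec and the memory can be mmap'd again afterward. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    static void *alloc_guest_ram(size_t size, int *fd_out)
    {
        int fd = memfd_create("guest-ram", 0);  /* 0: not close-on-exec */
        void *va = MAP_FAILED;

        if (fd >= 0 && ftruncate(fd, size) == 0) {
            va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        }
        *fd_out = fd;   /* preserve across exec; re-mmap in the new process */
        return va;
    }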

- Steve
Steven Sistare Oct. 26, 2020, 6:28 p.m. UTC | #11
On 10/19/2020 12:33 PM, Steven Sistare wrote:
> On 10/15/2020 4:36 PM, Alex Williamson wrote:
>> On Thu, 8 Oct 2020 12:32:35 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>> On 8/24/2020 6:30 PM, Alex Williamson wrote:
>>>> On Wed, 19 Aug 2020 17:52:26 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
>>>>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:  
>>>>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
>>>>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>>         
>>>>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
>>>>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>>>>
>>>>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>>>>  1 file changed, 7 insertions(+)          
>>>>>>>>>>
>>>>>>>>>> Hi Alex,
>>>>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>>>>> such as vfio_iommu_replay.
>>>>>>>>>>
>>>>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>>>>
>>>>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>>>>> existing MAP_DMA ioctl for this variant.        
>>>>>>>>
>>>>>>>> Sounds good.
>>>>>>>>       
>>>>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>>>>> simulate DMA via the CPU.
>>>>>>>>>
>>>>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>>>>> IOMMU backend that could drop vaddr use.          
>>>>>>>>
>>>>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>>>>> same conclusion.
>>>>>>>>       
>>>>>>>>> I'm a bit concerned this new
>>>>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>>>>> vaddrs.          
>>>>>>>>
>>>>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>>>>> remap might be sufficient, but it needs more study.      
>>>>>>>
>>>>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>>>>> becoming invalid and the remap is unprotected.    
>>>>>
>>>>> OK.  What if we exclude mediated devices?  It appears they are the only
>>>>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>>>>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>>>>
>>>>> The other functions that use dma->vaddr are
>>>>>     vfio_dma_do_map 
>>>>>     vfio_pin_map_dma 
>>>>>     vfio_iommu_replay 
>>>>>     vfio_pin_pages_remote
>>>>> and they are all initiated via userland ioctl (if I have traced all the code 
>>>>> paths correctly).  Thus iommu->lock would protect them.
>>>>>
>>>>> We would block live update in qemu if the config includes a mediated device.
>>>>>
>>>>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
>>>>
>>>> That's not a solution I'd really be in favor of.  We're eliminating an
>>>> entire class of devices because they _might_ make use of these
>>>> interfaces, but anyone can add a vfio bus driver, even exposing the
>>>> same device API, and maybe make use of some of these interfaces in that
>>>> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
>>>> reason to virtualize some aspect of a device.  I think we're setting
>>>> ourselves up for a very complicated support scenario if we just
>>>> arbitrarily decide to deny drivers using certain interfaces.
>>>>   
>>>>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
>>>>>>>>
>>>>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>>>>> live update is occurring.
>>>>>>>>       
>>>>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>>>>> everything and remapping it into the new address space the latency in
>>>>>>>>> performing that operation?  Thanks,        
>>>>>>>>
>>>>>>>> Same answer - we don't require that the guest has vfio migration support.      
>>>>>>>
>>>>>>> QEMU toggling the runstate of the device via the vfio migration
>>>>>>> interface could be done transparently to the guest, but if your
>>>>>>> intention is to support any device (where none currently support the
>>>>>>> migration interface) perhaps it's a moot point.      
>>>>>
>>>>> That sounds useful when devices support it.  Can you give me some function names
>>>>> or references so I can study this qemu-based "vfio migration interface".  
>>>>
>>>> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
>>>> the QEMU support or implementation in an mdev vendor driver.
>>>> Essentially migration exposes a new region of the device which would be
>>>> implemented by the vendor driver.  A register within that region
>>>> manipulates the device state, so a device could be stopped by clearing
>>>> the 'run' bit in that register.
>>>>   
>>>>>>> It seems like this
>>>>>>> scheme only works with IOMMU backed devices where the device can
>>>>>>> continue to operate against pinned pages, anything that might need to
>>>>>>> dynamically pin pages against the process vaddr as it's running async
>>>>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
>>>>>
>>>>> Yes, true of this remap proposal.
>>>>>
>>>>> I wanted to unconditionally support all devices, which is why I think that
>>>>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>>>>> lkml discussion.
>>>
>>> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
>>> from mediated device drivers.
>>>
>>> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
>>> VFIO_IOMMU_MAP_DMA.
>>>
>>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>>>   struct vfio_dma.  Subsequent translation requests are blocked.
>>>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>>>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>>>   Called before exec.
>>>
>>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Remap the region (iova, size) to vaddr, and resume translation requests.
>>>   Called after exec.
>>>
>>> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
>>> driver may use kernel worker threads which serve requests from multiple clients
>>> (e.g. i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
>>> or is tardy doing so, will delay other processes that are stuck waiting in
>>> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
>>> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
>>> instance?), or in case you see a way to salvage it.
>>
>> Right, that's my first thought when I hear that the pin and dma_rw paths
>> are blocked as well: we cannot rely on userspace to unblock anything.
>> A malicious user may hold out just to see how long it takes for the host
>> to become unusable.  Userspace determines how many groups share a
>> vfio_iommu.

I want to reconsider the above proposal.  The pagecache_get_page approach below
has problems, which I elaborate on at the end of this message.

I was confused about the granularity of vfio_iommu sharing, but now I see that each vGPU
in the gvt example would have its own vfio_iommu, so one misbehaving process would not
block another.  Multiple iommu groups within one container do share a vfio_iommu, and a
delay in finishing the remap would block all of those devices from completing pin or
rw operations, but that only blocks the kernel thread servicing that application.
Please say more about how that could cause the system to become unusable.

>>> Here is a more robust implementation.  It only works for dma regions backed by
>>> a file, such as memfd or shm.
>>>
>>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Find the file and offset for iova, and save the struct file pointer in
>>>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>>>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>>>   get_user_pages.
>>>
>>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Remap the region (iova, size) to vaddr and drop the file reference.
>>>
>>> This raises the question of whether we can always use pagecache_get_page, and
>>> eliminate the dependency on vaddr.  The translation performance could be
>>> different, though.
>>>
>>> I have not implemented this yet.  Any thoughts before I do?
>>
>> That's a pretty hefty usage restriction, but what I take from it is
>> that these are mechanisms which provide a fallback lookup path that can
>> service callers during the gap while the range is being remapped.  The
>> callers always provide an IOVA and wish to do something to the memory
>> referenced by that IOVA; we just need a translation mechanism.  The
>> IOMMU itself is also such an alternative lookup, via
>> iommu_iova_to_phys(), but of course requiring an IOMMU-backed device is
>> just another usage restriction, potentially one that's not even
>> apparent to the user.
>>
>> Would a more general solution be to make sure there's always an
>> IOVA-to-phys lookup mechanism available, implementing one if none is
>> provided by the IOMMU or the memory backing interface?  We'd need to
>> adapt the dma_rw interface to work on either a VA or a PA; and pinning
>> pages on UNMAP+REMAP, plus stashing them in a translation structure,
>> plus dynamically adapting to changes (ex. the IOMMU-backed device being
>> removed, leaving a non-IOMMU-backed device in the vfio_iommu), all
>> sounds pretty complicated, especially as the vfio-iommu-type1 backend
>> becomes stretched to be more and more fragile.  Possibly it's still
>> feasible though.  Thanks,
> 
> Requiring file-backed memory is not a restriction in practice, because there is
> no way to preserve MAP_ANON memory across exec and map it into the new qemu process.
> That is what MADV_DOEXEC would have provided.  Without it, one cannot do live
> update with MAP_ANON memory.
> 
> For qemu, when allocating anonymous memory for guest memory regions, we would modify
> the allocation functions to call memfd_create + mmap(fd) instead of mmap(MAP_ANON).
> memfd_create creates an anonymous tmpfs file (in effect, an unlinked /dev/shm file),
> so the memory is backed by a file and the VFIO UNMAP/REMAP proposal works.

I prototyped this using pagecache_get_page() to fault in and translate any page in the
segment, and it works.  However, faulting at this level misses many checks and side
effects that are normally applied at the handle_mm_fault() level and below.  As just one
example, shmem_fault() does accounting and enforces limits for its swap-backed filesystem.
There is no good way to refactor the higher-level functions to apply all of those side
effects while using only a segment offset rather than a VA; those functions use the VA,
vma, and page tables throughout.
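
(For reference, the prototype's lookup was along the following lines.  The
dma->file and dma->file_offset fields are illustrative additions to struct
vfio_dma, saved at UNMAP+REMAP time.)

    /* Sketch: translate an iova via the backing file, not the user VA. */
    static int vfio_file_pfn(struct vfio_dma *dma, dma_addr_t iova,
                             unsigned long *pfn)
    {
        pgoff_t index = (dma->file_offset + (iova - dma->iova)) >> PAGE_SHIFT;
        struct page *page;

        page = pagecache_get_page(dma->file->f_mapping, index,
                                  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                  GFP_KERNEL);
        if (!page)
            return -ENOMEM;
        unlock_page(page);
        *pfn = page_to_pfn(page);   /* page reference is held for the caller */
        return 0;
    }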

To salvage this approach, we could pre-fault and pin the entire dma range during the
UNMAP+REMAP call, and unpin during MAP+REMAP.  If we did so, pagecache_get_page() would
simply return the existing translation.  However, the pinning is very expensive and could
cause a noticeable pause in guest operations if any pin or rw operation has to wait during
the pre-fault phase.  We could instead pin unconditionally in the initial dma_map, before
the guest is started, but that defeats the purpose of the vfio_pin_pages interface.
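
(The pre-fault variant would be a page-sized loop over the range at UNMAP+REMAP
time, e.g., reusing the illustrative helper above:)

    /* Sketch: pre-fault and pin the entire range so later lookups cannot block. */
    dma_addr_t iova;
    unsigned long pfn;
    int ret = 0;

    for (iova = dma->iova; !ret && iova < dma->iova + dma->size;
         iova += PAGE_SIZE) {
        ret = vfio_file_pfn(dma, iova, &pfn);  /* real code must undo on error */
    }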

- Steve
diff mbox series

Patch

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index bb28df1..7ce555a 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -390,6 +390,11 @@  void qemu_anon_ram_free(void *ptr, size_t size);
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_DOEXEC
+#define QEMU_MADV_DOEXEC MADV_DOEXEC
+#else
+#define QEMU_MADV_DOEXEC QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -403,6 +408,7 @@  void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_DOEXEC  QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -416,6 +422,7 @@  void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_DOEXEC  QEMU_MADV_INVALID
 
 #endif
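
With these definitions in place, a caller can issue the advice portably: on a
kernel or libc without MADV_DOEXEC, qemu_madvise() simply fails with EINVAL.
An illustrative use, not part of this patch (host_addr and ram_size are
placeholders):

    /* Illustrative only: ask the kernel to preserve a guest RAM mapping
     * across exec, falling back if the kernel lacks MADV_DOEXEC. */
    if (qemu_madvise(host_addr, ram_size, QEMU_MADV_DOEXEC) != 0) {
        /* live update via exec is unavailable; use another strategy */
    }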