
[qemu,v16,01/19] vfio: Delay DMA address space listener release

Message ID 1462344751-28281-2-git-send-email-aik@ozlabs.ru (mailing list archive)
State New, archived

Commit Message

Alexey Kardashevskiy May 4, 2016, 6:52 a.m. UTC
This postpones VFIO container deinitialization to let region_del()
callbacks (called via vfio_listener_release) do proper clean up
while the group is still attached to the container.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

Comments

Alex Williamson May 5, 2016, 10:39 p.m. UTC | #1
On Wed,  4 May 2016 16:52:13 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This postpones VFIO container deinitialization to let region_del()
> callbacks (called via vfio_listener_release) do proper clean up
> while the group is still attached to the container.

Any mappings within the container should clean themselves up when the
container is deprivileged by removing the last group in the kernel.  Is
the issue that that doesn't happen, which would be a spapr vfio kernel
bug, or that our QEMU side structures get all out of whack if we let
that happen?

> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/common.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fe5ec6a..0b40262 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -921,23 +921,31 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  {
>      VFIOContainer *container = group->container;
>  
> -    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> -        error_report("vfio: error disconnecting group %d from container",
> -                     group->groupid);
> -    }
> -
>      QLIST_REMOVE(group, container_next);
> +
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        VFIOGuestIOMMU *giommu;
> +
> +        vfio_listener_release(container);
> +
> +        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +            memory_region_unregister_iommu_notifier(&giommu->n);
> +        }
> +    }
> +
>      group->container = NULL;
> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> +        error_report("vfio: error disconnecting group %d from container",
> +                     group->groupid);
> +    }
>  
>      if (QLIST_EMPTY(&container->group_list)) {
>          VFIOAddressSpace *space = container->space;
>          VFIOGuestIOMMU *giommu, *tmp;
>  
> -        vfio_listener_release(container);
>          QLIST_REMOVE(container, next);
>  
>          QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
> -            memory_region_unregister_iommu_notifier(&giommu->n);
>              QLIST_REMOVE(giommu, giommu_next);
>              g_free(giommu);
>          }

I'm not spotting why this is a 2-pass process vs simply moving the
existing QLIST_EMPTY cleanup above the ioctl.  Thanks,

Alex
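
For reference, the single-pass shape being suggested would look roughly
like this - a sketch against the patched function, not a tested change:

/* Sketch only: fold both QLIST_EMPTY blocks into one pass before the
 * UNSET_CONTAINER ioctl; the remaining container teardown is elided. */
static void vfio_disconnect_container(VFIOGroup *group)
{
    VFIOContainer *container = group->container;

    QLIST_REMOVE(group, container_next);
    group->container = NULL;

    if (QLIST_EMPTY(&container->group_list)) {
        VFIOGuestIOMMU *giommu, *tmp;

        /* The group is still attached here, so region_del() callbacks
         * triggered by the listener release can reach the kernel. */
        vfio_listener_release(container);

        QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
            memory_region_unregister_iommu_notifier(&giommu->n);
            QLIST_REMOVE(giommu, giommu_next);
            g_free(giommu);
        }
    }

    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
        error_report("vfio: error disconnecting group %d from container",
                     group->groupid);
    }

    /* ... the rest of the container teardown stays as it is ... */
}
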
Alexey Kardashevskiy May 13, 2016, 7:16 a.m. UTC | #2
On 05/06/2016 08:39 AM, Alex Williamson wrote:
> On Wed,  4 May 2016 16:52:13 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This postpones VFIO container deinitialization to let region_del()
>> callbacks (called via vfio_listener_release) do proper clean up
>> while the group is still attached to the container.
>
> Any mappings within the container should clean themselves up when the
> container is deprivileged by removing the last group in the kernel. Is
> the issue that that doesn't happen, which would be a spapr vfio kernel
> bug, or that our QEMU side structures get all out of whack if we let
> that happen?

My mailbase got corrupted, missed that.

This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
Notify IOMMU about starting/stopping being used by VFIO"; I should have put
01/19 and 02/19 right before 17/19, sorry about that.


On every reboot the spapr machine removes all (i.e. one or two) windows
and creates the default one.

I do this by memory_region_del_subregion(iommu_mr) +
memory_region_add_subregion(iommu_mr), which gets translated to
VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via
vfio_memory_listener if VFIO is present; there are no direct calls from
spapr to vfio => cool. During the machine reset, the VFIO device is there
with the container and groups attached, at some point with no windows.
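
A rough sketch of that reset path, with made-up names (the real spapr
code is structured differently):

/* Illustrative only: what the reset-time window teardown/recreate looks
 * like from the memory API side. Names are invented for this sketch. */
static void spapr_phb_reset_dma_window(sPAPRPHBState *sphb,
                                       sPAPRTCETable *tcet)
{
    /* Triggers region_del() in vfio_memory_listener, i.e.
     * VFIO_IOMMU_SPAPR_TCE_REMOVE if a container is attached. */
    memory_region_del_subregion(&sphb->iommu_root, &tcet->iommu);

    /* Triggers region_add(), i.e. VFIO_IOMMU_SPAPR_TCE_CREATE for the
     * default 32-bit window. */
    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
                                &tcet->iommu);
}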

Now to VFIO plug/unplug.

When VFIO plug happens, vfio_memory_listener is created, region_add() is
called, and the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If
region_del() is not called when the container is being destroyed (as before
this patchset), then the kernel cleans up and destroys the windows when
close(container->fd) is called or when qemu is killed (and this fd is
naturally closed). I hope this answers the comment on 02/19.
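
For the plug side, a heavily simplified view of what the listener ends up
doing for a spapr container (window geometry and error handling omitted;
the helper names are invented for this sketch):

/* Sketch: region_add() boils down to creating a hardware DMA window,
 * region_del() to removing it. */
static void vfio_spapr_window_add(VFIOContainer *container,
                                  MemoryRegionSection *section)
{
    struct vfio_iommu_spapr_tce_create create = {
        .argsz = sizeof(create),
        .page_shift = 12,      /* e.g. 4K TCEs */
        .window_size = int128_get64(section->size),
        .levels = 1,
    };

    /* On success, create.start_addr tells us where the window went. */
    ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
}

static void vfio_spapr_window_del(VFIOContainer *container, hwaddr offset)
{
    struct vfio_iommu_spapr_tce_remove remove = {
        .argsz = sizeof(remove),
        .start_addr = offset,
    };

    ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
}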

So far so good (right?)

However I also have a guest view of the TCE table; this is what the guest
sees and what emulated PCI devices use. This guest view is allocated either
in KVM (so H_PUT_TCE can be handled quickly right in the host kernel, even
in real mode) or in userspace (the VFIO case).

I generally want the guest view to be in KVM. However, when I plug VFIO,
I have to move the table to userspace. When I unplug VFIO, I want to do
the opposite, so I need a way to tell spapr that it can move the table
back. region_del() seemed a natural way of doing this as region_add() is
already doing the opposite part.

With this patchset, each IOMMU MR gets a usage counter: region_add() does
+1, region_del() does -1 (yeah, not extremely optimal during reset). When
the counter goes from 0 to 1, the vfio_start() hook is called; when the
counter drops back to 0, vfio_stop() is called. Note that we may have
multiple VFIO containers on the same PHB.
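
Schematically the counting looks like this (a sketch; vfio_users and the
hook invocations are placeholders for whatever the series ends up with):

/* Sketch of the 0<->1 transitions driving the hooks described above. */
static void iommu_mr_vfio_ref(sPAPRTCETable *tcet)
{
    if (tcet->vfio_users++ == 0) {
        /* first VFIO user: move the guest view of the table from KVM
         * to userspace */
        vfio_start(tcet);
    }
}

static void iommu_mr_vfio_unref(sPAPRTCETable *tcet)
{
    if (--tcet->vfio_users == 0) {
        /* last VFIO user gone: the table can go back into KVM */
        vfio_stop(tcet);
    }
}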

Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
decrement steps in vfio_disconnect_container(). And I still cannot move
the counting from region_add() to vfio_connect_container(), so there will
be an asymmetry, which I am fine with - I am just checking: what would be
the best approach here?


Thanks.
Alex Williamson May 13, 2016, 10:24 p.m. UTC | #3
On Fri, 13 May 2016 17:16:48 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > On Wed,  4 May 2016 16:52:13 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >  
> >> This postpones VFIO container deinitialization to let region_del()
> >> callbacks (called via vfio_listener_release) do proper clean up
> >> while the group is still attached to the container.  
> >
> > Any mappings within the container should clean themselves up when the
> > container is deprivileged by removing the last group in the kernel. Is
> > the issue that that doesn't happen, which would be a spapr vfio kernel
> > bug, or that our QEMU side structures get all out of whack if we let
> > that happen?  
> 
> My mailbase got corrupted, missed that.
> 
> This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> Notify IOMMU about starting/stopping being used by VFIO"; I should have put
> 01/19 and 02/19 right before 17/19, sorry about that.

Which I object to; it's just ridiculous to have vfio start/stop
callbacks in a set of generic iommu region ops.

> On every reboot the spapr machine removes all (i.e. one or two) windows
> and creates the default one.
> 
> I do this by memory_region_del_subregion(iommu_mr) +
> memory_region_add_subregion(iommu_mr), which gets translated to
> VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via
> vfio_memory_listener if VFIO is present; there are no direct calls from
> spapr to vfio => cool. During the machine reset, the VFIO device is there
> with the container and groups attached, at some point with no windows.
> 
> Now to VFIO plug/unplug.
> 
> When VFIO plug happens, vfio_memory_listener is created, region_add() is
> called, and the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
> Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If
> region_del() is not called when the container is being destroyed (as before
> this patchset), then the kernel cleans up and destroys the windows when
> close(container->fd) is called or when qemu is killed (and this fd is
> naturally closed). I hope this answers the comment on 02/19.
> 
> So far so good (right?)
> 
> However I also have a guest view of the TCE table; this is what the guest
> sees and what emulated PCI devices use. This guest view is allocated either
> in KVM (so H_PUT_TCE can be handled quickly right in the host kernel, even
> in real mode) or in userspace (the VFIO case).
> 
> I generally want the guest view to be in KVM. However, when I plug VFIO,
> I have to move the table to userspace. When I unplug VFIO, I want to do
> the opposite, so I need a way to tell spapr that it can move the table
> back. region_del() seemed a natural way of doing this as region_add() is
> already doing the opposite part.
> 
> With this patchset, each IOMMU MR gets a usage counter: region_add() does
> +1, region_del() does -1 (yeah, not extremely optimal during reset). When
> the counter goes from 0 to 1, the vfio_start() hook is called; when the
> counter drops back to 0, vfio_stop() is called. Note that we may have
> multiple VFIO containers on the same PHB.
> 
> Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
> decrement steps in vfio_disconnect_container(). And I still cannot move
> the counting from region_add() to vfio_connect_container(), so there will
> be an asymmetry, which I am fine with - I am just checking: what would be
> the best approach here?


You're imposing on other iommu models (type1) that in order to release
a container we first deregister the listener, which unmaps all of
the mappings within that region.  That's inefficient when we can simply
unset the container and move on.  So you're imposing an inefficiency on
a separate vfio iommu model for the bookkeeping of your own.  I don't
think that's a reasonable approach.  Has it even been tested how that
affects type1 users?  When a container is closed, clearly it shouldn't
be contributing to reference counts, so it seems like there must be
other ways to handle this.
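
Concretely, replaying the teardown through region_del() means type1 issues
one unmap per mapped section before the container goes away, along these
lines (a sketch only):

/* Sketch: per-section cost of releasing the listener first. Without it,
 * the kernel drops all mappings at once when the container is released. */
static int vfio_dma_unmap_section(VFIOContainer *container,
                                  hwaddr iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = 0,
        .iova = iova,
        .size = size,
    };

    return ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}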

David Gibson May 25, 2016, 6:34 a.m. UTC | #4
On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> On Fri, 13 May 2016 17:16:48 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On 05/06/2016 08:39 AM, Alex Williamson wrote:
> > > On Wed,  4 May 2016 16:52:13 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >  
> > >> This postpones VFIO container deinitialization to let region_del()
> > >> callbacks (called via vfio_listener_release) do proper clean up
> > >> while the group is still attached to the container.  
> > >
> > > Any mappings within the container should clean themselves up when the
> > > container is deprivileged by removing the last group in the kernel. Is
> > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > bug, or that our QEMU side structures get all out of whack if we let
> > > that happen?  
> > 
> > My mailbase got corrupted, missed that.
> > 
> > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > Notify IOMMU about starting/stopping being used by VFIO"; I should have put
> > 01/19 and 02/19 right before 17/19, sorry about that.
> 
> Which I object to; it's just ridiculous to have vfio start/stop
> callbacks in a set of generic iommu region ops.

It's ugly, but I don't actually see a better way to do this (the
general concept of having vfio start/stop callbacks, that is, not the
specifics of the patches).

The fact is that how we implement the guest side IOMMU *does* need to
change depending on whether VFIO devices are present or not.  That's
due essentially to incompatibilities between a couple of kernel
mechanisms - which is in itself ugly, but nonetheless real.

A (usually blank) vfio on/off callback in the guest side IOMMU ops
seems like the least-bad way to handle this.
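
That is, something shaped like this in the guest-side IOMMU ops (a sketch
of the concept only; vfio_start/vfio_stop are the callbacks proposed in
17/19, and the translate signature is the one QEMU has today):

/* Sketch: guest IOMMU ops with usually-NULL vfio on/off callbacks. */
struct MemoryRegionIOMMUOps {
    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr,
                               bool is_write);
    /* Usually NULL; spapr would use these to move the guest view of
     * the TCE table between KVM and userspace. */
    void (*vfio_start)(MemoryRegion *iommu);
    void (*vfio_stop)(MemoryRegion *iommu);
};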

> > On every reboot the spapr machine removes all (i.e. one or two) windows
> > and creates the default one.
> > 
> > I do this by memory_region_del_subregion(iommu_mr) +
> > memory_region_add_subregion(iommu_mr), which gets translated to
> > VFIO_IOMMU_SPAPR_TCE_REMOVE + VFIO_IOMMU_SPAPR_TCE_CREATE via
> > vfio_memory_listener if VFIO is present; there are no direct calls from
> > spapr to vfio => cool. During the machine reset, the VFIO device is there
> > with the container and groups attached, at some point with no windows.
> > 
> > Now to VFIO plug/unplug.
> > 
> > When VFIO plug happens, vfio_memory_listener is created, region_add() is
> > called, and the hardware window is created (via VFIO_IOMMU_SPAPR_TCE_CREATE).
> > Unplugging should end up doing VFIO_IOMMU_SPAPR_TCE_REMOVE somehow. If
> > region_del() is not called when the container is being destroyed (as before
> > this patchset), then the kernel cleans up and destroys the windows when
> > close(container->fd) is called or when qemu is killed (and this fd is
> > naturally closed). I hope this answers the comment on 02/19.
> > 
> > So far so good (right?)
> > 
> > However I also have a guest view of the TCE table; this is what the guest
> > sees and what emulated PCI devices use. This guest view is allocated either
> > in KVM (so H_PUT_TCE can be handled quickly right in the host kernel, even
> > in real mode) or in userspace (the VFIO case).
> > 
> > I generally want the guest view to be in KVM. However, when I plug VFIO,
> > I have to move the table to userspace. When I unplug VFIO, I want to do
> > the opposite, so I need a way to tell spapr that it can move the table
> > back. region_del() seemed a natural way of doing this as region_add() is
> > already doing the opposite part.
> > 
> > With this patchset, each IOMMU MR gets a usage counter: region_add() does
> > +1, region_del() does -1 (yeah, not extremely optimal during reset). When
> > the counter goes from 0 to 1, the vfio_start() hook is called; when the
> > counter drops back to 0, vfio_stop() is called. Note that we may have
> > multiple VFIO containers on the same PHB.
> > 
> > Without 01/19 and 02/19, I'll have to repeat region_del()'s counter
> > decrement steps in vfio_disconnect_container(). And I still cannot move
> > the counting from region_add() to vfio_connect_container(), so there will
> > be an asymmetry, which I am fine with - I am just checking: what would be
> > the best approach here?
> 
> 
> You're imposing on other iommu models (type1) that in order to release
> a container we first deregister the listener, which unmaps all of
> the mappings within that region.  That's inefficient when we can simply
> unset the container and move on.  So you're imposing an inefficiency on
> a separate vfio iommu model for the bookkeeping of your own.  I don't
> think that's a reasonable approach.  Has it even been tested how that
> affects type1 users?  When a container is closed, clearly it shouldn't
> be contributing to reference counts, so it seems like there must be
> other ways to handle this.

My first guess is to agree, but I'll look at that more carefully when
I actually get to the patch doing that.

What I really don't understand about this one is what the
group<->container connection - an entirely host side matter - has to
do with the reference counting here, which is per *guest* side IOMMU.

Alex Williamson May 25, 2016, 1:59 p.m. UTC | #5
On Wed, 25 May 2016 16:34:37 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> > On Fri, 13 May 2016 17:16:48 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > On 05/06/2016 08:39 AM, Alex Williamson wrote:  
> > > > On Wed,  4 May 2016 16:52:13 +1000
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >    
> > > >> This postpones VFIO container deinitialization to let region_del()
> > > >> callbacks (called via vfio_listener_release) do proper clean up
> > > >> while the group is still attached to the container.    
> > > >
> > > > Any mappings within the container should clean themselves up when the
> > > > container is deprivileged by removing the last group in the kernel. Is
> > > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > > bug, or that our QEMU side structures get all out of whack if we let
> > > > that happen?    
> > > 
> > > My mailbase got corrupted, missed that.
> > > 
> > > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > > Notify IOMMU about starting/stopping being used by VFIO"; I should have put
> > > 01/19 and 02/19 right before 17/19, sorry about that.  
> > 
> > Which I object to; it's just ridiculous to have vfio start/stop
> > callbacks in a set of generic iommu region ops.
> 
> It's ugly, but I don't actually see a better way to do this (the
> general concept of having vfio start/stop callbacks, that is, not the
> specifics of the patches).
> 
> The fact is that how we implement the guest side IOMMU *does* need to
> change depending on whether VFIO devices are present or not. 

No, how the guest side iommu is implemented needs to change depending
on whether there's someone, anyone, in QEMU that cares about the iommu,
which can be determined by whether the iommu notifier has any clients.
Alexey has posted another patch that does this.

> That's
> due essentially to incompatibilities between a couple of kernel
> mechanisms - which is in itself ugly, but nonetheless real.
> 
> A (usually blank) vfio on/off callback in the guest side IOMMU ops
> seems like the least-bad way to handle this.

I disagree: we already call memory_region_register_iommu_notifier() to
indicate we care about the guest iommu, so the abstraction is already
there; there's absolutely no reason to make a vfio-specific interface.
Thanks,

Alex
David Gibson May 26, 2016, 1 a.m. UTC | #6
On Wed, May 25, 2016 at 07:59:26AM -0600, Alex Williamson wrote:
> On Wed, 25 May 2016 16:34:37 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, May 13, 2016 at 04:24:53PM -0600, Alex Williamson wrote:
> > > On Fri, 13 May 2016 17:16:48 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > > > On 05/06/2016 08:39 AM, Alex Williamson wrote:  
> > > > > On Wed,  4 May 2016 16:52:13 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >    
> > > > >> This postpones VFIO container deinitialization to let region_del()
> > > > >> callbacks (called via vfio_listener_release) do proper clean up
> > > > >> while the group is still attached to the container.    
> > > > >
> > > > > Any mappings within the container should clean themselves up when the
> > > > > container is deprivileged by removing the last group in the kernel. Is
> > > > > the issue that that doesn't happen, which would be a spapr vfio kernel
> > > > > bug, or that our QEMU side structures get all out of whack if we let
> > > > > that happen?    
> > > > 
> > > > My mailbase got corrupted, missed that.
> > > > 
> > > > This is mostly for "[PATCH qemu v16 17/19] spapr_iommu, vfio, memory: 
> > > > Notify IOMMU about starting/stopping being used by VFIO"; I should have put
> > > > 01/19 and 02/19 right before 17/19, sorry about that.  
> > > 
> > > Which I object to; it's just ridiculous to have vfio start/stop
> > > callbacks in a set of generic iommu region ops.
> > 
> > It's ugly, but I don't actually see a better way to do this (the
> > general concept of having vfio start/stop callbacks, that is, not the
> > specifics of the patches).
> > 
> > The fact is that how we implement the guest side IOMMU *does* need to
> > change depending on whether VFIO devices are present or not. 
> 
> No, how the guest side iommu is implemented needs to change depending
> on whether there's someone, anyone, in QEMU that cares about the iommu,
> which can be determined by whether the iommu notifier has any clients.
> Alexey has posted another patch that does this.

*thinks*  ah, yes, you're right of course.  So instead we need some
hook that's triggered when the number of notifier listeners transitions
between zero and non-zero.
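
Something like this, say (an illustrative sketch, not the posted patch;
notify_started is a placeholder name, and the unregister path would do
the mirror non-empty -> empty check):

/* Illustrative: fire a hook when the notifier list goes from empty to
 * non-empty at registration time. */
void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
{
    bool was_empty = QLIST_EMPTY(&mr->iommu_notify.notifiers);

    notifier_list_add(&mr->iommu_notify, n);
    if (was_empty && mr->iommu_ops->notify_started) {
        mr->iommu_ops->notify_started(mr);
    }
}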


Patch

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fe5ec6a..0b40262 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -921,23 +921,31 @@  static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
 
-    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
-        error_report("vfio: error disconnecting group %d from container",
-                     group->groupid);
-    }
-
     QLIST_REMOVE(group, container_next);
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        VFIOGuestIOMMU *giommu;
+
+        vfio_listener_release(container);
+
+        QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+            memory_region_unregister_iommu_notifier(&giommu->n);
+        }
+    }
+
     group->container = NULL;
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
 
     if (QLIST_EMPTY(&container->group_list)) {
         VFIOAddressSpace *space = container->space;
         VFIOGuestIOMMU *giommu, *tmp;
 
-        vfio_listener_release(container);
         QLIST_REMOVE(container, next);
 
         QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
-            memory_region_unregister_iommu_notifier(&giommu->n);
             QLIST_REMOVE(giommu, giommu_next);
             g_free(giommu);
         }