[0/4] Fix splats related to using the iommu_group after destroying devices

Message ID 0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com (mailing list archive)

Message

Jason Gunthorpe Sept. 8, 2022, 6:44 p.m. UTC
The basic issue is that the iommu_group is being used by VFIO after all
the device drivers have been removed.

In part this is caused by bad logic inside the iommu core that doesn't
properly sequence removing the device from the group, and in part by bad
logic in VFIO that continues to use device->iommu_group after all VFIO
device drivers have been removed.

Fix both situations. Either fix alone should fix the bug reported, but
both together bring a nice robust design to this area.

This is a followup from this thread:

https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/

Matthew confirmed an earlier version of the series solved the issue; it
would be best if he could test this version as well to confirm the
various changes are still OK.

The iommu patch is independent of the other patches; it can go through
the iommu rc tree.

Jason Gunthorpe (4):
  vfio: Simplify vfio_create_group()
  vfio: Move the sanity check of the group to vfio_create_group()
  vfio: Follow a strict lifetime for struct iommu_group *
  iommu: Fix ordering of iommu_release_device()

 drivers/iommu/iommu.c    |  36 ++++++--
 drivers/vfio/vfio_main.c | 172 +++++++++++++++++++++------------------
 2 files changed, 120 insertions(+), 88 deletions(-)


base-commit: 245898eb9275ce31942cff95d0bdc7412ad3d589

Comments

Matthew Rosato Sept. 9, 2022, 12:49 p.m. UTC | #1
On 9/8/22 2:44 PM, Jason Gunthorpe wrote:
> The basic issue is that the iommu_group is being used by VFIO after all
> the device drivers have been removed.
> 
> In part this is caused by bad logic inside the iommu core that doesn't
> properly sequence removing the device from the group, and in part by bad
> logic in VFIO that continues to use device->iommu_group after all VFIO
> device drivers have been removed.
> 
> Fix both situations. Either fix alone should fix the bug reported, but
> both together bring a nice robust design to this area.
> 
> This is a followup from this thread:
> 
> https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/
> 
> Matthew confirmed an earlier version of the series solved the issue; it
> would be best if he could test this version as well to confirm the
> various changes are still OK.

FYI I've been running this series (+ the incremental to patch 4 you mentioned) against my original repro scenario in a loop overnight, looks good.

> 
> The iommu patch is independent of the other patches; it can go through
> the iommu rc tree.
> 
> Jason Gunthorpe (4):
>   vfio: Simplify vfio_create_group()
>   vfio: Move the sanity check of the group to vfio_create_group()
>   vfio: Follow a strict lifetime for struct iommu_group *
>   iommu: Fix ordering of iommu_release_device()
> 
>  drivers/iommu/iommu.c    |  36 ++++++--
>  drivers/vfio/vfio_main.c | 172 +++++++++++++++++++++------------------
>  2 files changed, 120 insertions(+), 88 deletions(-)
> 
> 
> base-commit: 245898eb9275ce31942cff95d0bdc7412ad3d589
Jason Gunthorpe Sept. 9, 2022, 4:24 p.m. UTC | #2
On Fri, Sep 09, 2022 at 08:49:40AM -0400, Matthew Rosato wrote:
> On 9/8/22 2:44 PM, Jason Gunthorpe wrote:
> > The basic issue is that the iommu_group is being used by VFIO after all
> > the device drivers have been removed.
> > 
> > In part this is caused by bad logic inside the iommu core that doesn't
> > properly sequence removing the device from the group, and in part by bad
> > logic in VFIO that continues to use device->iommu_group after all VFIO
> > device drivers have been removed.
> > 
> > Fix both situations. Either fix alone should fix the bug reported, but
> > both together bring a nice robust design to this area.
> > 
> > This is a followup from this thread:
> > 
> > https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/
> > 
> > Matthew confirmed an earlier version of the series solved the issue; it
> > would be best if he could test this version as well to confirm the
> > various changes are still OK.
> 
> FYI I've been running this series (+ the incremental to patch 4 you
> mentioned) against my original repro scenario in a loop overnight,
> looks good.

Thanks Matthew, looks like we need some more time on the last patch
but I think the VFIO ones are OK if Alex wants to pick them before LPC
is over.

Jason