
[1/4] kvm: Destroy & free KVM devices on release

Message ID 20131029161322.22578.18997.stgit@bling.home (mailing list archive)
State New, archived

Commit Message

Alex Williamson Oct. 29, 2013, 4:13 p.m. UTC
The KVM device interface allocates a struct kvm_device and calls
kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
This returns a file descriptor to the user for them to set/get/check
further attributes.  On closing the file descriptor, one would assume
that kvm_device_ops.destroy is called and all traces of the device
would go away.  One would be wrong; it actually does nothing more
than release the struct kvm reference, waiting until the VM is
destroyed before doing more.  This leaves devices that only want a
single instance of themselves per VM in a tough spot.

To fix this, do full cleanup on release of the device file descriptor.
It's also non-symmetric that one of the existing devices frees the
struct kvm_device from its .destroy function, while the other
doesn't.  KVM-core allocates the structure and should therefore be
responsible for freeing it.  Finally, add a missing kfree for the
device creation error path.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 arch/powerpc/kvm/book3s_xics.c |    1 -
 virt/kvm/kvm_main.c            |    5 +++++
 2 files changed, 5 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Gleb Natapov Oct. 30, 2013, 10:40 a.m. UTC | #1
On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> The KVM device interface allocates a struct kvm_device and calls
> kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> This returns a file descriptor to the user for them to set/get/check
> further attributes.  On closing the file descriptor, one would assume
> that kvm_device_ops.destroy is called and all traces of the device
> would go away.  One would be wrong, it actually does nothing more
> than release the struct kvm reference, waiting until the VM is
> destroyed before doing more.  This leaves devices that only want a
> single instance of themselves per VM in a tough spot.
> 
This is by design. Otherwise locking would be needed on each device
access, and for interrupt controllers that is unnecessary serialization
and overhead. The device API is not designed for devices that can go
away while the machine is running anyway, so after creation a device is
only destroyed during VM destruction.

> To fix this, do full cleanup on release of the device file descriptor.
> It's also non-symmetric that one of the existing devices frees the
> struct kvm_device from it's .destroy function, while the other
> doesn't.  KVM-core allocates the structure and should therefore be
> responsible for freeing it.  Finally, add a missing kfree for the
> device creation error path.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  arch/powerpc/kvm/book3s_xics.c |    1 -
>  virt/kvm/kvm_main.c            |    5 +++++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> index a3a5cb8..9a82426 100644
> --- a/arch/powerpc/kvm/book3s_xics.c
> +++ b/arch/powerpc/kvm/book3s_xics.c
> @@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
>  	for (i = 0; i <= xics->max_icsid; i++)
>  		kfree(xics->ics[i]);
>  	kfree(xics);
> -	kfree(dev);
>  }
>  
>  static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a9dd682..fec8320 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)
>  
>  		list_del(node);
>  		dev->ops->destroy(dev);
> +		kfree(dev);
>  	}
>  }
>  
> @@ -2231,6 +2232,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
>  	struct kvm_device *dev = filp->private_data;
>  	struct kvm *kvm = dev->kvm;
>  
> +	list_del(&dev->vm_node);
> +	dev->ops->destroy(dev);
> +	kfree(dev);
>  	kvm_put_kvm(kvm);
>  	return 0;
>  }
> @@ -2294,6 +2298,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
>  	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
>  	if (ret < 0) {
>  		ops->destroy(dev);
> +		kfree(dev);
>  		return ret;
>  	}
>  

--
			Gleb.
Alex Williamson Oct. 30, 2013, 2:30 p.m. UTC | #2
On Wed, 2013-10-30 at 12:40 +0200, Gleb Natapov wrote:
> On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> > The KVM device interface allocates a struct kvm_device and calls
> > kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> > This returns a file descriptor to the user for them to set/get/check
> > further attributes.  On closing the file descriptor, one would assume
> > that kvm_device_ops.destroy is called and all traces of the device
> > would go away.  One would be wrong, it actually does nothing more
> > than release the struct kvm reference, waiting until the VM is
> > destroyed before doing more.  This leaves devices that only want a
> > single instance of themselves per VM in a tough spot.
> > 
> This is by design. Otherwise locking will be needed on each device access
> and for interrupt controllers this is unnecessary serialization and
> overhead. Device API is not designed for devices that can go away while
> machine is running anyway, so after creation device is only destroyed
> during VM destruction.

Hmm, ok.  In that case I can drop this patch and I think the rest just
boils down to userspace use of the device.  I had been close()'ing the
kvm device fd when all QEMU vfio devices are detached, but I can just as
easily leave it open in case a new device is added later.  I'll send out
a new series after doing some more review and testing.  Do you have any
comments on the rest of the series?  Thanks,

Alex

> > To fix this, do full cleanup on release of the device file descriptor.
> > It's also non-symmetric that one of the existing devices frees the
> > struct kvm_device from it's .destroy function, while the other
> > doesn't.  KVM-core allocates the structure and should therefore be
> > responsible for freeing it.  Finally, add a missing kfree for the
> > device creation error path.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  arch/powerpc/kvm/book3s_xics.c |    1 -
> >  virt/kvm/kvm_main.c            |    5 +++++
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
> > index a3a5cb8..9a82426 100644
> > --- a/arch/powerpc/kvm/book3s_xics.c
> > +++ b/arch/powerpc/kvm/book3s_xics.c
> > @@ -1220,7 +1220,6 @@ static void kvmppc_xics_free(struct kvm_device *dev)
> >  	for (i = 0; i <= xics->max_icsid; i++)
> >  		kfree(xics->ics[i]);
> >  	kfree(xics);
> > -	kfree(dev);
> >  }
> >  
> >  static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index a9dd682..fec8320 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -572,6 +572,7 @@ static void kvm_destroy_devices(struct kvm *kvm)
> >  
> >  		list_del(node);
> >  		dev->ops->destroy(dev);
> > +		kfree(dev);
> >  	}
> >  }
> >  
> > @@ -2231,6 +2232,9 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
> >  	struct kvm_device *dev = filp->private_data;
> >  	struct kvm *kvm = dev->kvm;
> >  
> > +	list_del(&dev->vm_node);
> > +	dev->ops->destroy(dev);
> > +	kfree(dev);
> >  	kvm_put_kvm(kvm);
> >  	return 0;
> >  }
> > @@ -2294,6 +2298,7 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
> >  	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
> >  	if (ret < 0) {
> >  		ops->destroy(dev);
> > +		kfree(dev);
> >  		return ret;
> >  	}
> >  
> 
> --
> 			Gleb.



Gleb Natapov Oct. 30, 2013, 3:33 p.m. UTC | #3
On Wed, Oct 30, 2013 at 08:30:22AM -0600, Alex Williamson wrote:
> On Wed, 2013-10-30 at 12:40 +0200, Gleb Natapov wrote:
> > On Tue, Oct 29, 2013 at 10:13:22AM -0600, Alex Williamson wrote:
> > > The KVM device interface allocates a struct kvm_device and calls
> > > kvm_device_ops.create on it from KVM VM ioctl KVM_CREATE_DEVICE.
> > > This returns a file descriptor to the user for them to set/get/check
> > > further attributes.  On closing the file descriptor, one would assume
> > > that kvm_device_ops.destroy is called and all traces of the device
> > > would go away.  One would be wrong, it actually does nothing more
> > > than release the struct kvm reference, waiting until the VM is
> > > destroyed before doing more.  This leaves devices that only want a
> > > single instance of themselves per VM in a tough spot.
> > > 
> > This is by design. Otherwise locking will be needed on each device access
> > and for interrupt controllers this is unnecessary serialization and
> > overhead. Device API is not designed for devices that can go away while
> > machine is running anyway, so after creation device is only destroyed
> > during VM destruction.
> 
> Hmm, ok.  In that case I can drop this patch and I think the rest just
> boils down to userspace use of the device.  I had been close()'ing the
> kvm device fd when all QEMU vfio devices are detached, but I can just as
> easily leave it open in case a new device is added later.  I'll send out
> a new series after doing some more review and testing.  Do you have any
> comments on the rest of the series?  Thanks,
> 
If I understand 4/4 correctly, if there is a VFIO device connected we
assume a non-coherent domain. How hard would it be to do proper
checking in this patch series?

--
			Gleb.
Paolo Bonzini Oct. 30, 2013, 3:42 p.m. UTC | #4
Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > Hmm, ok.  In that case I can drop this patch and I think the rest just
> > boils down to userspace use of the device.  I had been close()'ing the
> > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > easily leave it open in case a new device is added later.  I'll send out
> > a new series after doing some more review and testing.  Do you have any
> > comments on the rest of the series?  Thanks,
>
> If I understand 4/4 correctly if there is VFIO device connected we
> assume non coherent domain. How hard it would be to do proper checking
> in this path series?

Yes, that's my understanding as well.  Is the performance impact measurable?

Paolo
Gleb Natapov Oct. 30, 2013, 3:44 p.m. UTC | #5
On Wed, Oct 30, 2013 at 04:42:15PM +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > > Hmm, ok.  In that case I can drop this patch and I think the rest just
> > > boils down to userspace use of the device.  I had been close()'ing the
> > > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > > easily leave it open in case a new device is added later.  I'll send out
> > > a new series after doing some more review and testing.  Do you have any
> > > comments on the rest of the series?  Thanks,
> >
> > If I understand 4/4 correctly if there is VFIO device connected we
> > assume non coherent domain. How hard it would be to do proper checking
> > in this path series?
> 
> Yes, that's my understanding as well.  Is the performance impact measurable?
> 
We grant a guest permission to call wbinvd() needlessly.

--
			Gleb.
Alex Williamson Oct. 30, 2013, 3:56 p.m. UTC | #6
On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> > > Hmm, ok.  In that case I can drop this patch and I think the rest just
> > > boils down to userspace use of the device.  I had been close()'ing the
> > > kvm device fd when all QEMU vfio devices are detached, but I can just as
> > > easily leave it open in case a new device is added later.  I'll send out
> > > a new series after doing some more review and testing.  Do you have any
> > > comments on the rest of the series?  Thanks,
> >
> > If I understand 4/4 correctly if there is VFIO device connected we
> > assume non coherent domain. How hard it would be to do proper checking
> > in this path series?
> 
> Yes, that's my understanding as well.  Is the performance impact measurable?

I have additional patches to do this, the blocker is that intel-iommu
strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
whole (ie. all of the hardware units involved in the domain) do not all
support Snoop Control (SC).  Thus I can't rely on simply tracking the
hardware capabilities of the domain because some IOMMU PTEs will have
SNP set, others will not.  It depends on the state of the IOMMU domain
at the time of the mapping.  Therefore the only way to switch back to
coherent from non-coherent is to unmap and remap everything.  This is
what legacy KVM device assignment does, but I can't see how that's not
racy wrt inflight DMA.

The patch approach I was taking is:

1) Enable KVM to handle the VM as non-coherent (which is true; VFIO
never sets IOMMU_CACHE currently due to the above issues).

2) Get QEMU to enable the KVM device, fixing the coherency issue.

3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
(patch posted).

4) Make VFIO always map w/ IOMMU_CACHE

5) Create VFIO external user interface to query capabilities.

6) Update KVM device to use it.

As to performance, for graphics I can't tell a difference whether
NoSnoop is prevented or KVM does WBINVD.  I'm hoping though that we can
consider the mode enabled by this patch as a temporary step in the
process and we'll eventually get to a state where we only emulate WBINVD
when necessary.  Correctness trumps performance in the current round.
Thanks,

Alex

Paolo Bonzini Oct. 30, 2013, 4:10 p.m. UTC | #7
Il 30/10/2013 16:56, Alex Williamson ha scritto:
> On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
>> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
>>>> Hmm, ok.  In that case I can drop this patch and I think the rest just
>>>> boils down to userspace use of the device.  I had been close()'ing the
>>>> kvm device fd when all QEMU vfio devices are detached, but I can just as
>>>> easily leave it open in case a new device is added later.  I'll send out
>>>> a new series after doing some more review and testing.  Do you have any
>>>> comments on the rest of the series?  Thanks,
>>>
>>> If I understand 4/4 correctly if there is VFIO device connected we
>>> assume non coherent domain. How hard it would be to do proper checking
>>> in this path series?
>>
>> Yes, that's my understanding as well.  Is the performance impact measurable?
> 
> I have additional patches to do this, the blocker is that intel-iommu
> strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
> whole (ie. all of the hardware units involved in the domain) do not all
> support Snoop Control (SC).  Thus I can't rely on simply tracking the
> hardware capabilities of the domain because some IOMMU PTEs will have
> SNP set, others will not.  It depends on the state of the IOMMU domain
> at the time of the mapping.  Therefore the only way to switch back to
> coherent from non-coherent is to unmap and remap everything.  This is
> what legacy KVM device assignment does, but I can't see how that's not
> racy wrt inflight DMA.
> 
> The patch approach I was taking is:
> 
> 1) Enable KVM to handle the VM as non-coherent (which is trued, VFIO
> never sets IOMMU_CACHE currently due to the above issues).
> 
> 2) Get QEMU to enable the KVM device, fixing the coherency issue.
> 
> 3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
> (patch posted).
> 
> 4) Make VFIO always map w/ IOMMU_CACHE
> 
> 5) Create VFIO external user interface to query capabilities.
> 
> 6) Update KVM device to use it.
> 
> As to performance, for graphics I can't tell a difference whether
> NoSnoop is prevented or KVM does WBINVD.  I'm hoping though that we can
> consider the mode enabled by this patch as a temporary step in the
> process and we'll eventually get to a state where we only emulate WBINVD
> when necessary.  Correctness trumps performance in the current round.
> Thanks,

Thanks for the explanation.  Gleb, this looks fine to me.  WDYT?

Paolo

Gleb Natapov Oct. 30, 2013, 4:18 p.m. UTC | #8
On Wed, Oct 30, 2013 at 05:10:37PM +0100, Paolo Bonzini wrote:
> Il 30/10/2013 16:56, Alex Williamson ha scritto:
> > On Wed, 2013-10-30 at 16:42 +0100, Paolo Bonzini wrote:
> >> Il 30/10/2013 16:33, Gleb Natapov ha scritto:
> >>>> Hmm, ok.  In that case I can drop this patch and I think the rest just
> >>>> boils down to userspace use of the device.  I had been close()'ing the
> >>>> kvm device fd when all QEMU vfio devices are detached, but I can just as
> >>>> easily leave it open in case a new device is added later.  I'll send out
> >>>> a new series after doing some more review and testing.  Do you have any
> >>>> comments on the rest of the series?  Thanks,
> >>>
> >>> If I understand 4/4 correctly if there is VFIO device connected we
> >>> assume non coherent domain. How hard it would be to do proper checking
> >>> in this path series?
> >>
> >> Yes, that's my understanding as well.  Is the performance impact measurable?
> > 
> > I have additional patches to do this, the blocker is that intel-iommu
> > strips IOMMU_CACHE from the flags I provide if the IOMMU domain as a
> > whole (ie. all of the hardware units involved in the domain) do not all
> > support Snoop Control (SC).  Thus I can't rely on simply tracking the
> > hardware capabilities of the domain because some IOMMU PTEs will have
> > SNP set, others will not.  It depends on the state of the IOMMU domain
> > at the time of the mapping.  Therefore the only way to switch back to
> > coherent from non-coherent is to unmap and remap everything.  This is
> > what legacy KVM device assignment does, but I can't see how that's not
> > racy wrt inflight DMA.
> > 
> > The patch approach I was taking is:
> > 
> > 1) Enable KVM to handle the VM as non-coherent (which is trued, VFIO
> > never sets IOMMU_CACHE currently due to the above issues).
> > 
> > 2) Get QEMU to enable the KVM device, fixing the coherency issue.
> > 
> > 3) Fix Intel IOMMU to honor IOMMU_CACHE regardless of hw capabilities
> > (patch posted).
> > 
> > 4) Make VFIO always map w/ IOMMU_CACHE
> > 
> > 5) Create VFIO external user interface to query capabilities.
> > 
> > 6) Update KVM device to use it.
> > 
> > As to performance, for graphics I can't tell a difference whether
> > NoSnoop is prevented or KVM does WBINVD.  I'm hoping though that we can
> > consider the mode enabled by this patch as a temporary step in the
> > process and we'll eventually get to a state where we only emulate WBINVD
> > when necessary.  Correctness trumps performance in the current round.
> > Thanks,
> 
> Thanks for the explanation.  Gleb, this looks fine to me.  WDYT?
> 
To me too. Alex, please resend with the first patch dropped
(kvm_vfio_destroy() will have to free dev).


--
			Gleb.

Patch

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index a3a5cb8..9a82426 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1220,7 +1220,6 @@  static void kvmppc_xics_free(struct kvm_device *dev)
 	for (i = 0; i <= xics->max_icsid; i++)
 		kfree(xics->ics[i]);
 	kfree(xics);
-	kfree(dev);
 }
 
 static int kvmppc_xics_create(struct kvm_device *dev, u32 type)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a9dd682..fec8320 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -572,6 +572,7 @@  static void kvm_destroy_devices(struct kvm *kvm)
 
 		list_del(node);
 		dev->ops->destroy(dev);
+		kfree(dev);
 	}
 }
 
@@ -2231,6 +2232,9 @@  static int kvm_device_release(struct inode *inode, struct file *filp)
 	struct kvm_device *dev = filp->private_data;
 	struct kvm *kvm = dev->kvm;
 
+	list_del(&dev->vm_node);
+	dev->ops->destroy(dev);
+	kfree(dev);
 	kvm_put_kvm(kvm);
 	return 0;
 }
@@ -2294,6 +2298,7 @@  static int kvm_ioctl_create_device(struct kvm *kvm,
 	ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev, O_RDWR | O_CLOEXEC);
 	if (ret < 0) {
 		ops->destroy(dev);
+		kfree(dev);
 		return ret;
 	}