[RFC] kvm: reverse call order of kvm_arch_destroy_vm() and kvm_destroy_devices()

Message ID 20220705185430.499688-1-akrowiak@linux.ibm.com (mailing list archive)
State New, archived
Series [RFC] kvm: reverse call order of kvm_arch_destroy_vm() and kvm_destroy_devices()

Commit Message

Anthony Krowiak July 5, 2022, 6:54 p.m. UTC
There is a new requirement for s390 secure execution guests that the
hypervisor ensure all AP queues are reset and disassociated from the
KVM guest before the secure configuration is torn down. It is the
responsibility of the vfio_ap device driver to handle this.

Prior to commit ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"),
the driver reset all AP queues passed through to a KVM guest when notified
that the KVM pointer was being set to NULL. Since that commit, the AP queues
are only reset when the fd for the mediated device used to pass the queues
through to the guest is closed (the vfio_ap_mdev_close_device() callback).
This is not a problem when userspace is well-behaved and uses the
KVM_DEV_VFIO_GROUP_DEL attribute to remove the VFIO group; however, if
userspace for some reason does not close the mdev fd, a secure execution
guest will tear down its configuration before the AP queues are
reset, because the teardown is done in the kvm_arch_destroy_vm() function,
which is invoked prior to kvm_destroy_devices().
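
For reference, the reset is now driven only from the close_device
callback; the following is an approximate sketch of that driver path,
not a verbatim copy of drivers/s390/crypto/vfio_ap_ops.c:

static void vfio_ap_mdev_close_device(struct vfio_device *vdev)
{
	struct ap_matrix_mdev *matrix_mdev =
		container_of(vdev, struct ap_matrix_mdev, vdev);

	/* severs the KVM association and resets the AP queues */
	vfio_ap_mdev_unset_kvm(matrix_mdev);
}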

This patch proposes a simple solution: rather than introducing a new
notifier into vfio or a callback into KVM, what about reversing the order
in which kvm_arch_destroy_vm() and kvm_destroy_devices() are called? In
some very limited testing (i.e., the automated regression tests for
the vfio_ap device driver) this did not seem to cause any problems.

The question remains: is there a good technical reason why the VM
is destroyed before the devices it is using? This is not intuitive, so
this is a request for comments on this proposed patch. The assumption
here is that the mdev fd will get closed when the devices are destroyed.

Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Matthew Rosato July 5, 2022, 7:30 p.m. UTC | #1
On 7/5/22 2:54 PM, Tony Krowiak wrote:
> There is a new requirement for s390 secure execution guests that the
> hypervisor ensure all AP queues are reset and disassociated from the
> KVM guest before the secure configuration is torn down. It is the
> responsibility of the vfio_ap device driver to handle this.
> 
> Prior to commit ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"),
> the driver reset all AP queues passed through to a KVM guest when notified
> that the KVM pointer was being set to NULL. Since that commit, the AP queues
> are only reset when the fd for the mediated device used to pass the queues
> through to the guest is closed (the vfio_ap_mdev_close_device() callback).
> This is not a problem when userspace is well-behaved and uses the
> KVM_DEV_VFIO_GROUP_DEL attribute to remove the VFIO group; however, if
> userspace for some reason does not close the mdev fd, a secure execution
> guest will tear down its configuration before the AP queues are
> reset, because the teardown is done in the kvm_arch_destroy_vm() function,
> which is invoked prior to kvm_destroy_devices().

To clarify, even before "vfio: remove VFIO_GROUP_NOTIFY_SET_KVM", if 
userspace did not delete the group via KVM_DEV_VFIO_GROUP_DEL, then the 
old callback would also not have been triggered until 
kvm_destroy_devices() anyway (the callback would have been triggered 
with a NULL kvm pointer via a call from kvm_vfio_destroy(), itself 
invoked from kvm_destroy_devices()).
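
For reference, the old path looked roughly like this (reconstructed
from memory of the pre-removal virt/kvm/vfio.c, so treat the details
as approximate rather than verbatim):

static void kvm_vfio_destroy(struct kvm_device *dev)
{
	struct kvm_vfio *kv = dev->private;
	struct kvm_vfio_group *kvg, *tmp;

	/* runs from kvm_destroy_devices() */
	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
		/* fires VFIO_GROUP_NOTIFY_SET_KVM with a NULL kvm */
		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
		...
	}
	...
}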

My point being: this behavior did not start with "vfio: remove 
VFIO_GROUP_NOTIFY_SET_KVM"; that patch just removed the notifier since 
both actions always took place at device open/close time anyway.  So if 
destroying the devices before the VM isn't doable, a new 
notifier/whatever that sets the KVM association to NULL would also have 
to happen at an earlier point in time than VFIO_GROUP_NOTIFY_SET_KVM did 
(and should maybe be something that is optional/opt-in and used only by 
vfio drivers that need it to clean up a KVM association at a point prior 
to the device being destroyed).  There should still be no need for any 
sort of notifier to set the (non-NULL) KVM association, as it's already 
associated with the vfio group before device_open.

But let's first see if anyone can shed some understanding on the 
ordering between kvm_arch_destroy_vm and kvm_destroy_devices...

> [..]
Anthony Krowiak July 18, 2022, 2:11 p.m. UTC | #2
PING!

On 7/5/22 2:54 PM, Tony Krowiak wrote:
> [..]
Anthony Krowiak July 27, 2022, 7 p.m. UTC | #3
Any Takers??????

On 7/5/22 2:54 PM, Tony Krowiak wrote:
> [..]
Halil Pasic Aug. 1, 2022, 11:53 a.m. UTC | #4
On Wed, 27 Jul 2022 15:00:02 -0400
Anthony Krowiak <akrowiak@linux.ibm.com> wrote:

> Any Takers??????
> 
> On 7/5/22 2:54 PM, Tony Krowiak wrote:
> > There is a new requirement for s390 secure execution guests that the
> > hypervisor ensure all AP queues are reset and disassociated from the
> > KVM guest before the secure configuration is torn down. It is the
> > responsibility of the vfio_ap device driver to handle this.
> >
> > Prior to commit ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"),
> > the driver reset all AP queues passed through to a KVM guest when notified
> > that the KVM pointer was being set to NULL. Since that commit, the AP queues
> > are only reset when the fd for the mediated device used to pass the queues
> > through to the guest is closed (the vfio_ap_mdev_close_device() callback).
> > This is not a problem when userspace is well-behaved and uses the
> > KVM_DEV_VFIO_GROUP_DEL attribute to remove the VFIO group; however, if
> > userspace for some reason does not close the mdev fd, a secure execution
> > guest will tear down its configuration before the AP queues are
> > reset, because the teardown is done in the kvm_arch_destroy_vm() function,
> > which is invoked prior to kvm_destroy_devices().

As Matt has pointed out: we did not have the guarantee we need prior
to that commit. For the next version, please drop the digression about
the old behavior.

> >
> > This patch proposes a simple solution: rather than introducing a new
> > notifier into vfio or a callback into KVM, what about reversing the order
> > in which kvm_arch_destroy_vm() and kvm_destroy_devices() are called? In
> > some very limited testing (i.e., the automated regression tests for
> > the vfio_ap device driver) this did not seem to cause any problems.
> >
> > The question remains: is there a good technical reason why the VM
> > is destroyed before the devices it is using? This is not intuitive, so
> > this is a request for comments on this proposed patch. The assumption
> > here is that the mdev fd will get closed when the devices are destroyed.

I did some digging! The function and the corresponding mechanism were
introduced by 07f0a7bdec5c ("kvm: destroy emulated devices on VM
exit"). Before that patch we used to have ref-counting, and the refcount
got decremented in kvmppc_mpic_disconnect_vcpu(), which in turn was
called by kvm_arch_vcpu_free(). So this was basically arch-specific
stuff. For power (the patch came from power) the refcount was decremented
before calling kvmppc_core_vcpu_free(). So I conclude the old scheme
would have worked for us.

Since the patch does not state any technical reasons, my guess is that
the choice was made somewhat arbitrarily under the assumption that
there are no requirements or dependencies with regard to the destruction
of devices or with regard to severing the connection between
the devices and the VM. Under these assumptions, placing the
invocation of kvm_destroy_devices() after kvm_arch_destroy_vm()
made sense, because if something destroyed in kvm_arch_destroy_vm()
held a live reference to a device, that reference would be cleaned
up before kvm_destroy_devices() is invoked. So basically, unless the
devices hold references to each other, things look good. If the
positions of kvm_arch_destroy_vm() and kvm_destroy_devices() are
swapped, then we basically need to assume that nothing destroyed
in kvm_arch_destroy_vm() may logically hold a live reference (remember,
the refcount is gone, but pointers may still exist) to a kvm device.
Does that hold? @Anthony, maybe you can answer this question for us...
Otherwise I will continue the digging from here, eventually.
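
To illustrate the kind of hazard I mean, here is a purely hypothetical
example (the structure and helper names are made up for illustration,
not taken from the tree):

/* An arch-owned object that caches a bare pointer to a kvm device. */
struct arch_irq_router {
	struct kvm_device *flic;	/* no refcount, just a pointer */
};

static void arch_irq_router_teardown(struct arch_irq_router *r)
{
	/*
	 * Called from kvm_arch_destroy_vm(). With the proposed order,
	 * kvm_destroy_devices() has already freed r->flic, so this
	 * would be a use-after-free.
	 */
	if (r->flic)
		flush_pending_irqs(r->flic);	/* hypothetical helper */
}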

Also I have concerns about the following comments:

static void kvm_destroy_devices(struct kvm *kvm)
{
        struct kvm_device *dev, *tmp;

        /*
         * We do not need to take the kvm->lock here, because nobody else
         * has a reference to the struct kvm at this point and therefore
         * cannot access the devices list anyhow.
[..]

Would this still hold when the order is changed?

struct kvm_device_ops {
[..]
        /*
         * Destroy is responsible for freeing dev.
         *
         * Destroy may be called before or after destructors are called
         * on emulated I/O regions, depending on whether a reference is
         * held by a vcpu or other kvm component that gets destroyed
         * after the emulated I/O.
         */
        void (*destroy)(struct kvm_device *dev);

This seems to document the order of things as is.

Btw I would like to understand more about the lifecycle of these
emulated I/O regions....

@Paolo: I believe this is ultimately your turf. I'm just digging
through the code and the history to try to help along with this. We
definitely need a solution for our problem. We would very much appreciate
having your opinion!

Regards,
Halil

> > [..]
Anthony Krowiak Aug. 11, 2022, 2:39 p.m. UTC | #5
On 8/1/22 7:53 AM, Halil Pasic wrote:
> On Wed, 27 Jul 2022 15:00:02 -0400
> Anthony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>> Any Takers??????
>>
>> [..]
> I did some digging! The function and the corresponding mechanism were
> introduced by 07f0a7bdec5c ("kvm: destroy emulated devices on VM
> exit"). Before that patch we used to have ref-counting, and the refcount
> got decremented in kvmppc_mpic_disconnect_vcpu(), which in turn was
> called by kvm_arch_vcpu_free(). So this was basically arch-specific
> stuff. For power (the patch came from power) the refcount was decremented
> before calling kvmppc_core_vcpu_free(). So I conclude the old scheme
> would have worked for us.
>
> Since the patch does not state any technical reasons, my guess is that
> the choice was made somewhat arbitrarily under the assumption that
> there are no requirements or dependencies with regard to the destruction
> of devices or with regard to severing the connection between
> the devices and the VM. Under these assumptions, placing the
> invocation of kvm_destroy_devices() after kvm_arch_destroy_vm()
> made sense, because if something destroyed in kvm_arch_destroy_vm()
> held a live reference to a device, that reference would be cleaned
> up before kvm_destroy_devices() is invoked. So basically, unless the
> devices hold references to each other, things look good. If the
> positions of kvm_arch_destroy_vm() and kvm_destroy_devices() are
> swapped, then we basically need to assume that nothing destroyed
> in kvm_arch_destroy_vm() may logically hold a live reference (remember,
> the refcount is gone, but pointers may still exist) to a kvm device.
> Does that hold? @Anthony, maybe you can answer this question for us...


I do not have an answer for this without doing a deep dive into the 
code. I am not very familiar with the VM lifecycle. My hope was that 
someone who knows this area would respond to this RFC. I am copying the 
Signed-off-by email addresses from the patch (07f0a7bdec) you mentioned 
above; maybe they can provide some insight into their choice of ordering 
of the kvm_arch_destroy_vm() and kvm_destroy_devices() functions.



> [..]

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a49df8988cd6..edaf2918be9b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1248,8 +1248,8 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #else
 	kvm_flush_shadow_all(kvm);
 #endif
-	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
+	kvm_arch_destroy_vm(kvm);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);