
[v2] KVM: arm/arm64: Handle hva aging while destroying the vm

Message ID 20170705085700.GA16881@e107814-lin.cambridge.arm.com (mailing list archive)
State New, archived

Commit Message

Suzuki K Poulose July 5, 2017, 8:57 a.m. UTC
Hi Alex,

On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> The kvm_age_hva callback may be called all the way concurrently while
> kvm_mmu_notifier_release() is running.
> 
> The release function sets kvm->arch.pgd = NULL which the aging function
> however implicitly relies on in stage2_get_pud(). That means they can
> race and the aging function may dereference a NULL pgd pointer.
> 
> This patch adds a check for that case, so that we leave the aging
> function silently.
> 
> Cc: stable@vger.kernel.org
> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> Signed-off-by: Alexander Graf <agraf@suse.de>
> 
> ---
> 
> v1 -> v2:
> 
>   - Fix commit message
>   - Add Fixes and stable tags
> ---
>  virt/kvm/arm/mmu.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index f2d5b6c..227931f 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pgd_t *pgd;
>  	pud_t *pud;
>  
> +	/* Do we clash with kvm_free_stage2_pgd()? */
> +	if (!kvm->arch.pgd)
> +		return NULL;
> +

I think this check should be moved up in the chain. We call kvm_age_hva() with
the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
if we find the PGD is null when we reach kvm_age_hva(), we could simply return
there, like we do for the other callbacks from the KVM mmu_notifier.


----8>----

kvm-arm: Handle hva aging while destroying vm

The mmu_notifier_release() callback of KVM triggers cleaning up
the stage2 page table on kvm-arm. However, there could be other
notifier callbacks running in parallel with mmu_notifier_release(),
which could cause those callbacks to end up operating on an empty
stage2 page table. Make sure we check for this in all the notifier
callbacks.

Reported-by: Alex Graf <agraf@suse.de>
Cc: Christoffer Dall <cdall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: commit 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 virt/kvm/arm/mmu.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Alexander Graf July 6, 2017, 7:07 a.m. UTC | #1
On 05.07.17 10:57, Suzuki K Poulose wrote:
> Hi Alex,
> 
> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>> The kvm_age_hva callback may be called all the way concurrently while
>> kvm_mmu_notifier_release() is running.
>>
>> The release function sets kvm->arch.pgd = NULL which the aging function
>> however implicitly relies on in stage2_get_pud(). That means they can
>> race and the aging function may dereference a NULL pgd pointer.
>>
>> This patch adds a check for that case, so that we leave the aging
>> function silently.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>
>> ---
>>
>> v1 -> v2:
>>
>>    - Fix commit message
>>    - Add Fixes and stable tags
>> ---
>>   virt/kvm/arm/mmu.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index f2d5b6c..227931f 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>>   	pgd_t *pgd;
>>   	pud_t *pud;
>>   
>> +	/* Do we clash with kvm_free_stage2_pgd()? */
>> +	if (!kvm->arch.pgd)
>> +		return NULL;
>> +
> 
> I think this check should be moved up in the chain. We call kvm_age_hva(), with
> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
> there, like we do for other call backs from the KVM mmu_notifier.

That probably works too - I'm not sure which version is more consistent 
as well as more maintainable in the long run. I'll leave the call here 
to Christoffer.


Alex
Christoffer Dall July 6, 2017, 7:45 a.m. UTC | #2
On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
> 
> 
> On 05.07.17 10:57, Suzuki K Poulose wrote:
> >Hi Alex,
> >
> >On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> >>The kvm_age_hva callback may be called all the way concurrently while
> >>kvm_mmu_notifier_release() is running.
> >>
> >>The release function sets kvm->arch.pgd = NULL which the aging function
> >>however implicitly relies on in stage2_get_pud(). That means they can
> >>race and the aging function may dereference a NULL pgd pointer.
> >>
> >>This patch adds a check for that case, so that we leave the aging
> >>function silently.
> >>
> >>Cc: stable@vger.kernel.org
> >>Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> >>Signed-off-by: Alexander Graf <agraf@suse.de>
> >>
> >>---
> >>
> >>v1 -> v2:
> >>
> >>   - Fix commit message
> >>   - Add Fixes and stable tags
> >>---
> >>  virt/kvm/arm/mmu.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>index f2d5b6c..227931f 100644
> >>--- a/virt/kvm/arm/mmu.c
> >>+++ b/virt/kvm/arm/mmu.c
> >>@@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> >>  	pgd_t *pgd;
> >>  	pud_t *pud;
> >>+	/* Do we clash with kvm_free_stage2_pgd()? */
> >>+	if (!kvm->arch.pgd)
> >>+		return NULL;
> >>+
> >
> >I think this check should be moved up in the chain. We call kvm_age_hva(), with
> >the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
> >if we find the PGD is null when we reach kvm_age_hva(), we could simply return
> >there, like we do for other call backs from the KVM mmu_notifier.
> 
> That probably works too - I'm not sure which version is more
> consistent as well as more maintainable in the long run. I'll leave
> the call here to Christoffer.
> 

Let's look at the callers to stage2_get_pmd, which is the only caller of
stage2_get_pud, where the problem was observed:

  user_mem_abort
   -> stage2_set_pmd_huge
      -> stage2_get_pmd
  
  user_mem_abort
   -> stage2_set_pte
      -> stage2_get_pmd

  handle_access_fault
   -> stage2_get_pmd

For the above three functions, pgd cannot ever be NULL, because this is
running in the context of a VCPU thread, which means the reference on
the VM fd cannot reach zero, so there is no need for that check here.

  kvm_set_spte_handler
   -> stage2_set_pte
      -> stage2_get_pmd
 
This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
already checks for !kvm->arch.pgd.

  kvm_phys_addr_ioremap
   -> stage2_set_pte
      -> stage2_get_pmd

This is called from two places: (1) The VGIC code (as part of
vgic_v2_map_resources) and can only be called in the context of running
a VCPU, so the pgd cannot be null by virtue of the same argument as for
user_mem_abort. (2) kvm_arch_prepare_memory_region calls
kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
how the VM can be in the middle of being freed while handling ioctls on
the fd.  Therefore, following the same argument, this should be safe as
well.

  kvm_age_hva_handler and kvm_test_age_hva_handler
   -> stage2_get_pmd

Handled by the patch proposed by Suzuki.

What does all that tell us?  First, it should give us some comfort that we
don't have more races of this kind.  Second, it tells us that there are
a number of different and not-obvious call paths to stage2_get_pud,
which could be an argument to simply check the pgd whenever it's called,
despite the minor runtime overhead.  On the other hand, the check itself
is only valid knowing that we synchronize against kvm_free_stage2_pgd
using the kvm->mmu_lock and understanding that this only happens when
mmu notifiers call into the KVM MMU code outside the context of the VM.

The last consideration is the winning argument for me to put the check
in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
important that we document why it's only these three high-level callers
(incl. kvm_set_spte_handler) that need the check, either in the code or
in the commit message.

Marc, thoughts?

Thanks,
-Christoffer
Christoffer Dall July 6, 2017, 8:14 a.m. UTC | #3
On Wed, Jul 05, 2017 at 09:57:00AM +0100, Suzuki K Poulose wrote:
> Hi Alex,
> 
> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> > The kvm_age_hva callback may be called all the way concurrently while
> > kvm_mmu_notifier_release() is running.
> > 
> > The release function sets kvm->arch.pgd = NULL which the aging function
> > however implicitly relies on in stage2_get_pud(). That means they can
> > race and the aging function may dereference a NULL pgd pointer.
> > 
> > This patch adds a check for that case, so that we leave the aging
> > function silently.
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> > Signed-off-by: Alexander Graf <agraf@suse.de>
> > 
> > ---
> > 
> > v1 -> v2:
> > 
> >   - Fix commit message
> >   - Add Fixes and stable tags
> > ---
> >  virt/kvm/arm/mmu.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> > index f2d5b6c..227931f 100644
> > --- a/virt/kvm/arm/mmu.c
> > +++ b/virt/kvm/arm/mmu.c
> > @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> >  	pgd_t *pgd;
> >  	pud_t *pud;
> >  
> > +	/* Do we clash with kvm_free_stage2_pgd()? */
> > +	if (!kvm->arch.pgd)
> > +		return NULL;
> > +
> 
> I think this check should be moved up in the chain. We call kvm_age_hva(), with
> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
> there, like we do for other call backs from the KVM mmu_notifier.
> 
> 
> ----8>----
> 
> kvm-arm: Handle hva aging while destroying vm
> 
> The mmu_notifier_release() callback of KVM triggers cleaning up
> the stage2 page table on kvm-arm. However, there could be other
> notifier callbacks running in parallel with mmu_notifier_release(),
> which could cause those callbacks to end up operating on an empty
> stage2 page table. Make sure we check for this in all the notifier
> callbacks.
> 
> Reported-by: Alex Graf <agraf@suse.de>
> Cc: Christoffer Dall <cdall@linaro.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: stable@vger.kernel.org
> Fixes: commit 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index e2e5eff..db1c7b2 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1665,12 +1665,16 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
>  
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
>  {
> +	if (!kvm->arch.pgd)
> +		return 0;
>  	trace_kvm_age_hva(start, end);
>  	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
>  }
>  
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
>  {
> +	if (!kvm->arch.pgd)
> +		return 0;
>  	trace_kvm_test_age_hva(hva);
>  	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
>  }
> -- 
> 2.7.5
> 

For this patch:

Reviewed-by: Christoffer Dall <cdall@linaro.org>
Andrea Arcangeli July 6, 2017, 9:31 a.m. UTC | #4
Hello,

On Thu, Jul 06, 2017 at 09:45:13AM +0200, Christoffer Dall wrote:
> Let's look at the callers to stage2_get_pmd, which is the only caller of
> stage2_get_pud, where the problem was observed:
> 
>   user_mem_abort
>    -> stage2_set_pmd_huge
>       -> stage2_get_pmd
>   
>   user_mem_abort
>    -> stage2_set_pte
>       -> stage2_get_pmd
> 
>   handle_access_fault
>    -> stage2_get_pmd
> 
> For the above three functions, pgd cannot ever be NULL, because this is
> running in the context of a VCPU thread, which means the reference on
> the VM fd must not reach zero, so no need to call that here.

Just a minor nitpick: the !pgd bypass is necessary even before the refcount on
the KVM fd technically reaches zero.

exit_mm->mmput->exit_mmap() will invoke the __mmu_notifier_release
even if the KVM fd isn't zero yet.

This is because the secondary MMU page faults must be shutdown before
freeing the guest RAM (nothing can call handle_mm_fault or any
get_user_pages after mm->mm_users == 0), regardless if
mmu_notifier_unregister hasn't been called yet (i.e. if the /dev/kvm
fd is still open).

Usually the fd is closed immediately after exit_mmap, as exit_files is
called shortly after exit_mm() but there's a common window where the
fd is still open but the !pgd check is already necessary (plus the fd
could in theory be passed to other processes).

> using the kvm->mmu_lock() and understanding that this only happens when
> mmu notifiers call into the KVM MMU code outside the context of the VM.

Agreed.

The other arches don't need any special check to serialize against
kvm_mmu_notifier_release; they're just looking up shadow pagetables
through the spte rmap (and they'll find nothing if
kvm_mmu_notifier_release has already run).

In theory it would make more sense to put the overhead in the slow
path by adding a mutex to the mmu_notifier struct and then using that
to solve the race between mmu_notifier_release and
mmu_notifier_unregister: hlist_del_init to unhash the mmu
notifier, then synchronize_srcu, before calling ->release
while holding the mutex. However that's going to be marginally slower
for the other arches.

In practice I doubt this is measurable, and getting away with one less
mutex in mmu_notifier_release vs mmu_notifier_unregister sounds
simpler, but comments welcome...
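
A very rough sketch of the slow-path alternative described above, purely
hypothetical: the release_mutex field and the helper name are invented for
illustration, the global srcu struct is assumed to be the one mm/mmu_notifier.c
already uses, and the real code would differ:

```c
/* Hypothetical sketch, not a real patch: serialize mmu_notifier_release()
 * against mmu_notifier_unregister() with a per-notifier mutex, so that
 * ->release() is invoked at most once, after the notifier is unhashed
 * and concurrent SRCU readers have drained. */
struct mmu_notifier {
	/* ... existing fields, including the hlist node and ops ... */
	struct mutex release_mutex;		/* invented for this sketch */
};

static void mmu_notifier_release_one(struct mmu_notifier *mn,
				     struct mm_struct *mm)
{
	mutex_lock(&mn->release_mutex);
	if (!hlist_unhashed(&mn->hlist)) {
		hlist_del_init_rcu(&mn->hlist);	/* unhash the notifier */
		synchronize_srcu(&srcu);	/* wait out concurrent callers */
		if (mn->ops->release)
			mn->ops->release(mn, mm);	/* at most once */
	}
	mutex_unlock(&mn->release_mutex);
}
```

This would move the serialization burden into the notifier core for every
arch, which is the "marginally slower for the other arches" trade-off noted
above.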

Thanks,
Andrea
Suzuki K Poulose July 6, 2017, 9:34 a.m. UTC | #5
On 06/07/17 08:45, Christoffer Dall wrote:
> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
>>
>>
>> On 05.07.17 10:57, Suzuki K Poulose wrote:
>>> Hi Alex,
>>>
>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>>>> The kvm_age_hva callback may be called all the way concurrently while
>>>> kvm_mmu_notifier_release() is running.
>>>>
>>>> The release function sets kvm->arch.pgd = NULL which the aging function
>>>> however implicitly relies on in stage2_get_pud(). That means they can
>>>> race and the aging function may dereference a NULL pgd pointer.
>>>>
>>>> This patch adds a check for that case, so that we leave the aging
>>>> function silently.
>>>>
>>>> Cc: stable@vger.kernel.org
>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>
>>>> ---
>>>>
>>>> v1 -> v2:
>>>>
>>>>   - Fix commit message
>>>>   - Add Fixes and stable tags
>>>> ---
>>>>  virt/kvm/arm/mmu.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>> index f2d5b6c..227931f 100644
>>>> --- a/virt/kvm/arm/mmu.c
>>>> +++ b/virt/kvm/arm/mmu.c
>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>>>>  	pgd_t *pgd;
>>>>  	pud_t *pud;
>>>> +	/* Do we clash with kvm_free_stage2_pgd()? */
>>>> +	if (!kvm->arch.pgd)
>>>> +		return NULL;
>>>> +
>>>
>>> I think this check should be moved up in the chain. We call kvm_age_hva(), with
>>> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
>>> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
>>> there, like we do for other call backs from the KVM mmu_notifier.
>>
>> That probably works too - I'm not sure which version is more
>> consistent as well as more maintainable in the long run. I'll leave
>> the call here to Christoffer.
>>
>
> Let's look at the callers to stage2_get_pmd, which is the only caller of
> stage2_get_pud, where the problem was observed:
>
>   user_mem_abort
>    -> stage2_set_pmd_huge
>       -> stage2_get_pmd
>
>   user_mem_abort
>    -> stage2_set_pte
>       -> stage2_get_pmd
>
>   handle_access_fault
>    -> stage2_get_pmd
>
> For the above three functions, pgd cannot ever be NULL, because this is
> running in the context of a VCPU thread, which means the reference on
> the VM fd must not reach zero, so no need to call that here.

I think there is some problem here.  See below for more information.

>
>   kvm_set_spte_handler
>    -> stage2_set_pte
>       -> stage2_get_pmd
>
> This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
> so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
> already checks for !kvm->arch.pgd.
>
>   kvm_phys_addr_ioremap
>    -> stage2_set_pte
>       -> stage2_get_pmd
>
> This is called from two places: (1) The VGIC code (as part of
> vgic_v2_map_resources) and can only be called in the context of running
> a VCPU, so the pgd cannot be null by virtue of the same argument as for
> user_mem_abort. (2) kvm_arch_prepare_memory_region calls
> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
> how the VM can be in the middle of being freed while handling ioctls on
> the fd.  Therefore, following the same argument, this should be safe as
> well.
>
>   kvm_age_hva_handler and kvm_test_age_hva_handler
>    -> stage2_get_pmd
>
> Handled by the patch proposed by Suzuki.
>
> What does all that tell us?  First, it should give us some comfort that we
> don't have more races of this kind.  Second, it tells us that there are
> a number of different and not-obvious call paths to stage2_get_pud,
> which could be an argument to simply check the pgd whenever it's called,
> despite the minor runtime overhead.  On the other hand, the check itself
> is only valid knowing that we synchronize against kvm_free_stage2_pgd
> using the kvm->mmu_lock() and understanding that this only happens when
> mmu notifiers call into the KVM MMU code outside the context of the VM.
>
> The last consideration is the winning argument for me to put the check
> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
> important that we document why it's only these three high-level callers
> (incl. kvm_set_spte_handler) that need the check, either in the code or
> in the commit message.

The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
which could be triggered via two different paths.

1) kvm_destroy_vm(), where all the VM resources have been released and the
refcount on the KVM instance is dropped, via kvm_put_kvm().

kvm_put_kvm()
  kvm_destroy_vm()
    mmu_notifier_unregister
      mmu_notifier_ops->release()
        kvm_arch_flush_shadow_all
          kvm_free_stage2_pgd -> free the page table with the mmu_lock held
                                 occasionally releasing it to avoid contention.
or

2) do_signal -> get_signal -> do_group_exit - >
      do_exit
         exit_mm
          mmput => __mmput
            exit_mmap
              mmu_notifier_release
	       mmu_notifier_ops->release
                  kvm_arch_flush_shadow_all
                    kvm_free_stage2_pgd

In the first case, all references to the VM are dropped and hence none of the
VCPU could still be executing.

However, in the second case it may not be. So we have a potential problem with
the VCPU trying to run even after the pages have been unmapped. I think the root
cause of all these issues boils down to the assumption that KVM holds a reference
to the MM (which is not necessarily a reference to the user space mapping, i.e.
mmgrab vs mmget). I am not sure if the VCPU should hold a reference to the mmaps
to make sure it is safe to run. That way, the mmap stays until the VCPU eventually
exits and we are safe all the way around.

Cheers
Suzuki
Christoffer Dall July 6, 2017, 9:42 a.m. UTC | #6
On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
> On 06/07/17 08:45, Christoffer Dall wrote:
> >On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
> >>
> >>
> >>On 05.07.17 10:57, Suzuki K Poulose wrote:
> >>>Hi Alex,
> >>>
> >>>On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> >>>>The kvm_age_hva callback may be called all the way concurrently while
> >>>>kvm_mmu_notifier_release() is running.
> >>>>
> >>>>The release function sets kvm->arch.pgd = NULL which the aging function
> >>>>however implicitly relies on in stage2_get_pud(). That means they can
> >>>>race and the aging function may dereference a NULL pgd pointer.
> >>>>
> >>>>This patch adds a check for that case, so that we leave the aging
> >>>>function silently.
> >>>>
> >>>>Cc: stable@vger.kernel.org
> >>>>Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> >>>>Signed-off-by: Alexander Graf <agraf@suse.de>
> >>>>
> >>>>---
> >>>>
> >>>>v1 -> v2:
> >>>>
> >>>>  - Fix commit message
> >>>>  - Add Fixes and stable tags
> >>>>---
> >>>> virt/kvm/arm/mmu.c | 4 ++++
> >>>> 1 file changed, 4 insertions(+)
> >>>>
> >>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>index f2d5b6c..227931f 100644
> >>>>--- a/virt/kvm/arm/mmu.c
> >>>>+++ b/virt/kvm/arm/mmu.c
> >>>>@@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> >>>> 	pgd_t *pgd;
> >>>> 	pud_t *pud;
> >>>>+	/* Do we clash with kvm_free_stage2_pgd()? */
> >>>>+	if (!kvm->arch.pgd)
> >>>>+		return NULL;
> >>>>+
> >>>
> >>>I think this check should be moved up in the chain. We call kvm_age_hva(), with
> >>>the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
> >>>if we find the PGD is null when we reach kvm_age_hva(), we could simply return
> >>>there, like we do for other call backs from the KVM mmu_notifier.
> >>
> >>That probably works too - I'm not sure which version is more
> >>consistent as well as more maintainable in the long run. I'll leave
> >>the call here to Christoffer.
> >>
> >
> >Let's look at the callers to stage2_get_pmd, which is the only caller of
> >stage2_get_pud, where the problem was observed:
> >
> >  user_mem_abort
> >   -> stage2_set_pmd_huge
> >      -> stage2_get_pmd
> >
> >  user_mem_abort
> >   -> stage2_set_pte
> >      -> stage2_get_pmd
> >
> >  handle_access_fault
> >   -> stage2_get_pmd
> >
> >For the above three functions, pgd cannot ever be NULL, because this is
> >running in the context of a VCPU thread, which means the reference on
> >the VM fd must not reach zero, so no need to call that here.
> 
> I think there is some problem here.  See below for more information.
> 
> >
> >  kvm_set_spte_handler
> >   -> stage2_set_pte
> >      -> stage2_get_pmd
> >
> >This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
> >so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
> >already checks for !kvm->arch.pgd.
> >
> >  kvm_phys_addr_ioremap
> >   -> stage2_set_pte
> >      -> stage2_get_pmd
> >
> >This is called from two places: (1) The VGIC code (as part of
> >vgic_v2_map_resources) and can only be called in the context of running
> >a VCPU, so the pgd cannot be null by virtue of the same argument as for
> >user_mem_abort. (2) kvm_arch_prepare_memory_region calls
> >kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
> >how the VM can be in the middle of being freed while handling ioctls on
> >the fd.  Therefore, following the same argument, this should be safe as
> >well.
> >
> >  kvm_age_hva_handler and kvm_test_age_hva_handler
> >   -> stage2_get_pmd
> >
> >Handled by the patch proposed by Suzuki.
> >
> >What does all that tell us?  First, it should give us some comfort that we
> >don't have more races of this kind.  Second, it tells us that there are
> >a number of different and not-obvious call paths to stage2_get_pud,
> >which could be an argument to simply check the pgd whenever it's called,
> >despite the minor runtime overhead.  On the other hand, the check itself
> >is only valid knowing that we synchronize against kvm_free_stage2_pgd
> >using the kvm->mmu_lock() and understanding that this only happens when
> >mmu notifiers call into the KVM MMU code outside the context of the VM.
> >
> >The last consideration is the winning argument for me to put the check
> >in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
> >important that we document why it's only these three high-level callers
> >(incl. kvm_set_spte_handler) that need the check, either in the code or
> >in the commit message.
> 
> The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
> which could be triggered via two different paths.
> 
> 1) kvm_destroy_vm(), where all the VM resources have been released and the
> refcount on the KVM instance is dropped, via kvm_put_kvm().
> 
> kvm_put_kvm()
>  kvm_destroy_vm()
>    mmu_notifier_unregister
>      mmu_notifier_ops->release()
>        kvm_arch_flush_shadow_all
>          kvm_free_stage2_pgd -> free the page table with the mmu_lock held
>                                 occasionally releasing it to avoid contention.
> or
> 
> 2) do_signal -> get_signal -> do_group_exit - >
>      do_exit
>         exit_mm
>          mmput => __mmput
>            exit_mmap
>              mmu_notifier_release
> 	       mmu_notifier_ops->release
>                  kvm_arch_flush_shadow_all
>                    kvm_free_stage2_pgd
> 
> In the first case, all references to the VM are dropped and hence none of the
> VCPU could still be executing.
> 
> However, in the second case it may not be. So we have a potential problem with
> the VCPU trying to run even when the pages were unmapped. I think the root cause
> of all these issues boils down to the assumption that KVM holds a reference to
> MM (which is not necessarily the user space mapping. i.e mmgrab vs mmget).
> I am not sure if the VCPU should hold a reference to the mmaps to make sure
> it is safe to run. That way, the mmap stays until the VCPU eventually exits
> and we are safe all the way around.

Hmmm, my assumption is that if a VCPU is running, there is a running VCPU
thread that shares the struct mm, so I don't understand how mmput would be
able to call exit_mmap in the scenario above.

So the distinction here is that I don't assume that the VCPU fd holds a
reference to the mm, but I assume that the (running) VCPU thread does.
Is this incorrect?

Even if the VCPU thread is the only thread using the struct mm, I still
don't understand how this happens, because we can't be both handling
signals and be processing exits from the VM at the same time, can we?

Thanks,
-Christoffer
Christoffer Dall July 6, 2017, 9:43 a.m. UTC | #7
Hi Andrea,

On Thu, Jul 06, 2017 at 11:31:26AM +0200, Andrea Arcangeli wrote:
> Hello,
> 
> On Thu, Jul 06, 2017 at 09:45:13AM +0200, Christoffer Dall wrote:
> > Let's look at the callers to stage2_get_pmd, which is the only caller of
> > stage2_get_pud, where the problem was observed:
> > 
> >   user_mem_abort
> >    -> stage2_set_pmd_huge
> >       -> stage2_get_pmd
> >   
> >   user_mem_abort
> >    -> stage2_set_pte
> >       -> stage2_get_pmd
> > 
> >   handle_access_fault
> >    -> stage2_get_pmd
> > 
> > For the above three functions, pgd cannot ever be NULL, because this is
> > running in the context of a VCPU thread, which means the reference on
> > the VM fd must not reach zero, so no need to call that here.
> 
> Just a minor nitpick: the !pgd bypass is necessary before the KVM fd
> technically reaches zero.
> 
> exit_mm->mmput->exit_mmap() will invoke the __mmu_notifier_release
> even if the KVM fd isn't zero yet.

But is exit_mm possible when you have VCPU *threads* running in the VCPU
KVM_RUN ioctl?

> 
> This is because the secondary MMU page faults must be shutdown before
> freeing the guest RAM (nothing can call handle_mm_fault or any
> get_user_pages after mm->mm_users == 0), regardless if
> mmu_notifier_unregister hasn't been called yet (i.e. if the /dev/kvm
> fd is still open).
> 
> Usually the fd is closed immediately after exit_mmap, as exit_files is
> called shortly after exit_mm() but there's a common window where the
> fd is still open but the !pgd check is already necessary (plus the fd
> could in theory be passed to other processes).
> 
> > using the kvm->mmu_lock() and understanding that this only happens when
> > mmu notifiers call into the KVM MMU code outside the context of the VM.
> 
> Agreed.
> 
> The other arches don't need any special check to serialize against
> kvm_mmu_notifier_release, they're just looking up shadow pagetables
> through spte rmap (and they'll find nothing if
> kvm_mmu_notifier_release already run).
> 
> In theory it would make more sense to put the overhead in the slow
> path by adding a mutex to the mmu_notifier struct and then using that
> to solve the race between mmu_notifier_release and
> mmu_notifier_unregister, and then to hlist_del_init to unhash the mmu
> notifier and then to call synchronize_srcu, before calling ->release
> while holding some mutex. However that's going to be marginally slower
> for the other arches.
> 
> In practice I doubt this is measurable and getting away with one less
> mutex in mmu notifier_release vs mmu_notifier_unregister sounds
> simpler but comments welcome...
> 

I think just checking the pgd pointer under the mmu_lock is completely
fine for arm/arm64, as long as we understand what's going on.

Thanks,
-Christoffer
Suzuki K Poulose July 14, 2017, 4:40 p.m. UTC | #8
On 06/07/17 10:42, Christoffer Dall wrote:
> On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
>> On 06/07/17 08:45, Christoffer Dall wrote:
>>> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
>>>>
>>>>
>>>> On 05.07.17 10:57, Suzuki K Poulose wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>>>>>> The kvm_age_hva callback may be called all the way concurrently while
>>>>>> kvm_mmu_notifier_release() is running.
>>>>>>
>>>>>> The release function sets kvm->arch.pgd = NULL which the aging function
>>>>>> however implicitly relies on in stage2_get_pud(). That means they can
>>>>>> race and the aging function may dereference a NULL pgd pointer.
>>>>>>
>>>>>> This patch adds a check for that case, so that we leave the aging
>>>>>> function silently.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>>>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> v1 -> v2:
>>>>>>
>>>>>>  - Fix commit message
>>>>>>  - Add Fixes and stable tags
>>>>>> ---
>>>>>> virt/kvm/arm/mmu.c | 4 ++++
>>>>>> 1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>>>> index f2d5b6c..227931f 100644
>>>>>> --- a/virt/kvm/arm/mmu.c
>>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>>>>>> 	pgd_t *pgd;
>>>>>> 	pud_t *pud;
>>>>>> +	/* Do we clash with kvm_free_stage2_pgd()? */
>>>>>> +	if (!kvm->arch.pgd)
>>>>>> +		return NULL;
>>>>>> +
>>>>>
>>>>> I think this check should be moved up in the chain. We call kvm_age_hva(), with
>>>>> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
>>>>> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
>>>>> there, like we do for other call backs from the KVM mmu_notifier.
>>>>
>>>> That probably works too - I'm not sure which version is more
>>>> consistent as well as more maintainable in the long run. I'll leave
>>>> the call here to Christoffer.
>>>>
>>>
>>> Let's look at the callers to stage2_get_pmd, which is the only caller of
>>> stage2_get_pud, where the problem was observed:
>>>
>>>  user_mem_abort
>>>   -> stage2_set_pmd_huge
>>>      -> stage2_get_pmd
>>>
>>>  user_mem_abort
>>>   -> stage2_set_pte
>>>      -> stage2_get_pmd
>>>
>>>  handle_access_fault
>>>   -> stage2_get_pmd
>>>
>>> For the above three functions, pgd cannot ever be NULL, because this is
>>> running in the context of a VCPU thread, which means the reference on
>>> the VM fd must not reach zero, so no need to call that here.
>>
>> I think there is some problem here.  See below for more information.
>>
>>>
>>>  kvm_set_spte_handler
>>>   -> stage2_set_pte
>>>      -> stage2_get_pmd
>>>
>>> This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
>>> so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
>>> already checks for !kvm->arch.pgd.
>>>
>>>  kvm_phys_addr_ioremap
>>>   -> stage2_set_pte
>>>      -> stage2_get_pmd
>>>
>>> This is called from two places: (1) The VGIC code (as part of
>>> vgic_v2_map_resources) and can only be called in the context of running
>>> a VCPU, so the pgd cannot be null by virtue of the same argument as for
>>> user_mem_abort. (2) kvm_arch_prepare_memory_region calls
>>> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
>>> how the VM can be in the middle of being freed while handling ioctls on
>>> the fd.  Therefore, following the same argument, this should be safe as
>>> well.
>>>
>>>  kvm_age_hva_handler and kvm_test_age_hva_handler
>>>   -> stage2_get_pmd
>>>
>>> Handled by the patch proposed by Suzuki.
>>>
>>> What does all that tell us?  First, it should give us some comfort that we
>>> don't have more races of this kind.  Second, it tells us that there are
>>> a number of different and not-obvious call paths to stage2_get_pud,
>>> which could be an argument to simply check the pgd whenever it's called,
>>> despite the minor runtime overhead.  On the other hand, the check itself
>>> is only valid knowing that we synchronize against kvm_free_stage2_pgd
>>> using the kvm->mmu_lock() and understanding that this only happens when
>>> mmu notifiers call into the KVM MMU code outside the context of the VM.
>>>
>>> The last consideration is the winning argument for me to put the check
>>> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
>>> important that we document why it's only these three high-level callers
>>> (incl. kvm_set_spte_handler) that need the check, either in the code or
>>> in the commit message.
>>
>> The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
>> which could be triggered via two different paths.
>>
>> 1) kvm_destroy_vm(), where all the VM resources have been released and the
>> refcount on the KVM instance is dropped, via kvm_put_kvm().
>>
>> kvm_put_kvm()
>>  kvm_destroy_vm()
>>    mmu_notifier_unregister
>>      mmu_notifier_ops->release()
>>        kvm_arch_flush_shadow_all
>>          kvm_free_stage2_pgd -> free the page table with the mmu_lock held
>>                                 occasionally releasing it to avoid contention.
>> or
>>
>> 2) do_signal -> get_signal -> do_group_exit ->
>>      do_exit
>>         exit_mm
>>          mmput => __mmput
>>            exit_mmap
>>              mmu_notifier_release
>> 	       mmu_notifier_ops->release
>>                  kvm_arch_flush_shadow_all
>>                    kvm_free_stage2_pgd
>>
>> In the first case, all references to the VM are dropped and hence none of the
>> VCPUs could still be executing.
>>
>> However, in the second case it may not be. So we have a potential problem with
>> the VCPU trying to run even when the pages were unmapped. I think the root cause
>> of all these issues boils down to the assumption that KVM holds a reference to
>> MM (which is not necessarily the user space mapping, i.e. mmgrab vs mmget).
>> I am not sure if the VCPU should hold a reference to the mmaps to make sure
>> it is safe to run. That way, the mmap stays until the VCPU eventually exits
>> and we are safe all the way around.
>
> Hmmm, my assumption is that if a VCPU is running, it means there is a
> VCPU thread that shares the struct mm which is running, so I don't
> understand how mmput would be able to call exit_mmap in the scenario
> above?
>
> So the distinction here is that I don't assume that the VCPU fd holds a
> reference to the mm, but I assume that the (running) VCPU thread does.
> Is this incorrect?

Sorry, I lost this thread in between.

Hmm. You're right. The VCPU should have a refcount on mmap and it shouldn't
do anything with the MMU if it has dropped it. I was confused by an
old bug report [see the description of commit 293f293637b55d ("kvm-arm: Unmap shadow
pagetables properly")], which has since been fixed.

Suzuki
Christoffer Dall July 16, 2017, 7:56 p.m. UTC | #9
On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote:
> On 06/07/17 10:42, Christoffer Dall wrote:
> >On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
> >>On 06/07/17 08:45, Christoffer Dall wrote:
> >>>On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
> >>>>
> >>>>
> >>>>On 05.07.17 10:57, Suzuki K Poulose wrote:
> >>>>>Hi Alex,
> >>>>>
> >>>>>On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> >>>>>>The kvm_age_hva callback may be called all the way concurrently while
> >>>>>>kvm_mmu_notifier_release() is running.
> >>>>>>
> >>>>>>The release function sets kvm->arch.pgd = NULL which the aging function
> >>>>>>however implicitly relies on in stage2_get_pud(). That means they can
> >>>>>>race and the aging function may dereference a NULL pgd pointer.
> >>>>>>
> >>>>>>This patch adds a check for that case, so that we leave the aging
> >>>>>>function silently.
> >>>>>>
> >>>>>>Cc: stable@vger.kernel.org
> >>>>>>Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> >>>>>>Signed-off-by: Alexander Graf <agraf@suse.de>
> >>>>>>
> >>>>>>---
> >>>>>>
> >>>>>>v1 -> v2:
> >>>>>>
> >>>>>> - Fix commit message
> >>>>>> - Add Fixes and stable tags
> >>>>>>---
> >>>>>>virt/kvm/arm/mmu.c | 4 ++++
> >>>>>>1 file changed, 4 insertions(+)
> >>>>>>
> >>>>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>>>index f2d5b6c..227931f 100644
> >>>>>>--- a/virt/kvm/arm/mmu.c
> >>>>>>+++ b/virt/kvm/arm/mmu.c
> >>>>>>@@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> >>>>>>	pgd_t *pgd;
> >>>>>>	pud_t *pud;
> >>>>>>+	/* Do we clash with kvm_free_stage2_pgd()? */
> >>>>>>+	if (!kvm->arch.pgd)
> >>>>>>+		return NULL;
> >>>>>>+
> >>>>>
> >>>>>I think this check should be moved up in the chain. We call kvm_age_hva(), with
> >>>>>the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
> >>>>>if we find the PGD is null when we reach kvm_age_hva(), we could simply return
> >>>>>there, like we do for other call backs from the KVM mmu_notifier.
> >>>>
> >>>>That probably works too - I'm not sure which version is more
> >>>>consistent as well as more maintainable in the long run. I'll leave
> >>>>the call here to Christoffer.
> >>>>
> >>>
> >>>Let's look at the callers to stage2_get_pmd, which is the only caller of
> >>>stage2_get_pud, where the problem was observed:
> >>>
> >>> user_mem_abort
> >>>  -> stage2_set_pmd_huge
> >>>     -> stage2_get_pmd
> >>>
> >>> user_mem_abort
> >>>  -> stage2_set_pte
> >>>     -> stage2_get_pmd
> >>>
> >>> handle_access_fault
> >>>  -> stage2_get_pmd
> >>>
> >>>For the above three functions, pgd cannot ever be NULL, because this is
> >>>running in the context of a VCPU thread, which means the reference on
> >>>the VM fd must not reach zero, so no need to call that here.
> >>
> >>I think there is some problem here.  See below for more information.
> >>
> >>>
> >>> kvm_set_spte_handler
> >>>  -> stage2_set_pte
> >>>     -> stage2_get_pmd
> >>>
> >>>This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
> >>>so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
> >>>already checks for !kvm->arch.pgd.
> >>>
> >>> kvm_phys_addr_ioremap
> >>>  -> stage2_set_pte
> >>>     -> stage2_get_pmd
> >>>
> >>>This is called from two places: (1) The VGIC code (as part of
> >>>vgic_v2_map_resources) and can only be called in the context of running
> >>>a VCPU, so the pgd cannot be null by virtue of the same argument as for
> >>>user_mem_abort. (2) kvm_arch_prepare_memory_region calls
> >>>kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
> >>>how the VM can be in the middle of being freed while handling ioctls on
> >>>the fd.  Therefore, following the same argument, this should be safe as
> >>>well.
> >>>
> >>> kvm_age_hva_handler and kvm_test_age_hva_handler
> >>>  -> stage2_get_pmd
> >>>
> >>>Handled by the patch proposed by Suzuki.
> >>>
> >>>What does all that tell us?  First, it should give us some comfort that we
> >>>don't have more races of this kind.  Second, it tells us that there are
> >>>a number of different and not-obvious call paths to stage2_get_pud,
> >>>which could be an argument to simply check the pgd whenever it's called,
> >>>despite the minor runtime overhead.  On the other hand, the check itself
> >>>is only valid knowing that we synchronize against kvm_free_stage2_pgd
> >>>using the kvm->mmu_lock() and understanding that this only happens when
> >>>mmu notifiers call into the KVM MMU code outside the context of the VM.
> >>>
> >>>The last consideration is the winning argument for me to put the check
> >>>in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
> >>>important that we document why it's only these three high-level callers
> >>>(incl. kvm_set_spte_handler) that need the check, either in the code or
> >>>in the commit message.
> >>
> >>The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
> >>which could be triggered via two different paths.
> >>
> >>1) kvm_destroy_vm(), where all the VM resources have been released and the
> >>refcount on the KVM instance is dropped, via kvm_put_kvm().
> >>
> >>kvm_put_kvm()
> >> kvm_destroy_vm()
> >>   mmu_notifier_unregister
> >>     mmu_notifier_ops->release()
> >>       kvm_arch_flush_shadow_all
> >>         kvm_free_stage2_pgd -> free the page table with the mmu_lock held
> >>                                occasionally releasing it to avoid contention.
> >>or
> >>
> >>2) do_signal -> get_signal -> do_group_exit ->
> >>     do_exit
> >>        exit_mm
> >>         mmput => __mmput
> >>           exit_mmap
> >>             mmu_notifier_release
> >>	       mmu_notifier_ops->release
> >>                 kvm_arch_flush_shadow_all
> >>                   kvm_free_stage2_pgd
> >>
> >>In the first case, all references to the VM are dropped and hence none of the
> >>VCPUs could still be executing.
> >>
> >>However, in the second case it may not be. So we have a potential problem with
> >>the VCPU trying to run even when the pages were unmapped. I think the root cause
> >>of all these issues boils down to the assumption that KVM holds a reference to
> >>MM (which is not necessarily the user space mapping, i.e. mmgrab vs mmget).
> >>I am not sure if the VCPU should hold a reference to the mmaps to make sure
> >>it is safe to run. That way, the mmap stays until the VCPU eventually exits
> >>and we are safe all the way around.
> >
> >Hmmm, my assumption is that if a VCPU is running, it means there is a
> >VCPU thread that shares the struct mm which is running, so I don't
> >understand how mmput would be able to call exit_mmap in the scenario
> >above?
> >
> >So the distinction here is that I don't assume that the VCPU fd holds a
> >reference to the mm, but I assume that the (running) VCPU thread does.
> >Is this incorrect?
> 
> Sorry, I lost this thread in between.

No worries.

> 
> Hmm. You're right. The VCPU should have a refcount on mmap and it shouldn't
> do anything with the MMU if it has dropped it. I was confused by an
> old bug report [see the description of commit 293f293637b55d ("kvm-arm: Unmap shadow
> pagetables properly")], which has since been fixed.
> 

OK, that makes sense to me.  I hope Andrea also agrees with this, so as
long as a VCPU thread is running, exit_mm() will not be called.

Thanks,
-Christoffer
Suzuki K Poulose July 17, 2017, 1:03 p.m. UTC | #10
On 16/07/17 20:56, Christoffer Dall wrote:
> On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote:
>> On 06/07/17 10:42, Christoffer Dall wrote:
>>> On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
>>>> On 06/07/17 08:45, Christoffer Dall wrote:
>>>>> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 05.07.17 10:57, Suzuki K Poulose wrote:
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>>>>>>>> The kvm_age_hva callback may be called all the way concurrently while
>>>>>>>> kvm_mmu_notifier_release() is running.
>>>>>>>>
>>>>>>>> The release function sets kvm->arch.pgd = NULL which the aging function
>>>>>>>> however implicitly relies on in stage2_get_pud(). That means they can
>>>>>>>> race and the aging function may dereference a NULL pgd pointer.
>>>>>>>>
>>>>>>>> This patch adds a check for that case, so that we leave the aging
>>>>>>>> function silently.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>>>>>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> v1 -> v2:
>>>>>>>>
>>>>>>>> - Fix commit message
>>>>>>>> - Add Fixes and stable tags
>>>>>>>> ---
>>>>>>>> virt/kvm/arm/mmu.c | 4 ++++
>>>>>>>> 1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>>>>>> index f2d5b6c..227931f 100644
>>>>>>>> --- a/virt/kvm/arm/mmu.c
>>>>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>>>>>>>> 	pgd_t *pgd;
>>>>>>>> 	pud_t *pud;
>>>>>>>> +	/* Do we clash with kvm_free_stage2_pgd()? */
>>>>>>>> +	if (!kvm->arch.pgd)
>>>>>>>> +		return NULL;
>>>>>>>> +
>>>>>>>
>>>>>>> I think this check should be moved up in the chain. We call kvm_age_hva(), with
>>>>>>> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
>>>>>>> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
>>>>>>> there, like we do for other call backs from the KVM mmu_notifier.
>>>>>>
>>>>>> That probably works too - I'm not sure which version is more
>>>>>> consistent as well as more maintainable in the long run. I'll leave
>>>>>> the call here to Christoffer.
>>>>>>
>>>>>
>>>>> Let's look at the callers to stage2_get_pmd, which is the only caller of
>>>>> stage2_get_pud, where the problem was observed:
>>>>>
>>>>> user_mem_abort
>>>>>  -> stage2_set_pmd_huge
>>>>>     -> stage2_get_pmd
>>>>>
>>>>> user_mem_abort
>>>>>  -> stage2_set_pte
>>>>>     -> stage2_get_pmd
>>>>>
>>>>> handle_access_fault
>>>>>  -> stage2_get_pmd
>>>>>
>>>>> For the above three functions, pgd cannot ever be NULL, because this is
>>>>> running in the context of a VCPU thread, which means the reference on
>>>>> the VM fd must not reach zero, so no need to call that here.
>>>>
>>>> I think there is some problem here.  See below for more information.
>>>>
>>>>>
>>>>> kvm_set_spte_handler
>>>>>  -> stage2_set_pte
>>>>>     -> stage2_get_pmd
>>>>>
>>>>> This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
> >>>>> so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
>>>>> already checks for !kvm->arch.pgd.
>>>>>
>>>>> kvm_phys_addr_ioremap
>>>>>  -> stage2_set_pte
>>>>>     -> stage2_get_pmd
>>>>>
>>>>> This is called from two places: (1) The VGIC code (as part of
>>>>> vgic_v2_map_resources) and can only be called in the context of running
>>>>> a VCPU, so the pgd cannot be null by virtue of the same argument as for
>>>>> user_mem_abort. (2) kvm_arch_prepare_memory_region calls
>>>>> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
>>>>> how the VM can be in the middle of being freed while handling ioctls on
>>>>> the fd.  Therefore, following the same argument, this should be safe as
>>>>> well.
>>>>>
>>>>> kvm_age_hva_handler and kvm_test_age_hva_handler
>>>>>  -> stage2_get_pmd
>>>>>
>>>>> Handled by the patch proposed by Suzuki.
>>>>>
>>>>> What does all that tell us?  First, it should give us some comfort that we
> >>>>> don't have more races of this kind.  Second, it tells us that there are
> >>>>> a number of different and not-obvious call paths to stage2_get_pud,
>>>>> which could be an argument to simply check the pgd whenever it's called,
>>>>> despite the minor runtime overhead.  On the other hand, the check itself
>>>>> is only valid knowing that we synchronize against kvm_free_stage2_pgd
>>>>> using the kvm->mmu_lock() and understanding that this only happens when
>>>>> mmu notifiers call into the KVM MMU code outside the context of the VM.
>>>>>
>>>>> The last consideration is the winning argument for me to put the check
>>>>> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
>>>>> important that we document why it's only these three high-level callers
>>>>> (incl. kvm_set_spte_handler) that need the check, either in the code or
>>>>> in the commit message.
>>>>
>>>> The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
>>>> which could be triggered via two different paths.
>>>>
> >>>> 1) kvm_destroy_vm(), where all the VM resources have been released and the
> >>>> refcount on the KVM instance is dropped, via kvm_put_kvm().
>>>>
>>>> kvm_put_kvm()
>>>> kvm_destroy_vm()
> >>>>   mmu_notifier_unregister
>>>>     mmu_notifier_ops->release()
>>>>       kvm_arch_flush_shadow_all
>>>>         kvm_free_stage2_pgd -> free the page table with the mmu_lock held
>>>>                                occasionally releasing it to avoid contention.
>>>> or
>>>>
> >>>> 2) do_signal -> get_signal -> do_group_exit ->
>>>>     do_exit
>>>>        exit_mm
>>>>         mmput => __mmput
>>>>           exit_mmap
>>>>             mmu_notifier_release
>>>> 	       mmu_notifier_ops->release
>>>>                 kvm_arch_flush_shadow_all
>>>>                   kvm_free_stage2_pgd
>>>>
> >>>> In the first case, all references to the VM are dropped and hence none of the
> >>>> VCPUs could still be executing.
>>>>
>>>> However, in the second case it may not be. So we have a potential problem with
>>>> the VCPU trying to run even when the pages were unmapped. I think the root cause
>>>> of all these issues boils down to the assumption that KVM holds a reference to
> >>>> MM (which is not necessarily the user space mapping, i.e. mmgrab vs mmget).
>>>> I am not sure if the VCPU should hold a reference to the mmaps to make sure
>>>> it is safe to run. That way, the mmap stays until the VCPU eventually exits
>>>> and we are safe all the way around.
>>>
>>> Hmmm, my assumption is that if a VCPU is running, it means there is a
>>> VCPU thread that shares the struct mm which is running, so I don't
>>> understand how mmput would be able to call exit_mmap in the scenario
>>> above?
>>>
>>> So the distinction here is that I don't assume that the VCPU fd holds a
>>> reference to the mm, but I assume that the (running) VCPU thread does.
>>> Is this incorrect?
>>
>> Sorry, I lost this thread in between.
>
> No worries.
>
>>
>> Hmm. You're right. The VCPU should have a refcount on mmap and it shouldn't
> >> do anything with the MMU if it has dropped it. I was confused by an
> >> old bug report [see the description of commit 293f293637b55d ("kvm-arm: Unmap shadow
> >> pagetables properly")], which has since been fixed.
>>
>

This keeps haunting me :(. I happened to look at another report from Alex back in April,
where a VCPU ends up in an mmu_notifier callback after the PGD was freed! So it does
look like you could end up freeing the stage2 page table while yet another VCPU could
be running.

See : https://lkml.org/lkml/2017/4/21/820

I am trying to see if I can reproduce the hypothesis.

Suzuki

> OK, that makes sense to me.  I hope Andrea also agrees with this, so as
> long as a VCPU thread is running, exit_mm() will not be called.
>
> Thanks,
> -Christoffer
>
Christoffer Dall July 17, 2017, 2:45 p.m. UTC | #11
On Mon, Jul 17, 2017 at 3:03 PM, Suzuki K Poulose
<Suzuki.Poulose@arm.com> wrote:
> On 16/07/17 20:56, Christoffer Dall wrote:
>>
>> On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote:
>>>
>>> On 06/07/17 10:42, Christoffer Dall wrote:
>>>>
>>>> On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
>>>>>
>>>>> On 06/07/17 08:45, Christoffer Dall wrote:
>>>>>>
>>>>>> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 05.07.17 10:57, Suzuki K Poulose wrote:
>>>>>>>>
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>>>>>>>>>
>>>>>>>>> The kvm_age_hva callback may be called all the way concurrently
>>>>>>>>> while
>>>>>>>>> kvm_mmu_notifier_release() is running.
>>>>>>>>>
>>>>>>>>> The release function sets kvm->arch.pgd = NULL which the aging
>>>>>>>>> function
>>>>>>>>> however implicitly relies on in stage2_get_pud(). That means they
>>>>>>>>> can
>>>>>>>>> race and the aging function may dereference a NULL pgd pointer.
>>>>>>>>>
>>>>>>>>> This patch adds a check for that case, so that we leave the aging
>>>>>>>>> function silently.
>>>>>>>>>
>>>>>>>>> Cc: stable@vger.kernel.org
>>>>>>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>>>>>>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> v1 -> v2:
>>>>>>>>>
>>>>>>>>> - Fix commit message
>>>>>>>>> - Add Fixes and stable tags
>>>>>>>>> ---
>>>>>>>>> virt/kvm/arm/mmu.c | 4 ++++
>>>>>>>>> 1 file changed, 4 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>>>>>>> index f2d5b6c..227931f 100644
>>>>>>>>> --- a/virt/kvm/arm/mmu.c
>>>>>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>>>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm,
>>>>>>>>> struct kvm_mmu_memory_cache *cache
>>>>>>>>>         pgd_t *pgd;
>>>>>>>>>         pud_t *pud;
>>>>>>>>> +       /* Do we clash with kvm_free_stage2_pgd()? */
>>>>>>>>> +       if (!kvm->arch.pgd)
>>>>>>>>> +               return NULL;
>>>>>>>>> +
>>>>>>>>
>>>>>>>>
>>>>>>>> I think this check should be moved up in the chain. We call
>>>>>>>> kvm_age_hva(), with
>>>>>>>> the kvm->mmu_lock held and we don't release it till we reach here.
>>>>>>>> So, ideally,
>>>>>>>> if we find the PGD is null when we reach kvm_age_hva(), we could
>>>>>>>> simply return
>>>>>>>> there, like we do for other call backs from the KVM mmu_notifier.
>>>>>>>
>>>>>>>
>>>>>>> That probably works too - I'm not sure which version is more
>>>>>>> consistent as well as more maintainable in the long run. I'll leave
>>>>>>> the call here to Christoffer.
>>>>>>>
>>>>>>
>>>>>> Let's look at the callers to stage2_get_pmd, which is the only caller
>>>>>> of
>>>>>> stage2_get_pud, where the problem was observed:
>>>>>>
>>>>>> user_mem_abort
>>>>>>  -> stage2_set_pmd_huge
>>>>>>     -> stage2_get_pmd
>>>>>>
>>>>>> user_mem_abort
>>>>>>  -> stage2_set_pte
>>>>>>     -> stage2_get_pmd
>>>>>>
>>>>>> handle_access_fault
>>>>>>  -> stage2_get_pmd
>>>>>>
>>>>>> For the above three functions, pgd cannot ever be NULL, because this
>>>>>> is
>>>>>> running in the context of a VCPU thread, which means the reference on
>>>>>> the VM fd must not reach zero, so no need to call that here.
>>>>>
>>>>>
>>>>> I think there is some problem here.  See below for more information.
>>>>>
>>>>>>
>>>>>> kvm_set_spte_handler
>>>>>>  -> stage2_set_pte
>>>>>>     -> stage2_get_pmd
>>>>>>
>>>>>> This is called from kvm_set_spte_hva, which is one of the MMU
>>>>>> notifiers,
> >>>>>> so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
>>>>>> already checks for !kvm->arch.pgd.
>>>>>>
>>>>>> kvm_phys_addr_ioremap
>>>>>>  -> stage2_set_pte
>>>>>>     -> stage2_get_pmd
>>>>>>
>>>>>> This is called from two places: (1) The VGIC code (as part of
>>>>>> vgic_v2_map_resources) and can only be called in the context of
>>>>>> running
>>>>>> a VCPU, so the pgd cannot be null by virtue of the same argument as
>>>>>> for
>>>>>> user_mem_abort. (2) kvm_arch_prepare_memory_region calls
>>>>>> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
>>>>>> how the VM can be in the middle of being freed while handling ioctls
>>>>>> on
>>>>>> the fd.  Therefore, following the same argument, this should be safe
>>>>>> as
>>>>>> well.
>>>>>>
>>>>>> kvm_age_hva_handler and kvm_test_age_hva_handler
>>>>>>  -> stage2_get_pmd
>>>>>>
>>>>>> Handled by the patch proposed by Suzuki.
>>>>>>
>>>>>> What does all that tell us?  First, it should give us some comfort
>>>>>> that we
> >>>>>> don't have more races of this kind.  Second, it tells us that there
> >>>>>> are
> >>>>>> a number of different and not-obvious call paths to stage2_get_pud,
>>>>>> which could be an argument to simply check the pgd whenever it's
>>>>>> called,
>>>>>> despite the minor runtime overhead.  On the other hand, the check
>>>>>> itself
>>>>>> is only valid knowing that we synchronize against kvm_free_stage2_pgd
>>>>>> using the kvm->mmu_lock() and understanding that this only happens
>>>>>> when
>>>>>> mmu notifiers call into the KVM MMU code outside the context of the
>>>>>> VM.
>>>>>>
>>>>>> The last consideration is the winning argument for me to put the check
>>>>>> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
>>>>>> important that we document why it's only these three high-level
>>>>>> callers
>>>>>> (incl. kvm_set_spte_handler) that need the check, either in the code
>>>>>> or
>>>>>> in the commit message.
>>>>>
>>>>>
>>>>> The only way we end up freeing the stage-2 PGD is via the
>>>>> mmu_notifier_release(),
>>>>> which could be triggered via two different paths.
>>>>>
> >>>>> 1) kvm_destroy_vm(), where all the VM resources have been released and
> >>>>> the
> >>>>> refcount on the KVM instance is dropped, via kvm_put_kvm().
>>>>>
>>>>> kvm_put_kvm()
>>>>> kvm_destroy_vm()
> >>>>>   mmu_notifier_unregister
>>>>>     mmu_notifier_ops->release()
>>>>>       kvm_arch_flush_shadow_all
>>>>>         kvm_free_stage2_pgd -> free the page table with the mmu_lock
>>>>> held
>>>>>                                occasionally releasing it to avoid
>>>>> contention.
>>>>> or
>>>>>
> >>>>> 2) do_signal -> get_signal -> do_group_exit ->
>>>>>     do_exit
>>>>>        exit_mm
>>>>>         mmput => __mmput
>>>>>           exit_mmap
>>>>>             mmu_notifier_release
>>>>>                mmu_notifier_ops->release
>>>>>                 kvm_arch_flush_shadow_all
>>>>>                   kvm_free_stage2_pgd
>>>>>
>>>>> In the first case, all references to the VM are dropped and hence none
>>>>> of the
> >>>>> VCPUs could still be executing.
>>>>>
>>>>> However, in the second case it may not be. So we have a potential
>>>>> problem with
>>>>> the VCPU trying to run even when the pages were unmapped. I think the
>>>>> root cause
>>>>> of all these issues boils down to the assumption that KVM holds a
>>>>> reference to
> >>>>> MM (which is not necessarily the user space mapping, i.e. mmgrab vs
>>>>> mmget).
>>>>> I am not sure if the VCPU should hold a reference to the mmaps to make
>>>>> sure
>>>>> it is safe to run. That way, the mmap stays until the VCPU eventually
>>>>> exits
>>>>> and we are safe all the way around.
>>>>
>>>>
>>>> Hmmm, my assumption is that if a VCPU is running, it means there is a
>>>> VCPU thread that shares the struct mm which is running, so I don't
>>>> understand how mmput would be able to call exit_mmap in the scenario
>>>> above?
>>>>
>>>> So the distinction here is that I don't assume that the VCPU fd holds a
>>>> reference to the mm, but I assume that the (running) VCPU thread does.
>>>> Is this incorrect?
>>>
>>>
>>> Sorry, I lost this thread in between.
>>
>>
>> No worries.
>>
>>>
>>> Hmm. You're right. The VCPU should have a refcount on mmap and it
>>> shouldn't
> >>> do anything with the MMU if it has dropped it. I was confused by an
> >>> old bug report [see the description of commit 293f293637b55d ("kvm-arm:
> >>> Unmap shadow pagetables
> >>> properly")], which has since been fixed.
>>>
>>
>
> This keeps haunting me :(. I happened to look at another report from Alex
> back in April,
> where a VCPU ends up in an mmu_notifier callback after the PGD was freed!
> So it does
> look like you could end up freeing the stage2 page table while yet another
> VCPU could
> be running.
>
> See : https://lkml.org/lkml/2017/4/21/820
>
> I am trying to see if I can reproduce the hypothesis.
>
I would also very much like to get to the bottom of this, and at the
very least try to get a valid explanation as to how a thread can be
*running* for a process where there are zero references to the struct
mm?

I guess I am asking where this mmput() can happen for a perfectly
running thread, which hasn't processed signals or exited itself yet.

The dump you reference above seems to indicate that it's happening
under memory pressure and trying to unmap memory from the VM to
allocate memory to the VM, but all seems to be happening within a VCPU
thread, or am I reading this wrong?

Thanks,
-Christoffer
Andrea Arcangeli July 17, 2017, 3:16 p.m. UTC | #12
On Mon, Jul 17, 2017 at 04:45:10PM +0200, Christoffer Dall wrote:
> I would also very much like to get to the bottom of this, and at the
> very least try to get a valid explanation as to how a thread can be
> *running* for a process where there are zero references to the struct
> mm?

A thread shouldn't possibly be running if mm->mm_users is zero.

> I guess I am asking where this mmput() can happen for a perfectly
> running thread, which hasn't processed signals or exited itself yet.

mmput runs during exit(), after that point the vcpu can't run the KVM
ioctl anymore.

> The dump you reference above seems to indicate that it's happening
> under memory pressure and trying to unmap memory from the VM to
> allocate memory to the VM, but all seems to be happening within a VCPU
> thread, or am I reading this wrong?

In the oops the pgd was none while KVM vcpu ioctl was running, the
most likely explanation is that there were two VMs running in parallel in
the host, and the other one was quitting (mm_count of the other VM was
zero, while mm_count of the VM that oopsed within the vcpu ioctl was >
0). The oops information itself can't tell if there were one or two VMs
running in the host, so > 1 VM running is the most plausible
explanation that doesn't break the above invariants. It'd be nice
if Alexander could confirm it, if he remembers that specific setup
after a couple of months since it happened.

Even if there was just one VM running in the host, it would more
likely mean something inside KVM ARM code is clearing the pgd before
mm_users reaches zero, i.e. before the last mmput.

It's very unlikely mm_users could have been zero while the vcpu thread
was running, as many more things would fall apart in that case, not
just the needed pgd check during mmu notifier post process exit.

Thanks,
Andrea
Christoffer Dall July 17, 2017, 6:23 p.m. UTC | #13
On Mon, Jul 17, 2017 at 05:16:17PM +0200, Andrea Arcangeli wrote:
> On Mon, Jul 17, 2017 at 04:45:10PM +0200, Christoffer Dall wrote:
> > I would also very much like to get to the bottom of this, and at the
> > very least try to get a valid explanation as to how a thread can be
> > *running* for a process where there are zero references to the struct
> > mm?
> 
> A thread shouldn't possibly be running if mm->mm_users is zero.
> 

ok, good, then I don't have to re-take OS 101.

> > I guess I am asking where this mmput() can happen for a perfectly
> > running thread, which hasn't processed signals or exited itself yet.
> 
> mmput runs during exit(), after that point the vcpu can't run the KVM
> ioctl anymore.
> 

also very comforting that we agree on this.

> > The dump you reference above seems to indicate that it's happening
> > under memory pressure and trying to unmap memory from the VM to
> > allocate memory to the VM, but all seems to be happening within a VCPU
> > thread, or am I reading this wrong?
> 
> In the oops the pgd was none while a KVM vcpu ioctl was running; the
> most likely explanation is that there were two VMs running in parallel
> on the host, and the other one was quitting (mm_count of the other VM
> was zero, while mm_count of the VM that oopsed within the vcpu ioctl
> was > 0). The oops information itself can't tell whether one or two
> VMs were running on the host, so > 1 VM running is the most plausible
> explanation that doesn't break the above invariants.

That's very keenly observed, and a really good explanation.


> It'd be nice
> if Alexander can confirm it, if he remembers about that specific setup
> after a couple of months since it happened.

My guess is that this was observed on the SUSE build machines with
arm64, and Alex usually explains that these machines run *lots* of VMs
at the same time, so this sounds very likely.

Alex, can you confirm this was the type of workload?

> 
> Even if there was just one VM running on the host, it would more
> likely mean something inside the KVM ARM code is clearing the pgd
> before mm_users reaches zero, i.e. before the last mmput.

I don't think we have this.

> 
> It's very unlikely mm_users could have been zero while the vcpu thread
> was running, as many more things would fall apart in that case, not
> just the needed pgd check in the mmu notifier after process exit.
> 

That was my rationale exactly.  Thanks for confirming!

-Christoffer
Alexander Graf July 17, 2017, 8:49 p.m. UTC | #14
On 17.07.17 20:23, Christoffer Dall wrote:
> On Mon, Jul 17, 2017 at 05:16:17PM +0200, Andrea Arcangeli wrote:
>> On Mon, Jul 17, 2017 at 04:45:10PM +0200, Christoffer Dall wrote:
>>> I would also very much like to get to the bottom of this, and at the
>>> very least try to get a valid explanation as to how a thread can be
>>> *running* for a process where there are zero references to the struct
>>> mm?
>>
>> A thread shouldn't possibly be running if mm->mm_users is zero.
>>
> 
> ok, good, then I don't have to re-take OS 101.
> 
>>> I guess I am asking where this mmput() can happen for a perfectly
>>> running thread, which hasn't processed signals or exited itself yet.
>>
>> mmput runs during exit(), after that point the vcpu can't run the KVM
>> ioctl anymore.
>>
> 
> also very comforting that we agree on this.
> 
>>> The dump you reference above seems to indicate that it's happening
>>> under memory pressure and trying to unmap memory from the VM to
>>> allocate memory to the VM, but all seems to be happening within a VCPU
>>> thread, or am I reading this wrong?
>>
>> In the oops the pgd was none while a KVM vcpu ioctl was running; the
>> most likely explanation is that there were two VMs running in parallel
>> on the host, and the other one was quitting (mm_count of the other VM
>> was zero, while mm_count of the VM that oopsed within the vcpu ioctl
>> was > 0). The oops information itself can't tell whether one or two
>> VMs were running on the host, so > 1 VM running is the most plausible
>> explanation that doesn't break the above invariants.
> 
> That's very keenly observed, and a really good explanation.
> 
> 
>> It'd be nice
>> if Alexander can confirm it, if he remembers about that specific setup
>> after a couple of months since it happened.
> 
> My guess is that this was observed on the SUSE build machines with
> arm64, and Alex usually explains that these machines run *lots* of VMs
> at the same time, so this sounds very likely.
> 
> Alex, can you confirm this was the type of workload?

Yes, most KVM issues I see are either with OBS (lots of build VMs in 
parallel) or OpenQA (lots of VMs in parallel clicking, typing and 
matching screen contents).

I don't remember which one of the two use cases that particular dump 
came from, but in any case there was certainly more than one VM running. 
We're usually in the range of 20-40 VMs per system.


Alex

Patch

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e2e5eff..db1c7b2 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1665,12 +1665,16 @@  static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
 
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
+	if (!kvm->arch.pgd)
+		return 0;
 	trace_kvm_age_hva(start, end);
 	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 {
+	if (!kvm->arch.pgd)
+		return 0;
 	trace_kvm_test_age_hva(hva);
 	return handle_hva_to_gpa(kvm, hva, hva, kvm_test_age_hva_handler, NULL);
 }