
kvm: x86: vmx: fix vpid leak

Message ID 20180719185907.21335-1-rkagan@virtuozzo.com (mailing list archive)
State New, archived

Commit Message

Roman Kagan July 19, 2018, 6:59 p.m. UTC
VPID for the nested vcpu is allocated at vmx_create_vcpu whenever nested
vmx is turned on with the module parameter.

However, it's only freed if the L1 guest has executed VMXON which is not
a given.

As a result, on a system with nested==on every creation+deletion of an
L1 vcpu without running an L2 guest results in leaking one vpid.  Since
the total number of vpids is limited to 64k, they can eventually get
exhausted, preventing L2 from starting.

Delay allocation of the L2 vpid until VMXON emulation, thus matching its
freeing.

Fixes: 5c614b3583e7 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
 arch/x86/kvm/vmx.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)
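
To make the lifecycle mismatch concrete, below is a small, self-contained
C model of the leak.  It is a toy, not the kernel code: the real bitmap in
vmx.c holds 64k entries, and the early return lives in free_nested(),
which bails out before freeing anything when VMXON was never executed.

#include <stdio.h>
#include <stdbool.h>

/* Toy stand-in for KVM's global vpid bitmap (really 64k entries). */
#define NR_VPIDS 8
static bool vpid_bitmap[NR_VPIDS] = { [0] = true };	/* vpid 0 reserved */

static int allocate_vpid(void)
{
	for (int i = 1; i < NR_VPIDS; i++)
		if (!vpid_bitmap[i]) {
			vpid_bitmap[i] = true;
			return i;
		}
	return 0;	/* pool exhausted: no dedicated tag left */
}

static void free_vpid(int vpid)
{
	if (vpid)
		vpid_bitmap[vpid] = false;
}

struct toy_vcpu {
	bool vmxon;
	int vpid02;
};

static void create_vcpu(struct toy_vcpu *v)
{
	v->vmxon = false;
	v->vpid02 = allocate_vpid();	/* pre-patch: taken whenever nested=on */
}

static void destroy_vcpu(struct toy_vcpu *v)
{
	if (!v->vmxon)
		return;			/* the bug: never freed without VMXON */
	free_vpid(v->vpid02);
}

int main(void)
{
	/* Create and destroy L1 vcpus that never execute VMXON. */
	for (int round = 0; round < 10; round++) {
		struct toy_vcpu v;
		create_vcpu(&v);
		destroy_vcpu(&v);	/* vpid02 leaks on every round */
	}
	printf("next vpid: %d (0 means the pool is exhausted)\n",
	       allocate_vpid());
	return 0;
}

With the patch applied, vCPU creation no longer allocates at all; the
allocation moves to the VMXON path, where the vmxon flag guarantees a
matching free.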

Comments

Jim Mattson July 19, 2018, 7:44 p.m. UTC | #1
If we're using nested EPT, why not do away with this allocation
altogether, and just use the vpid from vmcs12? The TLB is tagged by
{PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
L1 EP4TA.
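
To illustrate the tagging Jim refers to, here is a conceptual sketch of
the tags the CPU associates with cached translations when EPT and VPID
are both in use (cf. the SDM's description of cached translation
information).  The struct is an illustration of the model, not a real
hardware or kernel structure:

#include <stdint.h>

/* Conceptual model only: the hardware exposes no such struct. */
struct tlb_tag {
	uint16_t vpid;	/* VPID field of the current VMCS */
	uint64_t ep4ta;	/* address bits of the EPT pointer in use */
	uint16_t pcid;	/* PCID from CR3, when PCIDs are enabled */
};

Because KVM runs L2 on a shadow EPT hierarchy, the ep4ta component alone
already separates L2's combined mappings from any L1 mapping, which is
why reusing the vpid from vmcs12 would still be correct.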

On Thu, Jul 19, 2018 at 11:59 AM, Roman Kagan <rkagan@virtuozzo.com> wrote:
> VPID for the nested vcpu is allocated at vmx_create_vcpu whenever nested
> vmx is turned on with the module parameter.
>
> However, it's only freed if the L1 guest has executed VMXON which is not
> a given.
>
> As a result, on a system with nested==on every creation+deletion of an
> L1 vcpu without running an L2 guest results in leaking one vpid.  Since
> the total number of vpids is limited to 64k, they can eventually get
> exhausted, preventing L2 from starting.
>
> Delay allocation of the L2 vpid until VMXON emulation, thus matching its
> freeing.
>
> Fixes: 5c614b3583e7b6dab0c86356fa36c2bcbb8322a0
> Cc: stable@vger.kernel.org
> Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> ---
>  arch/x86/kvm/vmx.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e30da9a2430c..fcb33ba15bb9 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7893,6 +7893,8 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
>                      HRTIMER_MODE_REL_PINNED);
>         vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
>
> +       vmx->nested.vpid02 = allocate_vpid();
> +
>         vmx->nested.vmxon = true;
>         return 0;
>
> @@ -10370,11 +10372,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>                         goto free_vmcs;
>         }
>
> -       if (nested) {
> +       if (nested)
>                 nested_vmx_setup_ctls_msrs(&vmx->nested.msrs,
>                                            kvm_vcpu_apicv_active(&vmx->vcpu));
> -               vmx->nested.vpid02 = allocate_vpid();
> -       }
>
>         vmx->nested.posted_intr_nv = -1;
>         vmx->nested.current_vmptr = -1ull;
> @@ -10391,7 +10391,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>         return &vmx->vcpu;
>
>  free_vmcs:
> -       free_vpid(vmx->nested.vpid02);
>         free_loaded_vmcs(vmx->loaded_vmcs);
>  free_msrs:
>         kfree(vmx->guest_msrs);
> --
> 2.17.1
>
Roman Kagan July 20, 2018, 3:31 p.m. UTC | #2
On Thu, Jul 19, 2018 at 12:44:53PM -0700, Jim Mattson wrote:
> If we're using nested EPT, why not do away with this allocation
> altogether, and just use the vpid from vmcs12? The TLB is tagged by
> {PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
> L1 EP4TA.

Hmm, I'm wondering why the original commit brought any improvement then?

The commit message says

> VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
> the same VPID to run L1 and all its guests. KVM flushes VPID when switching
> between L1 and L2.
> 
> This patch advertises VPID to the L1 hypervisor, then address space of L1
> and L2 can be separately treated and avoid TLB flush when switching between
> L1 and L2. For each nested vmentry, if vpid12 is changed, reuse shadow vpid
> w/ an invvpid.

but my understanding of your comment is that even if L1 and L2 share the
vpid, flushing one doesn't flush the other, so all of the performance
gain of that commit comes from skipping the flush when entering the same
L2 vcpu and/or advertising vpid support to the L1 hypervisor.  Is this
correct?

Thanks,
Roman.

> On Thu, Jul 19, 2018 at 11:59 AM, Roman Kagan <rkagan@virtuozzo.com> wrote:
> > VPID for the nested vcpu is allocated at vmx_create_vcpu whenever nested
> > vmx is turned on with the module parameter.
> >
> > However, it's only freed if the L1 guest has executed VMXON which is not
> > a given.
> >
> > As a result, on a system with nested==on every creation+deletion of an
> > L1 vcpu without running an L2 guest results in leaking one vpid.  Since
> > the total number of vpids is limited to 64k, they can eventually get
> > exhausted, preventing L2 from starting.
> >
> > Delay allocation of the L2 vpid until VMXON emulation, thus matching its
> > freeing.
> >
> > Fixes: 5c614b3583e7b6dab0c86356fa36c2bcbb8322a0
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
> > ---
> >  arch/x86/kvm/vmx.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index e30da9a2430c..fcb33ba15bb9 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7893,6 +7893,8 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
> >                      HRTIMER_MODE_REL_PINNED);
> >         vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
> >
> > +       vmx->nested.vpid02 = allocate_vpid();
> > +
> >         vmx->nested.vmxon = true;
> >         return 0;
> >
> > @@ -10370,11 +10372,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> >                         goto free_vmcs;
> >         }
> >
> > -       if (nested) {
> > +       if (nested)
> >                 nested_vmx_setup_ctls_msrs(&vmx->nested.msrs,
> >                                            kvm_vcpu_apicv_active(&vmx->vcpu));
> > -               vmx->nested.vpid02 = allocate_vpid();
> > -       }
> >
> >         vmx->nested.posted_intr_nv = -1;
> >         vmx->nested.current_vmptr = -1ull;
> > @@ -10391,7 +10391,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
> >         return &vmx->vcpu;
> >
> >  free_vmcs:
> > -       free_vpid(vmx->nested.vpid02);
> >         free_loaded_vmcs(vmx->loaded_vmcs);
> >  free_msrs:
> >         kfree(vmx->guest_msrs);
> > --
> > 2.17.1
> >
Paolo Bonzini July 20, 2018, 4:07 p.m. UTC | #3
On 19/07/2018 21:44, Jim Mattson wrote:
> If we're using nested EPT, why not do away with this allocation
> altogether, and just use the vpid from vmcs12? The TLB is tagged by
> {PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
> L1 EP4TA.

For correctness that's true, but INVEPT and INVVPID would impact
performance of L1 and especially of other guests sharing the host with L1.

In particular, we map an all-context INVVPID from L1 to a single-context
INVVPID on the vpid02.  I think we could also replace the
KVM_REQ_TLB_FLUSH in INVEPT with a single-context INVVPID; the
"kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu)" in handle_invept in fact
doesn't make much sense.

Paolo
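
A hypothetical sketch of the direction Paolo describes, reusing the
__invvpid() helper and the VMX_VPID_EXTENT_SINGLE_CONTEXT constant that
vmx.c already has; the function name is made up and the sketch is
untested, not a proposed patch:

/*
 * On an emulated INVEPT from L1, invalidate only the mappings tagged
 * with the shadow vpid02 instead of requesting a full TLB flush;
 * fall back to the old behaviour when no vpid02 is available.
 */
static void nested_flush_on_invept(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (enable_vpid && vmx->nested.vpid02)
		__invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT,
			  vmx->nested.vpid02, 0);
	else
		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
}
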
Roman Kagan July 23, 2018, 6:22 p.m. UTC | #4
On Fri, Jul 20, 2018 at 06:07:01PM +0200, Paolo Bonzini wrote:
> On 19/07/2018 21:44, Jim Mattson wrote:
> > If we're using nested EPT, why not do away with this allocation
> > altogether, and just use the vpid from vmcs12? The TLB is tagged by
> > {PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
> > L1 EP4TA.
> 
> For correctness that's true, but INVEPT and INVVPID would impact
> performance of L1 and especially of other guests sharing the host with L1.
> 
> In particular, we map an all-context INVVPID from L1 to a single-context
> INVVPID on the vpid02.  I think we could also replace the
> KVM_REQ_TLB_FLUSH in INVEPT with a single-context INVVPID; the
> "kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu)" in handle_invept in fact
> doesn't make much sense.

I'll be unavailable for a couple of weeks starting the coming Wednesday
so I won't be able to work on this.

Meanwhile this leak is real, and the failures to run nested guests are
not very easy to diagnose, so, unless there's someone else ready to do a
better fix, it may make sense to consider my patch short term and
revisit it later with a cleaner solution.

What do you think?

Roman.
Jim Mattson July 23, 2018, 6:26 p.m. UTC | #5
Sounds good to me.


On Mon, Jul 23, 2018 at 11:22 AM Roman Kagan <rkagan@virtuozzo.com> wrote:

> On Fri, Jul 20, 2018 at 06:07:01PM +0200, Paolo Bonzini wrote:
> > On 19/07/2018 21:44, Jim Mattson wrote:
> > > If we're using nested EPT, why not do away with this allocation
> > > altogether, and just use the vpid from vmcs12? The TLB is tagged by
> > > {PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
> > > L1 EP4TA.
> >
> > For correctness that's true, but INVEPT and INVVPID would impact
> > performance of L1 and especially of other guests sharing the host with
> L1.
> >
> > In particular, we map an all-context INVVPID from L1 to a single-context
> > INVVPID on the vpid02.  I think we could also replace the
> > KVM_REQ_TLB_FLUSH in INVEPT with a single-context INVVPID; the
> > "kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu)" in handle_invept in fact
> > doesn't make much sense.
>
> I'll be unavailable for a couple of weeks starting the coming Wednesday
> so I won't be able to work on this.
>
> Meanwhile this leak is real, and the failures to run nested guests are
> not very easy to diagnose, so, unless there's someone else ready to do a
> better fix, it may make sense to consider my patch short term and
> revisit it later with a cleaner solution.
>
> What do you think?
>
> Roman.
>
Paolo Bonzini July 24, 2018, 11:02 a.m. UTC | #6
On 23/07/2018 20:22, Roman Kagan wrote:
> On Fri, Jul 20, 2018 at 06:07:01PM +0200, Paolo Bonzini wrote:
>> On 19/07/2018 21:44, Jim Mattson wrote:
>>> If we're using nested EPT, why not do away with this allocation
>>> altogether, and just use the vpid from vmcs12? The TLB is tagged by
>>> {PCID, EP4TA, VPID}, and the shadow EP4TA will be different from any
>>> L1 EP4TA.
>>
>> For correctness that's true, but INVEPT and INVVPID would impact
>> performance of L1 and especially of other guests sharing the host with L1.
>>
>> In particular, we map an all-context INVVPID from L1 to a single-context
>> INVVPID on the vpid02.  I think we could also replace the
>> KVM_REQ_TLB_FLUSH in INVEPT with a single-context INVVPID; the
>> "kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu)" in handle_invept in fact
>> doesn't make much sense.
> 
> I'll be unavailable for a couple of weeks starting the coming Wednesday
> so I won't be able to work on this.
> 
> Meanwhile this leak is real, and the failures to run nested guests are
> not very easy to diagnose, so, unless there's someone else ready to do a
> better fix, it may make sense to consider my patch short term and
> revisit it later with a cleaner solution.
> 
> What do you think?

Yes, your patch is already in kvm/master. :)

Paolo

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e30da9a2430c..fcb33ba15bb9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7893,6 +7893,8 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
 		     HRTIMER_MODE_REL_PINNED);
 	vmx->nested.preemption_timer.function = vmx_preemption_timer_fn;
 
+	vmx->nested.vpid02 = allocate_vpid();
+
 	vmx->nested.vmxon = true;
 	return 0;
 
@@ -10370,11 +10372,9 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 			goto free_vmcs;
 	}
 
-	if (nested) {
+	if (nested)
 		nested_vmx_setup_ctls_msrs(&vmx->nested.msrs,
 					   kvm_vcpu_apicv_active(&vmx->vcpu));
-		vmx->nested.vpid02 = allocate_vpid();
-	}
 
 	vmx->nested.posted_intr_nv = -1;
 	vmx->nested.current_vmptr = -1ull;
@@ -10391,7 +10391,6 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	return &vmx->vcpu;
 
 free_vmcs:
-	free_vpid(vmx->nested.vpid02);
 	free_loaded_vmcs(vmx->loaded_vmcs);
 free_msrs:
 	kfree(vmx->guest_msrs);