[v2,0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use

Message ID	20210517135054.1914802-1-vkuznets@redhat.com (mailing list archive)
Headers	show Return-Path: <kvm-owner@kernel.org> From: Vitaly Kuznetsov <vkuznets@redhat.com> To: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Maxim Levitsky <mlevitsk@redhat.com>, linux-kernel@vger.kernel.org Subject: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Date: Mon, 17 May 2021 15:50:47 +0200 Message-Id: <20210517135054.1914802-1-vkuznets@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	KVM: nVMX: Fixes for nested state migration when eVMCS is in use \| expand [v2,0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use [v2,1/7] KVM: nVMX: Introduce nested_evmcs_is_used() [v2,2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR [v2,3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state… [v2,4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() [v2,5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() [v2,6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration [v2,7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost

Vitaly Kuznetsov May 17, 2021, 1:50 p.m. UTC

Changes since v1 (Sean):
- Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
- Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
  copy_enlightened_to_vmcs12().

Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
 + WSL2) was crashing immediately after migration. It was also reported
that we have more issues to fix as, while the failure rate was lowered 
signifincatly, it was still possible to observe crashes after several
dozens of migration. Turns out, the issue arises when we manage to issue
KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
the flag itself is not part of saved nested state. A few other less 
significant issues are fixed along the way.

While there's no proof this series fixes all eVMCS related problems,
Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
crashing in testing.

Patches are based on the current kvm/next tree.

Vitaly Kuznetsov (7):
  KVM: nVMX: Introduce nested_evmcs_is_used()
  KVM: nVMX: Release enlightened VMCS on VMCLEAR
  KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
    vmx_get_nested_state()
  KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
  KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
  KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
  KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
    lost

 arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
 .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
 2 files changed, 115 insertions(+), 59 deletions(-)

Maxim Levitsky May 24, 2021, 12:08 p.m. UTC | #1

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> Changes since v1 (Sean):
> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>   copy_enlightened_to_vmcs12().
> 
> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>  + WSL2) was crashing immediately after migration. It was also reported
> that we have more issues to fix as, while the failure rate was lowered 
> signifincatly, it was still possible to observe crashes after several
> dozens of migration. Turns out, the issue arises when we manage to issue
> KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
> to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> the flag itself is not part of saved nested state. A few other less 
> significant issues are fixed along the way.
> 
> While there's no proof this series fixes all eVMCS related problems,
> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> crashing in testing.
> 
> Patches are based on the current kvm/next tree.
> 
> Vitaly Kuznetsov (7):
>   KVM: nVMX: Introduce nested_evmcs_is_used()
>   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>     vmx_get_nested_state()
>   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>     lost
> 
>  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>  2 files changed, 115 insertions(+), 59 deletions(-)
> 

Hi Vitaly!

In addition to the review of this patch series, I would like
to share an idea on how to avoid the hack of mapping the evmcs
in nested_vmx_vmexit, because I think I found a possible generic
solution to this and similar issues:

The solution is to always set nested_run_pending after 
nested migration (which means that we won't really
need to migrate this flag anymore).

I was thinking a lot about it and I think that there is no downside to this,
other than sometimes a one extra vmexit after migration.

Otherwise there is always a risk of the following scenario:

  1. We migrate with nested_run_pending=0 (but don't restore all the state
  yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
  or just the guest memory map is not up to date, guest is in smm or something
  like that)

  2. Userspace calls some ioctl that causes a nested vmexit

  This can happen today if the userspace calls 
  kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events

  3. Userspace finally sets correct guest's msrs, correct guest memory map and only
  then calls KVM_RUN

This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
but we have to do so to complete the nested vmexit.

To some extent, the entry to the nested mode after a migration is only complete
when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.

This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
nested vmexit path at all. 

Best regards,
	Maxim Levitsky

Vitaly Kuznetsov May 24, 2021, 12:44 p.m. UTC | #2

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> Changes since v1 (Sean):
>> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
>> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>>   copy_enlightened_to_vmcs12().
>> 
>> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
>> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>>  + WSL2) was crashing immediately after migration. It was also reported
>> that we have more issues to fix as, while the failure rate was lowered 
>> signifincatly, it was still possible to observe crashes after several
>> dozens of migration. Turns out, the issue arises when we manage to issue
>> KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
>> to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
>> the flag itself is not part of saved nested state. A few other less 
>> significant issues are fixed along the way.
>> 
>> While there's no proof this series fixes all eVMCS related problems,
>> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
>> crashing in testing.
>> 
>> Patches are based on the current kvm/next tree.
>> 
>> Vitaly Kuznetsov (7):
>>   KVM: nVMX: Introduce nested_evmcs_is_used()
>>   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>>   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>>     vmx_get_nested_state()
>>   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>>   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>>   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>>   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>>     lost
>> 
>>  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>>  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>>  2 files changed, 115 insertions(+), 59 deletions(-)
>> 
>
> Hi Vitaly!
>
> In addition to the review of this patch series,

Thanks by the way!

>  I would like
> to share an idea on how to avoid the hack of mapping the evmcs
> in nested_vmx_vmexit, because I think I found a possible generic
> solution to this and similar issues:
>
> The solution is to always set nested_run_pending after 
> nested migration (which means that we won't really
> need to migrate this flag anymore).
>
> I was thinking a lot about it and I think that there is no downside to this,
> other than sometimes a one extra vmexit after migration.
>
> Otherwise there is always a risk of the following scenario:
>
>   1. We migrate with nested_run_pending=0 (but don't restore all the state
>   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
>   or just the guest memory map is not up to date, guest is in smm or something
>   like that)
>
>   2. Userspace calls some ioctl that causes a nested vmexit
>
>   This can happen today if the userspace calls 
>   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
>
>   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
>   then calls KVM_RUN
>
> This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> but we have to do so to complete the nested vmexit.

Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
only one place which calls copy_vmcs12_to_enlightened():
nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
vmx_prepare_switch_to_guest() so unless userspace decided to execute
not-fully-restored guest this should not happen. I'm probably missing
something in your scenario)

>
> To some extent, the entry to the nested mode after a migration is only complete
> when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
>
> This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> nested vmexit path at all. 

Remember, we have three possible states when nested state is
transferred:
1) L2 was running
2) L1 was running
3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).

Is 'nested_run_pending' suitable for all of them? Could you maybe draft
a patch so we can see how this works (in both 'normal' and 'evmcs'
cases)?

Paolo Bonzini May 24, 2021, 2:01 p.m. UTC | #3

On 17/05/21 15:50, Vitaly Kuznetsov wrote:
> Changes since v1 (Sean):
> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>    copy_enlightened_to_vmcs12().
> 
> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>   + WSL2) was crashing immediately after migration. It was also reported
> that we have more issues to fix as, while the failure rate was lowered
> signifincatly, it was still possible to observe crashes after several
> dozens of migration. Turns out, the issue arises when we manage to issue
> KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
> to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> the flag itself is not part of saved nested state. A few other less
> significant issues are fixed along the way.
> 
> While there's no proof this series fixes all eVMCS related problems,
> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> crashing in testing.
> 
> Patches are based on the current kvm/next tree.
> 
> Vitaly Kuznetsov (7):
>    KVM: nVMX: Introduce nested_evmcs_is_used()
>    KVM: nVMX: Release enlightened VMCS on VMCLEAR
>    KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>      vmx_get_nested_state()
>    KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>    KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>    KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>    KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>      lost
> 
>   arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>   .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>   2 files changed, 115 insertions(+), 59 deletions(-)
> 

Looks good, I'm possibly expecting a v3 depending on what you think 
about my patch 1 suggestion.

Paolo

Maxim Levitsky May 26, 2021, 2:41 p.m. UTC | #4

On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > Changes since v1 (Sean):
> > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > >   copy_enlightened_to_vmcs12().
> > > 
> > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > >  + WSL2) was crashing immediately after migration. It was also reported
> > > that we have more issues to fix as, while the failure rate was lowered 
> > > signifincatly, it was still possible to observe crashes after several
> > > dozens of migration. Turns out, the issue arises when we manage to issue
> > > KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
> > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> > > the flag itself is not part of saved nested state. A few other less 
> > > significant issues are fixed along the way.
> > > 
> > > While there's no proof this series fixes all eVMCS related problems,
> > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > crashing in testing.
> > > 
> > > Patches are based on the current kvm/next tree.
> > > 
> > > Vitaly Kuznetsov (7):
> > >   KVM: nVMX: Introduce nested_evmcs_is_used()
> > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > >     vmx_get_nested_state()
> > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > >     lost
> > > 
> > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
> > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
> > >  2 files changed, 115 insertions(+), 59 deletions(-)
> > > 
> > 
> > Hi Vitaly!
> > 
> > In addition to the review of this patch series,
> 
> Thanks by the way!
No problem!

> 
> >  I would like
> > to share an idea on how to avoid the hack of mapping the evmcs
> > in nested_vmx_vmexit, because I think I found a possible generic
> > solution to this and similar issues:
> > 
> > The solution is to always set nested_run_pending after 
> > nested migration (which means that we won't really
> > need to migrate this flag anymore).
> > 
> > I was thinking a lot about it and I think that there is no downside to this,
> > other than sometimes a one extra vmexit after migration.
> > 
> > Otherwise there is always a risk of the following scenario:
> > 
> >   1. We migrate with nested_run_pending=0 (but don't restore all the state
> >   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
> >   or just the guest memory map is not up to date, guest is in smm or something
> >   like that)
> > 
> >   2. Userspace calls some ioctl that causes a nested vmexit
> > 
> >   This can happen today if the userspace calls 
> >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > 
> >   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
> >   then calls KVM_RUN
> > 
> > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > but we have to do so to complete the nested vmexit.
> 
> Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
> only one place which calls copy_vmcs12_to_enlightened():
> nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
> vmx_prepare_switch_to_guest() so unless userspace decided to execute
> not-fully-restored guest this should not happen. I'm probably missing
> something in your scenario)
You are right! 
The evmcs write is delayed to the next vmentry.

However since we are now mapping the evmcs during nested vmexit,
and this can fail for example that HV assist msr is not up to date.

For example consider this: 

1. Userspace first sets nested state
2. Userspace calls KVM_GET_MP_STATE.
3. Nested vmexit that happened in 2 will end up not be able to map the evmcs,
since HV_ASSIST msr is not yet loaded.


Also the vmcb write (that is for SVM) _is_ done right away on nested vmexit 
and conceptually has the same issue.
(if memory map is not up to date, we might not be able to read/write the 
vmcb12 on nested vmexit)


> 
> > To some extent, the entry to the nested mode after a migration is only complete
> > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
> > 
> > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> > nested vmexit path at all. 
> 
> Remember, we have three possible states when nested state is
> transferred:
> 1) L2 was running
> 2) L1 was running
> 3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).

I understand. This suggestion wasn't meant to fix the case 3, but more to fix
case 1, where we are in L2, migrate, and then immediately decide to 
do a nested vmexit before we processed the KVM_REQ_GET_NESTED_STATE_PAGES
request, and also before potentially before the guest state was fully uploaded
(see that KVM_GET_MP_STATE thing).
 
In a nutshell, I vote for not allowing nested vmexits from the moment
when we set the nested state and until the moment we enter the nested
guest once (maybe with request for immediate vmexit),
because during this time period, the guest state is not fully consistent.

Best regards,
	Maxim Levitsky

> 
> Is 'nested_run_pending' suitable for all of them? Could you maybe draft
> a patch so we can see how this works (in both 'normal' and 'evmcs'
> cases)?
>

Vitaly Kuznetsov May 27, 2021, 8:01 a.m. UTC | #5

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> 
>> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> > > Changes since v1 (Sean):
>> > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
>> > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>> > >   copy_enlightened_to_vmcs12().
>> > > 
>> > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
>> > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>> > >  + WSL2) was crashing immediately after migration. It was also reported
>> > > that we have more issues to fix as, while the failure rate was lowered 
>> > > signifincatly, it was still possible to observe crashes after several
>> > > dozens of migration. Turns out, the issue arises when we manage to issue
>> > > KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
>> > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
>> > > the flag itself is not part of saved nested state. A few other less 
>> > > significant issues are fixed along the way.
>> > > 
>> > > While there's no proof this series fixes all eVMCS related problems,
>> > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
>> > > crashing in testing.
>> > > 
>> > > Patches are based on the current kvm/next tree.
>> > > 
>> > > Vitaly Kuznetsov (7):
>> > >   KVM: nVMX: Introduce nested_evmcs_is_used()
>> > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>> > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>> > >     vmx_get_nested_state()
>> > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>> > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>> > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>> > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>> > >     lost
>> > > 
>> > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>> > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>> > >  2 files changed, 115 insertions(+), 59 deletions(-)
>> > > 
>> > 
>> > Hi Vitaly!
>> > 
>> > In addition to the review of this patch series,
>> 
>> Thanks by the way!
> No problem!
>
>> 
>> >  I would like
>> > to share an idea on how to avoid the hack of mapping the evmcs
>> > in nested_vmx_vmexit, because I think I found a possible generic
>> > solution to this and similar issues:
>> > 
>> > The solution is to always set nested_run_pending after 
>> > nested migration (which means that we won't really
>> > need to migrate this flag anymore).
>> > 
>> > I was thinking a lot about it and I think that there is no downside to this,
>> > other than sometimes a one extra vmexit after migration.
>> > 
>> > Otherwise there is always a risk of the following scenario:
>> > 
>> >   1. We migrate with nested_run_pending=0 (but don't restore all the state
>> >   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
>> >   or just the guest memory map is not up to date, guest is in smm or something
>> >   like that)
>> > 
>> >   2. Userspace calls some ioctl that causes a nested vmexit
>> > 
>> >   This can happen today if the userspace calls 
>> >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
>> > 
>> >   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
>> >   then calls KVM_RUN
>> > 
>> > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
>> > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
>> > but we have to do so to complete the nested vmexit.
>> 
>> Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
>> only one place which calls copy_vmcs12_to_enlightened():
>> nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
>> vmx_prepare_switch_to_guest() so unless userspace decided to execute
>> not-fully-restored guest this should not happen. I'm probably missing
>> something in your scenario)
> You are right! 
> The evmcs write is delayed to the next vmentry.
>
> However since we are now mapping the evmcs during nested vmexit,
> and this can fail for example that HV assist msr is not up to date.
>
> For example consider this: 
>
> 1. Userspace first sets nested state
> 2. Userspace calls KVM_GET_MP_STATE.
> 3. Nested vmexit that happened in 2 will end up not be able to map the evmcs,
> since HV_ASSIST msr is not yet loaded.
>
>
> Also the vmcb write (that is for SVM) _is_ done right away on nested vmexit 
> and conceptually has the same issue.
> (if memory map is not up to date, we might not be able to read/write the 
> vmcb12 on nested vmexit)
>

It seems we have one correct way to restore a guest and a number of
incorrect ones :-) It may happen that this is not even a nested-only
thing (think about trying to resore caps, regs, msrs, cpuids, in a
random sequence). I'd vote for documenting the right one somewhere, even
if we'll just be extracting it from QEMU.

>
>> 
>> > To some extent, the entry to the nested mode after a migration is only complete
>> > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
>> > 
>> > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
>> > nested vmexit path at all. 
>> 
>> Remember, we have three possible states when nested state is
>> transferred:
>> 1) L2 was running
>> 2) L1 was running
>> 3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).
>
> I understand. This suggestion wasn't meant to fix the case 3, but more to fix
> case 1, where we are in L2, migrate, and then immediately decide to 
> do a nested vmexit before we processed the KVM_REQ_GET_NESTED_STATE_PAGES
> request, and also before potentially before the guest state was fully uploaded
> (see that KVM_GET_MP_STATE thing).
>  
> In a nutshell, I vote for not allowing nested vmexits from the moment
> when we set the nested state and until the moment we enter the nested
> guest once (maybe with request for immediate vmexit),
> because during this time period, the guest state is not fully consistent.
>

Using 'nested_run_pending=1' perhaps? Or, we can get back to 'vm_bugged'
idea and kill the guest immediately if something forces such an exit.

Maxim Levitsky May 27, 2021, 2:11 p.m. UTC | #6

On Thu, 2021-05-27 at 10:01 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <mlevitsk@redhat.com> writes:
> > > 
> > > > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > > > Changes since v1 (Sean):
> > > > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > > > >   copy_enlightened_to_vmcs12().
> > > > > 
> > > > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > > > >  + WSL2) was crashing immediately after migration. It was also reported
> > > > > that we have more issues to fix as, while the failure rate was lowered 
> > > > > signifincatly, it was still possible to observe crashes after several
> > > > > dozens of migration. Turns out, the issue arises when we manage to issue
> > > > > KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
> > > > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> > > > > the flag itself is not part of saved nested state. A few other less 
> > > > > significant issues are fixed along the way.
> > > > > 
> > > > > While there's no proof this series fixes all eVMCS related problems,
> > > > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > > > crashing in testing.
> > > > > 
> > > > > Patches are based on the current kvm/next tree.
> > > > > 
> > > > > Vitaly Kuznetsov (7):
> > > > >   KVM: nVMX: Introduce nested_evmcs_is_used()
> > > > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > > > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > > > >     vmx_get_nested_state()
> > > > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > > > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > > > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > > > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > > > >     lost
> > > > > 
> > > > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
> > > > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
> > > > >  2 files changed, 115 insertions(+), 59 deletions(-)
> > > > > 
> > > > 
> > > > Hi Vitaly!
> > > > 
> > > > In addition to the review of this patch series,
> > > 
> > > Thanks by the way!
> > No problem!
> > 
> > > >  I would like
> > > > to share an idea on how to avoid the hack of mapping the evmcs
> > > > in nested_vmx_vmexit, because I think I found a possible generic
> > > > solution to this and similar issues:
> > > > 
> > > > The solution is to always set nested_run_pending after 
> > > > nested migration (which means that we won't really
> > > > need to migrate this flag anymore).
> > > > 
> > > > I was thinking a lot about it and I think that there is no downside to this,
> > > > other than sometimes a one extra vmexit after migration.
> > > > 
> > > > Otherwise there is always a risk of the following scenario:
> > > > 
> > > >   1. We migrate with nested_run_pending=0 (but don't restore all the state
> > > >   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
> > > >   or just the guest memory map is not up to date, guest is in smm or something
> > > >   like that)
> > > > 
> > > >   2. Userspace calls some ioctl that causes a nested vmexit
> > > > 
> > > >   This can happen today if the userspace calls 
> > > >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > > > 
> > > >   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
> > > >   then calls KVM_RUN
> > > > 
> > > > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > > > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > > > but we have to do so to complete the nested vmexit.
> > > 
> > > Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
> > > only one place which calls copy_vmcs12_to_enlightened():
> > > nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
> > > vmx_prepare_switch_to_guest() so unless userspace decided to execute
> > > not-fully-restored guest this should not happen. I'm probably missing
> > > something in your scenario)
> > You are right! 
> > The evmcs write is delayed to the next vmentry.
> > 
> > However since we are now mapping the evmcs during nested vmexit,
> > and this can fail for example that HV assist msr is not up to date.
> > 
> > For example consider this: 
> > 
> > 1. Userspace first sets nested state
> > 2. Userspace calls KVM_GET_MP_STATE.
> > 3. Nested vmexit that happened in 2 will end up not be able to map the evmcs,
> > since HV_ASSIST msr is not yet loaded.
> > 
> > 
> > Also the vmcb write (that is for SVM) _is_ done right away on nested vmexit 
> > and conceptually has the same issue.
> > (if memory map is not up to date, we might not be able to read/write the 
> > vmcb12 on nested vmexit)
> > 
> 
> It seems we have one correct way to restore a guest and a number of
> incorrect ones :-) It may happen that this is not even a nested-only
> thing (think about trying to resore caps, regs, msrs, cpuids, in a
> random sequence). I'd vote for documenting the right one somewhere, even
> if we'll just be extracting it from QEMU.
> 
> > > > To some extent, the entry to the nested mode after a migration is only complete
> > > > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
> > > > 
> > > > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> > > > nested vmexit path at all. 
> > > 
> > > Remember, we have three possible states when nested state is
> > > transferred:
> > > 1) L2 was running
> > > 2) L1 was running
> > > 3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).
> > 
> > I understand. This suggestion wasn't meant to fix the case 3, but more to fix
> > case 1, where we are in L2, migrate, and then immediately decide to 
> > do a nested vmexit before we processed the KVM_REQ_GET_NESTED_STATE_PAGES
> > request, and also before potentially before the guest state was fully uploaded
> > (see that KVM_GET_MP_STATE thing).
> >  
> > In a nutshell, I vote for not allowing nested vmexits from the moment
> > when we set the nested state and until the moment we enter the nested
> > guest once (maybe with request for immediate vmexit),
> > because during this time period, the guest state is not fully consistent.
> > 
> 
> Using 'nested_run_pending=1' perhaps? Or, we can get back to 'vm_bugged'
> idea and kill the guest immediately if something forces such an exit.

Exactly, this is my idea. Set the nested_run_pending=1 always after the migration
It shoudn't cause any issues and it would avoid cases like that.

That variable can then be renamed too to something like 'nested_vmexit_not_allowed'
or something like that.

Paolo, what do you think?

Best regards,
	Maxim Levitsky

>

Paolo Bonzini May 27, 2021, 2:17 p.m. UTC | #7

On 27/05/21 16:11, Maxim Levitsky wrote:
>> Using 'nested_run_pending=1' perhaps? Or, we can get back to 'vm_bugged'
>> idea and kill the guest immediately if something forces such an exit.
> Exactly, this is my idea. Set the nested_run_pending=1 always after the migration
> It shoudn't cause any issues and it would avoid cases like that.
> 
> That variable can then be renamed too to something like 'nested_vmexit_not_allowed'
> or something like that.
> 
> Paolo, what do you think?

(If it works :)) that's clever.  It can even be set unconditionally on 
the save side and would even work for new->old migration.

Paolo

[v2,0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use

Message

Comments