
Why do we need KVM_REQ_GET_NESTED_STATE_PAGES after all

Message ID fc6bea3249f26e8dd973ce1bd1e3f6f42c142469.camel@redhat.com (mailing list archive)
State New, archived
Series: Why do we need KVM_REQ_GET_NESTED_STATE_PAGES after all

Commit Message

Maxim Levitsky Jan. 27, 2022, 4:03 p.m. UTC
I would like to raise a question about this elephant in the room which I wanted to understand for 
quite a long time.
 
For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
again I am asking myself, maybe we can get rid of this code, after all?
  
And of course if we don't need it, getting rid of it would be a very happy day in my life 
(and likely in the life of all other KVM developers as well).
 
Needless to say, it has already caused a few CVE-worthy issues, which we thankfully patched before
they got to production. On top of that, it gives us a different code flow for a very rare code path,
because nested migration only happens if you are lucky enough to have a nested guest actually
running during the migration, which happens only once in a while, even when L1 is fully loaded.
 
In my testing I always disable HLT exits to ensure that the nested guest runs all the time,
sans vmexits.
 
====================================================================================================
 
So, first of all, what KVM_REQ_GET_NESTED_STATE_PAGES is:
 
KVM_REQ_GET_NESTED_STATE_PAGES exists to delay the reading/writing/pinning of the guest memory that
is pointed to by various fields of the current vmcs/vmcb until the next KVM_RUN, under the
assumption that it is not safe to do so when we perform a non-natural VM entry (which is either the
resume of a nested guest that was interrupted by SMM, or setting the nested state after migration).
 
The alleged reason for that is that either KVM's MMU is not in the 'right state', or that the
userspace VMM will make some modifications to the VM after the entry and before we actually enter
the nested guest.
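 
For reference, this is how the request is serviced upstream today, in vcpu_enter_guest()
(this is the exact hunk that the POC patch at the end of this mail removes):
 
    if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
            if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
                    r = 0;
                    goto out;
            }
    }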
 
====================================================================================================
 
There are two cases where the KVM_REQ_GET_NESTED_STATE_PAGES is used:
 
1. Exit from SMM, back to an interrupted nested guest:
 
Note that on SVM we actually read the VMCB from guest memory and the HSAVE area,
already violating the rule of not reading guest memory.
On VMX, at least, we use the cached VMCS.
 
 
2. Loading the nested state.
 
In this case, indeed, both the vmcb and the vmcs come from the migration stream, thus we don't
touch guest memory when setting the nested state.
 
====================================================================================================
 
Now let's see how guest memory would be accessed on a nested VM entry if we didn't use
KVM_REQ_GET_NESTED_STATE_PAGES:
 
First of all, AFAIK all SVM/VMX memory structures, including the VMCB/VMCS itself, are accessed
by their physical addresses, thus we don't need to worry about the guest's paging,
nor about nested NPT/EPT paging.
 
Shadow pages that the MMU created are also not relevant, because they are not used by KVM itself
to read the guest's memory.
 
To access guest physical memory from KVM the following is done:
 
A. A set of memslots is chosen (by calling __kvm_memslots()).
 
   KVM keeps two sets of memory slots: one for the 'Normal' address space and one for the SMM
   address space.
 
   These memslots are set up by the VMM, which usually updates them when the guest modifies the
   virtual chipset's SMM-related settings.
 
   Note that on the standard Q35/i440fx chipsets which qemu emulates, those settings themselves are
   accessible through pci config space regardless of SMM, but have a lock bit which prevents
   non-SMM code from changing them after the VM's BIOS has set them up and locked them.
 
   Lots of places in KVM are hardcoded to use the 'Normal' memslots
   (everything that uses kvm_memslots, for example).
   Others use kvm_vcpu_memslots, which chooses the memslots based on 'arch.hflags & HF_SMM_MASK'.
 
   Thankfully we don't allow VMX/SVM in SMM (and I really hope that we never will), thus all the
   guest memory structures, including the VMCB/VMCS, reside in main guest memory and thus can be
   reached through the 'Normal' memslots.
 
   From this we can deduce that loading of the nested state will never happen while the
   VCPU is in SMM mode.
   KVM verifies this and denies setting the nested state if that is attempted.
 
   On returning from SMM to a nested guest, kvm_smm_changed is the first thing to be called,
   prior to calling the vendor code which resumes the nested guest.
   kvm_smm_changed clears the HF_SMM_MASK flag, which selects which memslots are used when doing
   guest physical to host virtual translation.
 
   Based on this reasoning, we can say that nested guest entries will never access the
   SMM memslots, even if we access guest memory during the setting of the nested state.
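 
   For reference, a condensed sketch of that selection logic, modeled on the kernel code of this
   era (simplified, not verbatim):
 
       /* simplified sketch, not verbatim kernel code:
        * x86: address space 1 holds the SMM memslots, 0 the 'Normal' ones */
       static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
       {
               return (vcpu->arch.hflags & HF_SMM_MASK) ? 1 : 0;
       }
 
       struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
       {
               return __kvm_memslots(vcpu->kvm, kvm_arch_vcpu_memslots_id(vcpu));
       }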
 
 
B. From the 'Normal' memslots, the host virtual address of the page backing the guest physical
   memory is obtained.
   The page itself might be swapped out, or not present at all if post-copy migration is used.
 
   AFAIK there is no reason why QEMU or any other KVM user would not upload the correct memory
   slot set after a nested migration.
 
   In fact qemu rarely modifies the memory slots. It does have some memslots for devices that
   have RAM-like memory, but if the guest attempts to place a VM-related structure there, it is
   welcome to keep both pieces.
 
   Actual RAM memslots should only be modified at runtime when RAM hotplug or the like happens,
   and even if that happens after a nested migration it would not be an issue, as the fact that
   it happens after the migration means that guest VM structures couldn't be in these memslots.
 
   Qemu resets and loads the state for all the devices, which match the original devices on the
   migration source, and it seems to upload the nested state last.
   It might not migrate all RAM, but rather register it with userfaultfd, but that isn't an
   issue (see below).
 
   For returns from SMM, the chances that this memslot map is not up to date are even lower.
   In fact, returning from SMM usually won't even cause a userspace vmexit to let qemu mess with
   this map.
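 
   A sketch of this gfn -> hva step, modeled on __gfn_to_hva_memslot(): a memslot is just a linear
   mapping of a gfn range onto a userspace virtual address range that the VMM registered:
 
       /* sketch, modeled on __gfn_to_hva_memslot(); the helper name is mine */
       static unsigned long gfn_to_hva_sketch(struct kvm_memory_slot *slot, gfn_t gfn)
       {
               return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
       }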
 
 
C. Once the host virtual address of the guest's page is obtained, stock kernel paging
   facilities are used to bring in the page (swap-in, userfaultfd, etc.) and
   then obtain its physical address.
 
   Once again, on return from SMM there should be no issues.
 
   After a nested migration the page might not be present, but that will lead to either swapping
   it in, or asking qemu to bring it in via the userfaultfd interface while doing post-copy
   migration.
 
   In regard to post-copy migration, other VMMs ought to work the same way (use userfaultfd).
   In theory a VMM could instead handle SIGSEGV as an indication of a page fault, but that would
   not really work, for many reasons.
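 
   A sketch of this hva -> pfn step, roughly what hva_to_pfn() ends up doing on its slow path
   (the helper name here is mine, not the kernel's): get_user_pages() faults the page in through
   the normal kernel paging machinery - swap-in, userfaultfd, and so on:
 
       /* sketch of the slow path of hva_to_pfn(); the helper name is mine */
       static kvm_pfn_t hva_to_pfn_sketch(unsigned long hva, bool write)
       {
               struct page *page;
 
               if (get_user_pages_unlocked(hva, 1, &page,
                                           write ? FOLL_WRITE : 0) != 1)
                       return KVM_PFN_ERR_FAULT;
 
               return page_to_pfn(page);
       }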
   
   
So, having said all that, the only justification for KVM_REQ_GET_NESTED_STATE_PAGES
would be a VMM which first sets the vCPU's nested state and only then sets up the VMAs for guest
memory and uploads them to KVM via memslots.
 
Since nested migration is something relatively new, there is a good chance that nobody does this,
and then requiring that guest memory be present before setting the nested state (or at least the
memory which is pointed to by the VMX/SVM structs) seems like a reasonable requirement.
 
On top of that, it simplifies the API usage by not delaying the error to the next VM run
if there truly is an issue with the guest memory pointed to by the nested state.
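 
In other words, on the migration destination the VMM would simply have to issue the two ioctls in
this order (a minimal userspace sketch; the fds, sizes and the nested-state buffer are
placeholders):
 
    /* sketch: restore memslots first, nested state second */
    struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0,
            .memory_size     = ram_size,
            /* the backing memory itself may still be faulted in later via userfaultfd */
            .userspace_addr  = (unsigned long)ram,
    };
 
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);  /* restore memslots first   */
    ioctl(vcpu_fd, KVM_SET_NESTED_STATE, nested_state); /* only then the nested state */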
 
 
PS: A few technical notes:
 
====================================================================================================
A. Note on eVMCS usage of KVM_REQ_GET_NESTED_STATE_PAGES:
====================================================================================================
 
When an eVMCS is used (an HV-specific PV feature), VMCS fields are loaded from it; the eVMCS is a
guest memory page representing a VM, similar to a VMCB.
 
Thankfully, after a nested migration or a return from SMM, we already have an up-to-date vmcs12,
either read from the migration stream or saved in kernel memory.
 
We only need the guest physical address of this memory page to map and pin it, so that later on,
after the nested entry/exit, we can read/write it.
 
Its address is (indirectly) obtained from the HV_X64_MSR_VP_ASSIST_PAGE msr, which qemu restores
after it sets the nested state.
 
Currently, to support this, setting the nested state with an eVMCS is allowed without this msr
set, and later KVM_REQ_GET_NESTED_STATE_PAGES actually maps the eVMCS page.
 
There is also a workaround, added relatively recently, to map the eVMCS when
KVM_REQ_GET_NESTED_STATE_PAGES hasn't been called after a nested migration, which can happen if
we have a nested VM exit prior to entering the guest even once after setting the nested state.
The workaround was to also map the eVMCS on nested VM exit.
 
IMHO the right way to fix this, assuming that we drop KVM_REQ_GET_NESTED_STATE_PAGES, is to just
map the eVMCS when HV_X64_MSR_VP_ASSIST_PAGE is set by qemu (a host write) after a nested state
with an active eVMCS was set (that is, we are nested but with vmptr == -1).
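 
The attached POC does exactly that, with a hook in kvm_hv_set_msr()'s HV_X64_MSR_VP_ASSIST_PAGE
handler (condensed from the patch at the end of this mail):
 
    if (host && kvm_x86_ops.nested_ops->get_evmcs_page)
            if (kvm_x86_ops.nested_ops->get_evmcs_page(vcpu))
                    return 1;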
 

====================================================================================================
B: Some digital archaeology for reference:
====================================================================================================

First of all, we have these two commits:
 
commit a7d5c7ce41ac1e2537d78ddb57ef0ac4f737aa19
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Tue Sep 22 07:43:14 2020 -0400
 
	KVM: nSVM: delay MSR permission processing to first nested VM run
    
	Allow userspace to set up the memory map after KVM_SET_NESTED_STATE;
	to do so, move the call to nested_svm_vmrun_msrpm inside the
	KVM_REQ_GET_NESTED_STATE_PAGES handler (which is currently
	not used by nSVM).  This is similar to what VMX does already.
    
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
 
commit 729c15c20f1a7c9ad1d09a603ad1aa7fb60b1f88
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Tue Sep 22 06:53:57 2020 -0400
 
	KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES
    
	We are going to use it for SVM too, so use a more generic name.
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
	

These two commits were added around the same time I started fixing SVM's nested migration,
which had just been implemented and hadn't yet received any real-world testing.

SVM's code used to not even read the nested msr bitmap, which I fixed, and there is a good chance
that the above commit was added just in case, to make SVM's code do the same thing as VMX's does.
	

If we go deeper, we get this commit, from which it all started:
 
 
commit 7f7f1ba33cf2c21d001821313088c231db42ff40
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed Jul 18 18:49:01 2018 +0200
 
	KVM: x86: do not load vmcs12 pages while still in SMM
    
	If the vCPU enters system management mode while running a nested guest,
	RSM starts processing the vmentry while still in SMM.  In that case,
	however, the pages pointed to by the vmcs12 might be incorrectly
	loaded from SMRAM.  To avoid this, delay the handling of the pages
	until just before the next vmentry.  This is done with a new request
	and a new entry in kvm_x86_ops, which we will be able to reuse for
	nested VMX state migration.
    
	Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
    
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
 

That commit was part of the initial patch series which implemented nested migration,
and it could very well be that the above commit was just a precaution.
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1740085.html


Looking back at the old code before this commit, I wonder if it was needed even back then:
 
In the original code, just prior to this commit, we had this weird code:
 
                vcpu->arch.hflags &= ~HF_SMM_MASK;
                ret = enter_vmx_non_root_mode(vcpu, NULL);
                vcpu->arch.hflags |= HF_SMM_MASK; <- why do we set it back?
 
 
Yet, it did clear HF_SMM_MASK just prior to the nested entry,
thus the correct memslots should have been accessed even back then, and I don't think that the MSR
bitmap would have been loaded from SMM memory.


====================================================================================================
C: POC
====================================================================================================

I attached the patch which removes KVM_REQ_GET_NESTED_STATE_PAGES, which I lightly tested.
I didn't test the eVMCS side of things. It works for me on SVM and VMX.

I also added debug prints to qemu and used virtio-mem, which was suspected to set memslots
after setting the nested state. It seems that it really doesn't.



Best regards,
   Maxim Levitsky

Comments

Jim Mattson Jan. 27, 2022, 7:39 p.m. UTC | #1
On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> I would like to raise a question about this elephant in the room which I wanted to understand for
> quite a long time.
>
> For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> again I am asking myself, maybe we can get rid of this code, after all?

We (GCE) use it so that, during post-copy, a vCPU thread can exit to
userspace and demand these pages from the source itself, rather than
funneling all demands through a single "demand paging listener"
thread, which I believe is the equivalent of qemu's userfaultfd "fault
handler" thread. Our (internal) post-copy mechanism scales quite well,
because most demand paging requests are triggered by an EPT violation,
which happens to be a convenient place to exit to userspace. Very few
pages are typically demanded as a result of
kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
kernel call stack that it has to request the page via the demand
paging listener thread. With nested virtualization, the various vmcs12
pages consulted directly by kvm (bypassing the EPT tables) were a
scalability issue.

(Note that, unlike upstream, we don't call nested_get_vmcs12_pages
directly from VMLAUNCH/VMRESUME emulation; we always call it as a
result of this request that you don't like.)

As we work on converting from our (hacky) demand paging scheme to
userfaultfd, we will have to solve the scalability issue anyway
(unless someone else beats us to it). Eventually, I expect that our
need for this request will go away.

Honestly, without the exits to userspace, I don't really see how this
request buys you anything upstream. When I originally submitted it, I
was prepared for rejection, but Paolo said that qemu had a similar
need for it, and I happily never questioned that assertion.
Maxim Levitsky Jan. 30, 2022, 2:29 p.m. UTC | #2
On Thu, 2022-01-27 at 11:39 -0800, Jim Mattson wrote:
> On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > I would like to raise a question about this elephant in the room which I wanted to understand for
> > quite a long time.
> > 
> > For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> > again I am asking myself, maybe we can get rid of this code, after all?
> 
> We (GCE) use it so that, during post-copy, a vCPU thread can exit to

Thank you very much for a very detailed response!

> userspace and demand these pages from the source itself, rather than
> funneling all demands through a single "demand paging listener"
That is something I didn't think of!

The question is, however, can that happen between setting the nested state
and running a vCPU for the first time?
 
I guess it is possible in theory that you set the nested state, then run 1 vCPU,
which faults and exits to userspace,
then userspace populates the faulted memory, which triggers a memslots update,
and only then you run another vCPU for the first time.
The pages will be needed when this request is processed, and it will fail if they aren't there.



> thread, which I believe is the equivalent of qemu's userfaultfd "fault
> handler" thread. Our (internal) post-copy mechanism scales quite well,
> because most demand paging requests are triggered by an EPT violation,
> which happens to be a convenient place to exit to userspace. Very few
> pages are typically demanded as a result of
> kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
> kernel call stack that it has to request the page via the demand
> paging listener thread. With nested virtualization, the various vmcs12
> pages consulted directly by kvm (bypassing the EPT tables) were a
> scalability issue.
I assume that you patched all these calls to exit to userspace for the
demand paging scheme you are using.

> 
> (Note that, unlike upstream, we don't call nested_get_vmcs12_pages
> directly from VMLAUNCH/VMRESUME emulation; we always call it as a
> result of this request that you don't like.)
Also I guess something specific to your downstream patches.


> 
> As we work on converting from our (hacky) demand paging scheme to
> userfaultfd, we will have to solve the scalability issue anyway
> (unless someone else beats us to it). Eventually, I expect that our
> need for this request will go away.

Great!

The question is, if we remove it now, will that affect you?

What if we deprecate it (add an option to keep the current behavior,
but keep a module param to revert back to the old behavior), with
the eventual goal of removing it?


> 
> Honestly, without the exits to userspace, I don't really see how this
> request buys you anything upstream. When I originally submitted it, I
> was prepared for rejection, but Paolo said that qemu had a similar
> need for it, and I happily never questioned that assertion.

Exactly! I didn't question it either, because I didn't know the MMU at all,
and it is one of the hardest KVM parts - who knows what it caches,
and what magic it needs to stay up to date.
 
But now, I don't think there are still large areas of the MMU that I don't understand,
thus I started asking myself why it is needed.

That request is a ripe source of bugs. Just off the top of my head: Vitaly spent at least a week
understanding why, after the vmcb01/02 split, eVMCS stopped working, only to figure out that
KVM_REQ_GET_NESTED_STATE_PAGES might not be called after a nested entry, since there
could be a nested VM exit before we even enter the guest, and since then one more
hack had to be added to work around that (nothing against the hack, it's not the
root cause of the problem).
 
I also indirectly caused and fixed a CVE-like issue, which started
with the patch that added KVM_REQ_GET_NESTED_STATE_PAGES to SVM -
it made KVM switch to the nested msr bitmap,
and back then there was no vmcb01/02 split. The problem is that back then
we didn't cancel that request if we had a VM exit right after VM entry,
so it would still be pending on VM entry, and it would switch to the nested MSR bitmap
even if we were no longer nested - then I added a patch to free the nested state
on demand, and boom - we had L1 using a freed (and usually zeroed) MSR bitmap - free
access to all host msrs from L1...
 
There is another hidden issue with this request: it makes it impossible to
handle failure gracefully.
 
If, for example, loading the nested state pages needs to allocate memory and that
fails, we could just fail the nested VM entry if that request wasn't there.
While this is not ideal for the nested guest, it is likely to survive, and
might even retry entering the nested guest.
 
On the other hand, if something fails during the request, the nested state is
already loaded, and all we can do is kill the L1.


Thanks again!
Best regards,
	Maxim Levitsky

>
Jim Mattson Jan. 30, 2022, 11:46 p.m. UTC | #3
On Sun, Jan 30, 2022 at 6:29 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Thu, 2022-01-27 at 11:39 -0800, Jim Mattson wrote:
> > On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > I would like to raise a question about this elephant in the room which I wanted to understand for
> > > quite a long time.
> > >
> > > For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> > > again I am asking myself, maybe we can get rid of this code, after all?
> >
> > We (GCE) use it so that, during post-copy, a vCPU thread can exit to
>
> Thank you very much for a very detailed response!
>
> > userspace and demand these pages from the source itself, rather than
> > funneling all demands through a single "demand paging listener"
> That is something I didn't think of!
>
> The question is, however, can that happen between setting the nested state
> and running a vCPU for the first time?
>
> I guess it is possible in theory that you set the nested state, then run 1 vCPU,
> which faults and exits to userspace,
> then userspace populates the faulted memory, which triggers a memslots update,
> and only then you run another vCPU for the first time.
> The pages will be needed when this request is processed, and it will fail if they aren't there.

That isn't quite how our post-copy works.  There is no memslots
update. When post-copy starts, the source sends a bitmap of dirty
pages to the destination. This bitmap is installed in kvm via a
Google-local ioctl. Then, all vCPUs on the destination start running,
while the source pushes dirty pages to the destination. As EPT
violations occur for cold mappings at the destination, we check the
bitmap to see if we need to demand the page from the source
out-of-band. We know that the page will eventually come across, but we
need it *now*, so that the vCPU thread that incurred the EPT violation
can resume. If the page is in the dirty bitmap, the vCPU thread that
incurred the EPT violation will exit to userspace and request the page
from the source. When the page arrives, we overwrite the stale page in
the memslot and clear the bit in the dirty page bitmap.

For cases where kvm needs a dirty page to do software page walks or
instruction emulation or whatever, the vCPU thread sends a request to
the "demand page listener" thread via a netlink socket and then
blocks, waiting for the bit to clear in the dirty bitmap. For various
reasons, including the serialization on the listener thread and the
unreliability of the netlink socket, we didn't want to have all of the
vmcs12 pages use the netlink request path. (This concern may be
largely moot, since modification of most pages hanging off of the VMCS
while the (v)CPU is in VMX non-root mode leads to undefined behavior.)

>
> > thread, which I believe is the equivalent of qemu's userfaultfd "fault
> > handler" thread. Our (internal) post-copy mechanism scales quite well,
> > because most demand paging requests are triggered by an EPT violation,
> > which happens to be a convenient place to exit to userspace. Very few
> > pages are typically demanded as a result of
> > kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
> > kernel call stack that it has to request the page via the demand
> > paging listener thread. With nested virtualization, the various vmcs12
> > pages consulted directly by kvm (bypassing the EPT tables) were a
> > scalability issue.
> I assume that you patched all these calls to exit to userspace for the
> demand paging scheme you are using.
>
> >
> > (Note that, unlike upstream, we don't call nested_get_vmcs12_pages
> > directly from VMLAUNCH/VMRESUME emulation; we always call it as a
> > result of this request that you don't like.)
> Also I guess something specific to your downstream patches.
>
>
> >
> > As we work on converting from our (hacky) demand paging scheme to
> > userfaultfd, we will have to solve the scalability issue anyway
> > (unless someone else beats us to it). Eventually, I expect that our
> > need for this request will go away.
>
> Great!
>
> The question is, if we remove it now, will that affect you?

No. By the time we get to 5.17, I don't expect any of our current
post-copy scheme to survive.

> What if we deprecate it (add an option to keep the current behavior,
> but keep a module param to revert back to the old behavior), with
> the eventual goal of removing it?

Don't bother on our account. :-)
>
> >
> > Honestly, without the exits to userspace, I don't really see how this
> > request buys you anything upstream. When I originally submitted it, I
> > was prepared for rejection, but Paolo said that qemu had a similar
> > need for it, and I happily never questioned that assertion.
>
> Exactly! I didn't question it either, because I didn't know the MMU at all,
> and it is one of the hardest KVM parts - who knows what it caches,
> and what magic it needs to stay up to date.

The other question, which may have been what motivated Paolo to keep
the GET_NESTED_VMCS12_PAGES request, is whether or not the memslots
have been established before userspace calls the ioctl for
SET_NESTED_STATE. If not, it is impossible to do any GPA->HPA lookups
for VMCS12 fields that are necessary for the vCPU to enter VMX
non-root mode as part of SET_NESTED_STATE. The only field I can think
of off the top of my head is the VMCS12 APIC-access address, which,
sadly, has to be backed by an L1 memslot, even though the page is
never accessed. (VMware always sets the APIC-access address to 0, for
all VMCSs, at L1 or L2. That makes things a lot easier.) If the
memslots are established before SET_NESTED_STATE, this isn't a
problem, but I don't recall that constraint being documented anywhere.

> But now, I don't think there are still large areas of the MMU that I don't understand,
> thus I started asking myself why it is needed.
>
> That request is a ripe source of bugs. Just off the top of my head: Vitaly spent at least a week
> understanding why, after the vmcb01/02 split, eVMCS stopped working, only to figure out that
> KVM_REQ_GET_NESTED_STATE_PAGES might not be called after a nested entry, since there
> could be a nested VM exit before we even enter the guest, and since then one more
> hack had to be added to work around that (nothing against the hack, it's not the
> root cause of the problem).
>
> I also indirectly caused and fixed a CVE-like issue, which started
> with the patch that added KVM_REQ_GET_NESTED_STATE_PAGES to SVM -
> it made KVM switch to the nested msr bitmap,
> and back then there was no vmcb01/02 split. The problem is that back then
> we didn't cancel that request if we had a VM exit right after VM entry,
> so it would still be pending on VM entry, and it would switch to the nested MSR bitmap
> even if we were no longer nested - then I added a patch to free the nested state
> on demand, and boom - we had L1 using a freed (and usually zeroed) MSR bitmap - free
> access to all host msrs from L1...
>
> There is another hidden issue with this request: it makes it impossible to
> handle failure gracefully.
>
> If, for example, loading the nested state pages needs to allocate memory and that
> fails, we could just fail the nested VM entry if that request wasn't there.
> While this is not ideal for the nested guest, it is likely to survive, and
> might even retry entering the nested guest.
>
> On the other hand, if something fails during the request, the nested state is
> already loaded, and all we can do is kill the L1.

I apologize for all of that pain.

>
> Thanks again!
> Best regards,
>         Maxim Levitsky
>
> >
>
>
>
>
Maxim Levitsky Jan. 31, 2022, 10:31 a.m. UTC | #4
On Sun, 2022-01-30 at 15:46 -0800, Jim Mattson wrote:
> On Sun, Jan 30, 2022 at 6:29 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Thu, 2022-01-27 at 11:39 -0800, Jim Mattson wrote:
> > > On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > I would like to raise a question about this elephant in the room which I wanted to understand for
> > > > quite a long time.
> > > > 
> > > > For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> > > > again I am asking myself, maybe we can get rid of this code, after all?
> > > 
> > > We (GCE) use it so that, during post-copy, a vCPU thread can exit to
> > 
> > Thank you very much for a very detailed response!
> > 
> > > userspace and demand these pages from the source itself, rather than
> > > funneling all demands through a single "demand paging listener"
> > That is something I didn't think of!
> > 
> > The question is, however, can that happen between setting the nested state
> > and running a vCPU for the first time?
> >
> > I guess it is possible in theory that you set the nested state, then run 1 vCPU,
> > which faults and exits to userspace,
> > then userspace populates the faulted memory, which triggers a memslots update,
> > and only then you run another vCPU for the first time.
> > The pages will be needed when this request is processed, and it will fail if they aren't there.
> 
> That isn't quite how our post-copy works.  There is no memslots
> update. When post-copy starts, the source sends a bitmap of dirty
> pages to the destination. This bitmap is installed in kvm via a
> Google-local ioctl. Then, all vCPUs on the destination start running,
> while the source pushes dirty pages to the destination. As EPT
> violations occur for cold mappings at the destination, we check the
> bitmap to see if we need to demand the page from the source
> out-of-band. We know that the page will eventually come across, but we
> need it *now*, so that the vCPU thread that incurred the EPT violation
> can resume. If the page is in the dirty bitmap, the vCPU thread that
> incurred the EPT violation will exit to userspace and request the page
> from the source. When the page arrives, we overwrite the stale page in
> the memslot and clear the bit in the dirty page bitmap.
> 
> For cases where kvm needs a dirty page to do software page walks or
> instruction emulation or whatever, the vCPU thread sends a request to
> the "demand page listener" thread via a netlink socket and then
> blocks, waiting for the bit to clear in the dirty bitmap. For various
> reasons, including the serialization on the listener thread and the
> unreliability of the netlink socket, we didn't want to have all of the
> vmcs12 pages use the netlink request path. (This concern may be
> largely moot, since modification of most pages hanging off of the VMCS
> while the (v)CPU is in VMX non-root mode leads to undefined behavior.)
> 
> > > thread, which I believe is the equivalent of qemu's userfaultfd "fault
> > > handler" thread. Our (internal) post-copy mechanism scales quite well,
> > > because most demand paging requests are triggered by an EPT violation,
> > > which happens to be a convenient place to exit to userspace. Very few
> > > pages are typically demanded as a result of
> > > kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
> > > kernel call stack that it has to request the page via the demand
> > > paging listener thread. With nested virtualization, the various vmcs12
> > > pages consulted directly by kvm (bypassing the EPT tables) were a
> > > scalability issue.
> > I assume that you patched all these calls to exit to userspace for the
> > demand paging scheme you are using.
> > 
> > > (Note that, unlike upstream, we don't call nested_get_vmcs12_pages
> > > directly from VMLAUNCH/VMRESUME emulation; we always call it as a
> > > result of this request that you don't like.)
> > Also I guess something specific to your downstream patches.
> > 
> > 
> > > As we work on converting from our (hacky) demand paging scheme to
> > > userfaultfd, we will have to solve the scalability issue anyway
> > > (unless someone else beats us to it). Eventually, I expect that our
> > > need for this request will go away.
> > 
> > Great!
> > 
> > The question is, if we remove it now, will that affect you?
> 
> No. By the time we get to 5.17, I don't expect any of our current
> post-copy scheme to survive.
> 
> > What if we deprecate it (add an option to keep the current behavior,
> > but keep a module param to revert back to the old behavior), with
> > the eventual goal of removing it?
> 
> Don't bother on our account. :-)
> > > Honestly, without the exits to userspace, I don't really see how this
> > > request buys you anything upstream. When I originally submitted it, I
> > > was prepared for rejection, but Paolo said that qemu had a similar
> > > need for it, and I happily never questioned that assertion.
> > 
> > Exactly! I didn't question it either, because I didn't know the MMU at all,
> > and it is one of the hardest KVM parts - who knows what it caches,
> > and what magic it needs to stay up to date.
> 
> The other question, which may have been what motivated Paolo to keep
> the GET_NESTED_VMCS12_PAGES request, is whether or not the memslots
> have been established before userspace calls the ioctl for
> SET_NESTED_STATE. If not, it is impossible to do any GPA->HPA lookups
> for VMCS12 fields that are necessary for the vCPU to enter VMX
> non-root mode as part of SET_NESTED_STATE. The only field I can think
> of off the top of my head is the VMCS12 APIC-access address, which,
> sadly, has to be backed by an L1 memslot, even though the page is
> never accessed. (VMware always sets the APIC-access address to 0, for
> all VMCSs, at L1 or L2. That makes things a lot easier.) If the
> memslots are established before SET_NESTED_STATE, this isn't a
> problem, but I don't recall that constraint being documented anywhere.

This is exactly the only reason for KVM_REQ_GET_NESTED_STATE_PAGES and what it does.
The only thing KVM_REQ_GET_NESTED_STATE_PAGES buys us is exactly this - the ability to
set memslots after setting the nested state.
 
This is indeed not documented, but there is hope that no userspace relies on this
(the opposite is also not documented).
 
Qemu AFAIK doesn't rely on this - I can't be 100% sure because the code is complex and I
could have missed some corner case like that virtio-mem case Paolo mentioned to me
(I checked it, and there seems to be no reason for it to set memslots after migration).
 
Practically speaking, memslots can be updated at any moment, but there is no reason to have
different memslots between sending the migration state and restoring it. As soon as restore of
the migration state is complete, yes, there could be memslot modifications, in theory even before
the VM runs, but these modifications can't make memory that a nested VMCS/VMCB references
available again, which is the only thing that KVM_REQ_GET_NESTED_STATE_PAGES buys us.
 
TL;DR - no matter how userspace modifies the memslots after the migration,
it must restore the memslots that contain pages that are referenced by the nested state,
exactly as is.

The only change to userspace's expected behavior from removing KVM_REQ_GET_NESTED_STATE_PAGES
would be that userspace would have to restore those memslots prior to setting the nested state
(but not the memory itself contained within, as it can be brought in later via userfaultfd).


About the referenced pages:
In addition to the apic access page, there are more pages that are referenced from the vmcb/vmcs:
there is the apic backing page for both AVIC/APICv, which we map and pass through to the guest,
there is the PIR descriptor for APICv, there are the msr bitmap and the io bitmap,
and there are the msr load/store lists on VMX (and probably more).


> 
> > But now, I don't think there are still large areas of the MMU that I don't understand,
> > thus I started asking myself why it is needed.
> > 
> > That request is a ripe source of bugs. Just off the top of my head: Vitaly spent at least a week
> > understanding why, after the vmcb01/02 split, eVMCS stopped working, only to figure out that
> > KVM_REQ_GET_NESTED_STATE_PAGES might not be called after a nested entry, since there
> > could be a nested VM exit before we even enter the guest, and since then one more
> > hack had to be added to work around that (nothing against the hack, it's not the
> > root cause of the problem).
> > 
> > I also indirectly caused and fixed a CVE-like issue, which started
> > with the patch that added KVM_REQ_GET_NESTED_STATE_PAGES to SVM -
> > it made KVM switch to the nested msr bitmap,
> > and back then there was no vmcb01/02 split. The problem is that back then
> > we didn't cancel that request if we had a VM exit right after VM entry,
> > so it would still be pending on VM entry, and it would switch to the nested MSR bitmap
> > even if we were no longer nested - then I added a patch to free the nested state
> > on demand, and boom - we had L1 using a freed (and usually zeroed) MSR bitmap - free
> > access to all host msrs from L1...
> > 
> > There is another hidden issue with this request: it makes it impossible to
> > handle failure gracefully.
> > 
> > If, for example, loading the nested state pages needs to allocate memory and that
> > fails, we could just fail the nested VM entry if that request wasn't there.
> > While this is not ideal for the nested guest, it is likely to survive, and
> > might even retry entering the nested guest.
> > 
> > On the other hand, if something fails during the request, the nested state is
> > already loaded, and all we can do is kill the L1.
> 
> I apologize for all of that pain.

No problem. IMHO the request was created out of an abundance of caution
(which I understand very well), and back then nobody could predict the issues it would create,
but as it stands now, it might be better to remove it than keep adding more code to the mess.

I btw tested the attached patch a bit more and it seems to survive so far.

Best regards,
	Maxim Levitsky


> 
> > Thanks again!
> > Best regards,
> >         Maxim Levitsky
> > 
> > 
> > 
> >
Jim Mattson Jan. 31, 2022, 5:37 p.m. UTC | #5
On Mon, Jan 31, 2022 at 2:32 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Sun, 2022-01-30 at 15:46 -0800, Jim Mattson wrote:
> > On Sun, Jan 30, 2022 at 6:29 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > On Thu, 2022-01-27 at 11:39 -0800, Jim Mattson wrote:
> > > > On Thu, Jan 27, 2022 at 8:04 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > > I would like to raise a question about this elephant in the room which I wanted to understand for
> > > > > quite a long time.
> > > > >
> > > > > For my nested AVIC work I once again need to change the KVM_REQ_GET_NESTED_STATE_PAGES code and once
> > > > > again I am asking myself, maybe we can get rid of this code, after all?
> > > >
> > > > We (GCE) use it so that, during post-copy, a vCPU thread can exit to
> > >
> > > Thank you very much for a very detailed response!
> > >
> > > > userspace and demand these pages from the source itself, rather than
> > > > funneling all demands through a single "demand paging listener"
> > > That is something I didn't think of!
> > >
> > > The question is, however, can that happen between setting the nested state
> > > and running a vCPU for the first time?
> > >
> > > I guess it is possible in theory that you set the nested state, then run 1 vCPU,
> > > which faults and exits to userspace,
> > > then userspace populates the faulted memory, which triggers a memslots update,
> > > and only then you run another vCPU for the first time.
> > > The pages will be needed when this request is processed, and it will fail if they aren't there.
> >
> > That isn't quite how our post-copy works.  There is no memslots
> > update. When post-copy starts, the source sends a bitmap of dirty
> > pages to the destination. This bitmap is installed in kvm via a
> > Google-local ioctl. Then, all vCPUs on the destination start running,
> > while the source pushes dirty pages to the destination. As EPT
> > violations occur for cold mappings at the destination, we check the
> > bitmap to see if we need to demand the page from the source
> > out-of-band. We know that the page will eventually come across, but we
> > need it *now*, so that the vCPU thread that incurred the EPT violation
> > can resume. If the page is in the dirty bitmap, the vCPU thread that
> > incurred the EPT violation will exit to userspace and request the page
> > from the source. When the page arrives, we overwrite the stale page in
> > the memslot and clear the bit in the dirty page bitmap.
> >
> > For cases where kvm needs a dirty page to do software page walks or
> > instruction emulation or whatever, the vCPU thread sends a request to
> > the "demand page listener" thread via a netlink socket and then
> > blocks, waiting for the bit to clear in the dirty bitmap. For various
> > reasons, including the serialization on the listener thread and the
> > unreliability of the netlink socket, we didn't want to have all of the
> > vmcs12 pages use the netlink request path. (This concern may be
> > largely moot, since modification of most pages hanging off of the VMCS
> > while the (v)CPU is in VMX non-root mode leads to undefined behavior.)
> >
> > > > thread, which I believe is the equivalent of qemu's userfaultfd "fault
> > > > handler" thread. Our (internal) post-copy mechanism scales quite well,
> > > > because most demand paging requests are triggered by an EPT violation,
> > > > which happens to be a convenient place to exit to userspace. Very few
> > > > pages are typically demanded as a result of
> > > > kvm_vcpu_{read,write}_guest, where the vCPU thread is so deep in the
> > > > kernel call stack that it has to request the page via the demand
> > > > paging listener thread. With nested virtualization, the various vmcs12
> > > > pages consulted directly by kvm (bypassing the EPT tables) were a
> > > > scalability issue.
> > > I assume that you patched all these calls to exit to userspace for the
> > > demand paging scheme you are using.
> > >
> > > > (Note that, unlike upstream, we don't call nested_get_vmcs12_pages
> > > > directly from VMLAUNCH/VMRESUME emulation; we always call it as a
> > > > result of this request that you don't like.)
> > > Also I guess something specific to your downstream patches.
> > >
> > >
> > > > As we work on converting from our (hacky) demand paging scheme to
> > > > userfaultfd, we will have to solve the scalability issue anyway
> > > > (unless someone else beats us to it). Eventually, I expect that our
> > > > need for this request will go away.
> > >
> > > Great!
> > >
> > > The question is, if we remove it now, will that affect you?
> >
> > No. By the time we get to 5.17, I don't expect any of our current
> > post-copy scheme to survive.
> >
> > > What if we deprecate it (add an option to keep the current behavior,
> > > but keep a module param to revert back to the old behavior), with
> > > the eventual goal of removing it?
> >
> > Don't bother on our account. :-)
> > > > Honestly, without the exits to userspace, I don't really see how this
> > > > request buys you anything upstream. When I originally submitted it, I
> > > > was prepared for rejection, but Paolo said that qemu had a similar
> > > > need for it, and I happily never questioned that assertion.
> > >
> > > Exactly! I didn't question it either, because I didn't know the MMU at all,
> > > and it is one of the hardest KVM parts - who knows what it caches,
> > > and what magic it needs to stay up to date.
> >
> > The other question, which may have been what motivated Paolo to keep
> > the GET_NESTED_VMCS12_PAGES request, is whether or not the memslots
> > have been established before userspace calls the ioctl for
> > SET_NESTED_STATE. If not, it is impossible to do any GPA->HPA lookups
> > for VMCS12 fields that are necessary for the vCPU to enter VMX
> > non-root mode as part of SET_NESTED_STATE. The only field I can think
> > of off the top of my head is the VMCS12 APIC-access address, which,
> > sadly, has to be backed by an L1 memslot, even though the page is
> > never accessed. (VMware always sets the APIC-access address to 0, for
> > all VMCSs, at L1 or L2. That makes things a lot easier.) If the
> > memslots are established before SET_NESTED_STATE, this isn't a
> > problem, but I don't recall that constraint being documented anywhere.
>
> This is exactly the only reason for KVM_REQ_GET_NESTED_STATE_PAGES and what it does.
> The only thing KVM_REQ_GET_NESTED_STATE_PAGES buys us is exactly this - the ability to
> set memslots after setting the nested state.
>
> This is indeed not documented, but there is hope that no userspace relies on this
> (the opposite is also not documented).
>
> Qemu AFAIK doesn't rely on this - I can't be 100% sure because the code is complex and I
> could have missed some corner case like that virtio-mem case Paolo mentioned to me
> (I checked it, and there seems to be no reason for it to set memslots after migration).
>
> Practically speaking, memslots can be updated at any moment, but there is no reason to have
> different memslots between sending the migration state and restoring it. As soon as restore of
> the migration state is complete, yes, there could be memslot modifications, in theory even
> before the VM runs, but these modifications can't make memory that a nested VMCS/VMCB references
> available again, which is the only thing that KVM_REQ_GET_NESTED_STATE_PAGES buys us.
>
> TL;DR - no matter how userspace modifies the memslots after the migration,
> it must restore the memslots that contain pages that are referenced by the nested state,
> exactly as is.
>
> The only change to userspace's expected behavior from removing KVM_REQ_GET_NESTED_STATE_PAGES
> would be that userspace would have to restore those memslots prior to setting the nested state
> (but not the memory itself contained within, as it can be brought in later via userfaultfd).
>
>
> About the referenced pages:
> In addition to the apic access page, there are more pages that are referenced from the vmcb/vmcs:
> there is the apic backing page for both AVIC/APICv, which we map and pass through to the guest,
> there is the PIR descriptor for APICv, there are the msr bitmap and the io bitmap,
> and there are the msr load/store lists on VMX (and probably more).

Pages that are only accessed on VM-entry or VM-exit are not a problem.
The pages that could be trouble are the ones used to construct a
vmc[sb]02 for an L2 VM that's in VMX non-root mode (guest mode) at the
time of migration. So, for instance, the vmc[sb]12 I/O bitmaps are no
trouble, because we're going to configure the vmc[sb]02 to exit on any
port access. However, the vmc[sb]12 MSR bitmap(s) could be a problem,
since they are used to construct the vmc[sb]02 MSR bitmap(s). Also a
potential problem are L1 pages that are directly accessed from the
vmc[sb]02, such as the virtual APIC page.

>
> >
> > > But now, I don't think there are still large areas of the MMU that I don't understand,
> > > thus I started asking myself why it is needed.
> > >
> > > That request is a ripe source of bugs. Just off the top of my head: Vitaly spent at least a week
> > > understanding why, after the vmcb01/02 split, eVMCS stopped working, only to figure out that
> > > KVM_REQ_GET_NESTED_STATE_PAGES might not be called after a nested entry, since there
> > > could be a nested VM exit before we even enter the guest, and since then one more
> > > hack had to be added to work around that (nothing against the hack, it's not the
> > > root cause of the problem).
> > >
> > > I also indirectly caused and fixed a CVE-like issue, which started
> > > with the patch that added KVM_REQ_GET_NESTED_STATE_PAGES to SVM -
> > > it made KVM switch to the nested msr bitmap,
> > > and back then there was no vmcb01/02 split. The problem is that back then
> > > we didn't cancel that request if we had a VM exit right after VM entry,
> > > so it would still be pending on VM entry, and it would switch to the nested MSR bitmap
> > > even if we were no longer nested - then I added a patch to free the nested state
> > > on demand, and boom - we had L1 using a freed (and usually zeroed) MSR bitmap - free
> > > access to all host msrs from L1...
> > >
> > > There is another hidden issue with this request: it makes it impossible to
> > > handle failure gracefully.
> > >
> > > If, for example, loading the nested state pages needs to allocate memory and that
> > > fails, we could just fail the nested VM entry if that request wasn't there.
> > > While this is not ideal for the nested guest, it is likely to survive, and
> > > might even retry entering the nested guest.
> > >
> > > On the other hand, if something fails during the request, the nested state is
> > > already loaded, and all we can do is kill the L1.
> >
> > I apologize for all of that pain.
>
> No problem. IMHO the request was created out of an abundance of caution
> (which I understand very well), and back then nobody could predict the issues it would create,
> but as it stands now, it might be better to remove it than keep adding more code to the mess.
>
> I btw tested the attached patch a bit more and it seems to survive so far.

To be reasonably certain that you do not need this code, you should
test migration of active L2 VMs with both pre-copy and background
post-copy disabled. In other words, the only way to get a page
transferred from the source to the destination is by faulting it in on
the destination.

> Best regards,
>         Maxim Levitsky
>
>
> >
> > > Thanks again!
> > > Best regards,
> > >         Maxim Levitsky
> > >
> > >
> > >
> > >
>
>

Patch

From 129302dceb659c9d561676faaf33cb3ec64d02c5 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk@redhat.com>
Date: Wed, 26 Jan 2022 17:20:18 +0200
Subject: [PATCH] KVM: x86: get rid of KVM_REQ_GET_NESTED_STATE_PAGES

TODO: add commit description

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +-
 arch/x86/kvm/hyperv.c           |  4 ++
 arch/x86/kvm/svm/nested.c       | 51 ++++-------------
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/svm/svm.h          |  3 +-
 arch/x86/kvm/vmx/nested.c       | 99 +++++++++------------------------
 arch/x86/kvm/x86.c              |  6 --
 7 files changed, 46 insertions(+), 124 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 82bc3e3e9b935..5a58835b5af0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -92,7 +92,6 @@ 
 #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
 #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
 #define KVM_REQ_LOAD_EOI_EXITMAP	KVM_ARCH_REQ(23)
-#define KVM_REQ_GET_NESTED_STATE_PAGES	KVM_ARCH_REQ(24)
 #define KVM_REQ_APICV_UPDATE \
 	KVM_ARCH_REQ_FLAGS(25, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_TLB_FLUSH_CURRENT	KVM_ARCH_REQ(26)
@@ -1502,12 +1501,14 @@  struct kvm_x86_nested_ops {
 	int (*set_state)(struct kvm_vcpu *vcpu,
 			 struct kvm_nested_state __user *user_kvm_nested_state,
 			 struct kvm_nested_state *kvm_state);
-	bool (*get_nested_state_pages)(struct kvm_vcpu *vcpu);
 	int (*write_log_dirty)(struct kvm_vcpu *vcpu, gpa_t l2_gpa);
 
 	int (*enable_evmcs)(struct kvm_vcpu *vcpu,
 			    uint16_t *vmcs_version);
 	uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
+
+	bool (*get_evmcs_page)(struct kvm_vcpu *vcpu);
+
 };
 
 struct kvm_x86_init_ops {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6e38a7d22e97a..c15886efa9a66 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1494,6 +1494,10 @@  static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
 					    gfn_to_gpa(gfn) | KVM_MSR_ENABLED,
 					    sizeof(struct hv_vp_assist_page)))
 			return 1;
+
+		if (host && kvm_x86_ops.nested_ops->get_evmcs_page)
+			if (kvm_x86_ops.nested_ops->get_evmcs_page(vcpu))
+				return 1;
 		break;
 	}
 	case HV_X64_MSR_EOI:
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index edd6cf1344112..66fa2478d89f6 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -677,7 +677,7 @@  static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
 }
 
 int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
-			 struct vmcb *vmcb12, bool from_vmrun)
+			 struct vmcb *vmcb12)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	int ret;
@@ -707,15 +707,13 @@  int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 	nested_vmcb02_prepare_save(svm, vmcb12);
 
 	ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
-				  nested_npt_enabled(svm), from_vmrun);
+				  nested_npt_enabled(svm), true);
 	if (ret)
 		return ret;
 
 	if (!npt_enabled)
 		vcpu->arch.mmu->inject_page_fault = svm_inject_page_fault_nested;
 
-	if (!from_vmrun)
-		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 
 	svm_set_gif(svm, true);
 
@@ -783,7 +781,7 @@  int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 
 	svm->nested.nested_run_pending = 1;
 
-	if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
+	if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12))
 		goto out_exit_err;
 
 	if (nested_svm_vmrun_msrpm(svm))
@@ -867,8 +865,6 @@  int nested_svm_vmexit(struct vcpu_svm *svm)
 	svm->nested.vmcb12_gpa = 0;
 	WARN_ON_ONCE(svm->nested.nested_run_pending);
 
-	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
-
 	/* in case we halted in L2 */
 	svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
@@ -1066,8 +1062,6 @@  void svm_leave_nested(struct vcpu_svm *svm)
 		nested_svm_uninit_mmu_context(vcpu);
 		vmcb_mark_all_dirty(svm->vmcb);
 	}
-
-	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 }
 
 static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
@@ -1541,11 +1535,11 @@  static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	 */
 
 	ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
-				  nested_npt_enabled(svm), false);
+				  nested_npt_enabled(svm),
+				  !vcpu->arch.pdptrs_from_userspace);
 	if (WARN_ON_ONCE(ret))
 		goto out_free;
 
-
 	/*
 	 * All checks done, we can enter guest mode. Userspace provides
 	 * vmcb12.control, which will be combined with L1 and stored into
@@ -1570,32 +1564,6 @@  static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 
 	svm_switch_vmcb(svm, &svm->nested.vmcb02);
 	nested_vmcb02_prepare_control(svm);
-	kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
-
-	ret = 0;
-out_free:
-	kfree(save);
-	kfree(ctl);
-
-	return ret;
-}
-
-static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	if (WARN_ON(!is_guest_mode(vcpu)))
-		return true;
-
-	if (!vcpu->arch.pdptrs_from_userspace &&
-	    !nested_npt_enabled(svm) && is_pae_paging(vcpu))
-		/*
-		 * Reload the guest's PDPTRs since after a migration
-		 * the guest CR3 might be restored prior to setting the nested
-		 * state which can lead to a load of wrong PDPTRs.
-		 */
-		if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
-			return false;
 
 	if (!nested_svm_vmrun_msrpm(svm)) {
 		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -1605,13 +1573,18 @@  static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
 		return false;
 	}
 
-	return true;
+	ret = 0;
+out_free:
+	kfree(save);
+	kfree(ctl);
+
+	return ret;
 }
 
+
 struct kvm_x86_nested_ops svm_nested_ops = {
 	.check_events = svm_check_nested_events,
 	.triple_fault = nested_svm_triple_fault,
-	.get_nested_state_pages = svm_get_nested_state_pages,
 	.get_state = svm_get_nested_state,
 	.set_state = svm_set_nested_state,
 };
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0e6ad19d205c9..00a9396f726b9 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4334,7 +4334,7 @@  static int svm_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 	vmcb12 = map.hva;
 	nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
 	nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
-	ret = enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, false);
+	ret = enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12);
 
 	if (ret)
 		goto unmap_save;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 3acef0dfc9b94..5979f41632756 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -552,8 +552,7 @@  static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
 	return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
 }
 
-int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
-			 u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
+int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb_gpa, struct vmcb *vmcb12);
 void svm_leave_nested(struct vcpu_svm *svm);
 void svm_free_nested(struct vcpu_svm *svm);
 int svm_allocate_nested(struct vcpu_svm *svm);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 5a2595a6bf08c..cb259b58a9110 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -293,8 +293,6 @@  static void free_nested(struct kvm_vcpu *vcpu)
 	if (!vmx->nested.vmxon && !vmx->nested.smm.vmxon)
 		return;
 
-	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
-
 	vmx->nested.vmxon = false;
 	vmx->nested.smm.vmxon = false;
 	vmx->nested.vmxon_ptr = INVALID_GPA;
@@ -2592,7 +2590,8 @@  static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 
 	/* Shadow page tables on either EPT or shadow page tables. */
 	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
-				from_vmentry, entry_failure_code))
+				from_vmentry || !vcpu->arch.pdptrs_from_userspace,
+				entry_failure_code))
 		return -EINVAL;
 
 	/*
@@ -3124,7 +3123,7 @@  static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
+bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -3160,18 +3159,6 @@  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	struct page *page;
 	u64 hpa;
 
-	if (!vcpu->arch.pdptrs_from_userspace &&
-	    !nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) {
-		/*
-		 * Reload the guest's PDPTRs since after a migration
-		 * the guest CR3 might be restored prior to setting the nested
-		 * state which can lead to a load of wrong PDPTRs.
-		 */
-		if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
-			return false;
-	}
-
-
 	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
 		/*
 		 * Translate L1 physical address to host physical
@@ -3253,25 +3240,6 @@  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-	if (!nested_get_evmcs_page(vcpu)) {
-		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
-				     __func__);
-		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-		vcpu->run->internal.suberror =
-			KVM_INTERNAL_ERROR_EMULATION;
-		vcpu->run->internal.ndata = 0;
-
-		return false;
-	}
-
-	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
-		return false;
-
-	return true;
-}
-
 static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
 	struct vmcs12 *vmcs12;
@@ -3401,12 +3369,12 @@  enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	prepare_vmcs02_early(vmx, &vmx->vmcs01, vmcs12);
 
-	if (from_vmentry) {
-		if (unlikely(!nested_get_vmcs12_pages(vcpu))) {
-			vmx_switch_vmcs(vcpu, &vmx->vmcs01);
-			return NVMX_VMENTRY_KVM_INTERNAL_ERROR;
-		}
+	if (unlikely(!nested_get_vmcs12_pages(vcpu))) {
+		vmx_switch_vmcs(vcpu, &vmx->vmcs01);
+		return NVMX_VMENTRY_KVM_INTERNAL_ERROR;
+	}
 
+	if (from_vmentry) {
 		if (nested_vmx_check_vmentry_hw(vcpu)) {
 			vmx_switch_vmcs(vcpu, &vmx->vmcs01);
 			return NVMX_VMENTRY_VMFAIL;
@@ -3428,24 +3396,14 @@  enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 		goto vmentry_fail_vmexit_guest_mode;
 	}
 
-	if (from_vmentry) {
-		failed_index = nested_vmx_load_msr(vcpu,
-						   vmcs12->vm_entry_msr_load_addr,
-						   vmcs12->vm_entry_msr_load_count);
-		if (failed_index) {
-			exit_reason.basic = EXIT_REASON_MSR_LOAD_FAIL;
-			vmcs12->exit_qualification = failed_index;
-			goto vmentry_fail_vmexit_guest_mode;
-		}
-	} else {
-		/*
-		 * The MMU is not initialized to point at the right entities yet and
-		 * "get pages" would need to read data from the guest (i.e. we will
-		 * need to perform gpa to hpa translation). Request a call
-		 * to nested_get_vmcs12_pages before the next VM-entry.  The MSRs
-		 * have already been set at vmentry time and should not be reset.
-		 */
-		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+
+	failed_index = nested_vmx_load_msr(vcpu,
+					   vmcs12->vm_entry_msr_load_addr,
+					   vmcs12->vm_entry_msr_load_count);
+	if (failed_index) {
+		exit_reason.basic = EXIT_REASON_MSR_LOAD_FAIL;
+		vmcs12->exit_qualification = failed_index;
+		goto vmentry_fail_vmexit_guest_mode;
 	}
 
 	/*
@@ -4515,16 +4473,6 @@  void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	/* Similarly, triple faults in L2 should never escape. */
 	WARN_ON_ONCE(kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu));
 
-	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
-		/*
-		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
-		 * Enlightened VMCS after migration and we still need to
-		 * do that when something is forcing L2->L1 exit prior to
-		 * the first L2 run.
-		 */
-		(void)nested_get_evmcs_page(vcpu);
-	}
-
 	/* Service pending TLB flush requests for L2 before switching to L1. */
 	kvm_service_local_tlb_flush_requests(vcpu);
 
@@ -6348,14 +6296,17 @@  static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
 
 		set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
 	} else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
+
+		vmx->nested.hv_evmcs_vmptr = EVMPTR_MAP_PENDING;
+
 		/*
 		 * nested_vmx_handle_enlightened_vmptrld() cannot be called
-		 * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
-		 * restored yet. EVMCS will be mapped from
-		 * nested_get_vmcs12_pages().
+		 * directly from here if HV_X64_MSR_VP_ASSIST_PAGE is not
+		 * restored yet. EVMCS will be mapped when it is.
 		 */
-		vmx->nested.hv_evmcs_vmptr = EVMPTR_MAP_PENDING;
-		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+		if (kvm_hv_assist_page_enabled(vcpu))
+			nested_get_evmcs_page(vcpu);
+
 	} else {
 		return -EINVAL;
 	}
@@ -6778,8 +6729,8 @@  struct kvm_x86_nested_ops vmx_nested_ops = {
 	.triple_fault = nested_vmx_triple_fault,
 	.get_state = vmx_get_nested_state,
 	.set_state = vmx_set_nested_state,
-	.get_nested_state_pages = vmx_get_nested_state_pages,
 	.write_log_dirty = nested_vmx_write_pml_buffer,
 	.enable_evmcs = nested_enable_evmcs,
 	.get_evmcs_version = nested_get_evmcs_version,
+	.get_evmcs_page = nested_get_evmcs_page,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ae401113a80b..283a3f52aa135 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9727,12 +9727,6 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = -EIO;
 			goto out;
 		}
-		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
-			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
-				r = 0;
-				goto out;
-			}
-		}
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
-- 
2.26.3