
[v2,1/3] kvm: svm: Add support for additional SVM NPF error codes

Message ID 147992049856.27638.17076562184960611399.stgit@brijesh-build-machine (mailing list archive)
State New, archived

Commit Message

Brijesh Singh Nov. 23, 2016, 5:01 p.m. UTC
From: Tom Lendacky <thomas.lendacky@amd.com>

AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   11 ++++++++++-
 arch/x86/kvm/mmu.c              |   20 ++++++++++++++++++--
 arch/x86/kvm/svm.c              |    2 +-
 3 files changed, 29 insertions(+), 4 deletions(-)



Comments

Paolo Bonzini July 27, 2017, 4:27 p.m. UTC | #1
On 23/11/2016 18:01, Brijesh Singh wrote:
>  
> +	/*
> +	 * Before emulating the instruction, check if the error code
> +	 * was due to a RO violation while translating the guest page.
> +	 * This can occur when using nested virtualization with nested
> +	 * paging in both guests. If true, we simply unprotect the page
> +	 * and resume the guest.
> +	 *
> +	 * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
> +	 *       in PFERR_NEXT_GUEST_PAGE)
> +	 */
> +	if (error_code == PFERR_NESTED_GUEST_PAGE) {
> +		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
> +		return 1;
> +	}


What happens if L1 is mapping some memory that is read only in L0?  That
is, the L1 nested page tables make it read-write, but the L0 shadow
nested page tables make it read-only.

Accessing it would cause an NPF, and then my guess is that the L1 guest
would loop on the failing instruction instead of just dropping the write.

Paolo
Brijesh Singh July 31, 2017, 1:30 p.m. UTC | #2
Hi Paolo,

On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
> On 23/11/2016 18:01, Brijesh Singh wrote:
>>   
>> +	/*
>> +	 * Before emulating the instruction, check if the error code
>> +	 * was due to a RO violation while translating the guest page.
>> +	 * This can occur when using nested virtualization with nested
>> +	 * paging in both guests. If true, we simply unprotect the page
>> +	 * and resume the guest.
>> +	 *
>> +	 * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>> +	 *       in PFERR_NEXT_GUEST_PAGE)
>> +	 */
>> +	if (error_code == PFERR_NESTED_GUEST_PAGE) {
>> +		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>> +		return 1;
>> +	}
> 
> 
> What happens if L1 is mapping some memory that is read only in L0?  That
> is, the L1 nested page tables make it read-write, but the L0 shadow
> nested page tables make it read-only.
> 
> Accessing it would cause an NPF, and then my guess is that the L1 guest
> would loop on the failing instruction instead of just dropping the write.
> 


I am not sure I am able to follow your use case. Could you please explain
in a bit more detail?

The purpose of the code above was really for when we resume from the L2 guest
back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
(for shadow paging) as I recall, so when we come back to the L1 guest, it can
get a fault since its page tables are not marked writeable at L0 as they need to be.

-Brijesh
Paolo Bonzini July 31, 2017, 3:44 p.m. UTC | #3
On 31/07/2017 15:30, Brijesh Singh wrote:
> Hi Paolo,
> 
> On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
>> On 23/11/2016 18:01, Brijesh Singh wrote:
>>>   +    /*
>>> +     * Before emulating the instruction, check if the error code
>>> +     * was due to a RO violation while translating the guest page.
>>> +     * This can occur when using nested virtualization with nested
>>> +     * paging in both guests. If true, we simply unprotect the page
>>> +     * and resume the guest.
>>> +     *
>>> +     * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>>> +     *       in PFERR_NEXT_GUEST_PAGE)
>>> +     */
>>> +    if (error_code == PFERR_NESTED_GUEST_PAGE) {
>>> +        kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>>> +        return 1;
>>> +    }
>>
>>
>> What happens if L1 is mapping some memory that is read only in L0?  That
>> is, the L1 nested page tables make it read-write, but the L0 shadow
>> nested page tables make it read-only.
>>
>> Accessing it would cause an NPF, and then my guess is that the L1 guest
>> would loop on the failing instruction instead of just dropping the write.
>>
> 
> 
> Not sure if I am able to follow your use case. Could you please explain me
> in bit detail.
> 
> The purpose of the code above was really for when we resume from the L2 guest
> back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
> (for shadow paging) as I recall, so when we come back to the L1 guest, it can
> get a fault since its page tables are not marked writeable at L0 as they
> need to be.

There can be different cases where an L0->L2 shadow nested page table is
marked read only, in particular when a page is read only in L1's nested
page tables.  If such a page is accessed by L2 while walking page tables
it will cause a nested page fault (page table walks are write accesses).
 However, after kvm_mmu_unprotect_page you will get another page fault,
and again in an endless stream.

Instead, emulation would have caused a nested page fault vmexit, I think.

Paolo
Brijesh Singh July 31, 2017, 4:54 p.m. UTC | #4
On 07/31/2017 10:44 AM, Paolo Bonzini wrote:
> On 31/07/2017 15:30, Brijesh Singh wrote:
>> Hi Paolo,
>>
>> On 07/27/2017 11:27 AM, Paolo Bonzini wrote:
>>> On 23/11/2016 18:01, Brijesh Singh wrote:
>>>>    +    /*
>>>> +     * Before emulating the instruction, check if the error code
>>>> +     * was due to a RO violation while translating the guest page.
>>>> +     * This can occur when using nested virtualization with nested
>>>> +     * paging in both guests. If true, we simply unprotect the page
>>>> +     * and resume the guest.
>>>> +     *
>>>> +     * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
>>>> +     *       in PFERR_NEXT_GUEST_PAGE)
>>>> +     */
>>>> +    if (error_code == PFERR_NESTED_GUEST_PAGE) {
>>>> +        kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
>>>> +        return 1;
>>>> +    }
>>>
>>>
>>> What happens if L1 is mapping some memory that is read only in L0?  That
>>> is, the L1 nested page tables make it read-write, but the L0 shadow
>>> nested page tables make it read-only.
>>>
>>> Accessing it would cause an NPF, and then my guess is that the L1 guest
>>> would loop on the failing instruction instead of just dropping the write.
>>>
>>
>>
>> Not sure if I am able to follow your use case. Could you please explain me
>> in bit detail.
>>
>> The purpose of the code above was really for when we resume from the L2 guest
>> back to the L1 guest. The L1 page tables are marked RO when in the L2 guest
>> (for shadow paging) as I recall, so when we come back to the L1 guest, it can
>> get a fault since its page tables are not marked writeable at L0 as they
>> need to be.
> 
> There can be different cases where an L0->L2 shadow nested page table is
> marked read only, in particular when a page is read only in L1's nested
> page tables.  If such a page is accessed by L2 while walking page tables
> it will cause a nested page fault (page table walks are write accesses).
>   However, after kvm_mmu_unprotect_page you will get another page fault,
> and again in an endless stream.
> 
> Instead, emulation would have caused a nested page fault vmexit, I think.
> 

If possible, could you please give me some pointers on how to create this use
case so that we can get a definitive answer.

Looking at the code path gives me the indication that the new code
(the kvm_mmu_unprotect_page call) only runs if vcpu->arch.mmu.page_fault()
returns an indication that the instruction should be emulated. I would not expect
that to be the case in the scenario you described above, since L1 making a page
read-only (this is a page table for L2) is an error and should result in a #NPF
being injected into L1. It's a bit hard for me to visualize the code flow and
figure out exactly how that would happen, but I just tried booting nested
virtualization and it seems to be working okay.

Is there a kvm-unit-test which I can run to trigger this scenario? Thanks.

-Brijesh
Paolo Bonzini July 31, 2017, 8:05 p.m. UTC | #5
> > There can be different cases where an L0->L2 shadow nested page table is
> > marked read only, in particular when a page is read only in L1's nested
> > page tables.  If such a page is accessed by L2 while walking page tables
> > it will cause a nested page fault (page table walks are write accesses).
> >   However, after kvm_mmu_unprotect_page you will get another page fault,
> > and again in an endless stream.
> > 
> > Instead, emulation would have caused a nested page fault vmexit, I think.
> 
> If possible could you please give me some pointer on how to create this use
> case so that we can get definitive answer.
> 
> Looking at the code path is giving me indication that the new code
> (the kvm_mmu_unprotect_page call) only happens if vcpu->arch.mmu_page_fault()
> returns an indication that the instruction should be emulated. I would not
> expect that to be the case scenario you described above since L1 making a page
> read-only (this is a page table for L2) is an error and should result in #NPF
> being injected into L1.

The flow is:

  hardware walks page table; L2 page table points to read only memory
  -> pf_interception (code = 
  -> kvm_handle_page_fault (need_unprotect = false)
  -> kvm_mmu_page_fault
  -> paging64_page_fault (for example)
     -> try_async_pf
        map_writable set to false
     -> paging64_fetch(write_fault = true, map_writable = false, prefault = false)
        -> mmu_set_spte(speculative = false, host_writable = false, write_fault = true)
           -> set_spte
              mmu_need_write_protect returns true
              return true
           write_fault == true -> set emulate = true
           return true
        return true
     return true
  emulate

Without this patch, emulation would have called

  ..._gva_to_gpa_nested
  -> translate_nested_gpa
  -> paging64_gva_to_gpa
  -> paging64_walk_addr
  -> paging64_walk_addr_generic
     set fault (nested_page_fault=true)

and then:

   kvm_propagate_fault
   -> nested_svm_inject_npf_exit

> It's bit hard for me to visualize the code flow and
> figure out exactly how that would happen, but I just tried booting nested
> virtualization and it seem to be working okay.

I don't expect the above to happen when booting a normal guest (usual L1
guests hardly have readonly mappings).

> Is there a kvm-unit-test which I can run to trigger this scenario ? thanks

No, there isn't.

Paolo

> -Brijesh
>
Brijesh Singh Aug. 1, 2017, 1:36 p.m. UTC | #6
On 07/31/2017 03:05 PM, Paolo Bonzini wrote:
> 
>>> There can be different cases where an L0->L2 shadow nested page table is
>>> marked read only, in particular when a page is read only in L1's nested
>>> page tables.  If such a page is accessed by L2 while walking page tables
>>> it will cause a nested page fault (page table walks are write accesses).
>>>    However, after kvm_mmu_unprotect_page you will get another page fault,
>>> and again in an endless stream.
>>>
>>> Instead, emulation would have caused a nested page fault vmexit, I think.
>>
>> If possible could you please give me some pointer on how to create this use
>> case so that we can get definitive answer.
>>
>> Looking at the code path is giving me indication that the new code
>> (the kvm_mmu_unprotect_page call) only happens if vcpu->arch.mmu_page_fault()
>> returns an indication that the instruction should be emulated. I would not
>> expect that to be the case scenario you described above since L1 making a page
>> read-only (this is a page table for L2) is an error and should result in #NPF
>> being injected into L1.
> 
> The flow is:
> 
>    hardware walks page table; L2 page table points to read only memory
>    -> pf_interception (code =
>    -> kvm_handle_page_fault (need_unprotect = false)
>    -> kvm_mmu_page_fault
>    -> paging64_page_fault (for example)
>       -> try_async_pf
>          map_writable set to false
>       -> paging64_fetch(write_fault = true, map_writable = false, prefault = false)
>          -> mmu_set_spte(speculative = false, host_writable = false, write_fault = true)
>             -> set_spte
>                mmu_need_write_protect returns true
>                return true
>             write_fault == true -> set emulate = true
>             return true
>          return true
>       return true
>    emulate
> 
> Without this patch, emulation would have called
> 
>    ..._gva_to_gpa_nested
>    -> translate_nested_gpa
>    -> paging64_gva_to_gpa
>    -> paging64_walk_addr
>    -> paging64_walk_addr_generic
>       set fault (nested_page_fault=true)
> 
> and then:
> 
>     kvm_propagate_fault
>     -> nested_svm_inject_npf_exit
> 

Maybe then the safer thing would be to qualify the new error_code check with
!mmu_is_nested(vcpu) or something like that, so that it would run only for the
L1 guest and not the L2 guest. I believe that would restrict it enough to
avoid hitting this case. Are you okay with this change?
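
Something along these lines (completely untested, just a sketch to show the
idea; all the names are taken from the existing patch):

	if (!mmu_is_nested(vcpu) && error_code == PFERR_NESTED_GUEST_PAGE) {
		/*
		 * The fault came straight from L1's own page table walk,
		 * so unprotect the page and resume the guest without
		 * emulating.
		 */
		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
		return 1;
	}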

IIRC, the main place where this check was valuable was when the L1 guest had
a fault (when coming out of the L2 guest) and emulation was not needed.

-Brijesh
Paolo Bonzini Aug. 2, 2017, 10:42 a.m. UTC | #7
On 01/08/2017 15:36, Brijesh Singh wrote:
>>
>> The flow is:
>>
>>    hardware walks page table; L2 page table points to read only memory
>>    -> pf_interception (code =
>>    -> kvm_handle_page_fault (need_unprotect = false)
>>    -> kvm_mmu_page_fault
>>    -> paging64_page_fault (for example)
>>       -> try_async_pf
>>          map_writable set to false
>>       -> paging64_fetch(write_fault = true, map_writable = false,
>> prefault = false)
>>          -> mmu_set_spte(speculative = false, host_writable = false,
>> write_fault = true)
>>             -> set_spte
>>                mmu_need_write_protect returns true
>>                return true
>>             write_fault == true -> set emulate = true
>>             return true
>>          return true
>>       return true
>>    emulate
>>
>> Without this patch, emulation would have called
>>
>>    ..._gva_to_gpa_nested
>>    -> translate_nested_gpa
>>    -> paging64_gva_to_gpa
>>    -> paging64_walk_addr
>>    -> paging64_walk_addr_generic
>>       set fault (nested_page_fault=true)
>>
>> and then:
>>
>>     kvm_propagate_fault
>>     -> nested_svm_inject_npf_exit
>>
> 
> maybe then safer thing would be to qualify the new error_code check with
> !mmu_is_nested(vcpu) or something like that. So that way it would run on
> L1 guest, and not the L2 guest. I believe that would restrict it avoid
> hitting this case. Are you okay with this change ?

Or check "vcpu->arch.mmu.direct_map"?  That would be true when not using
shadow pages.
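
Untested sketch of where it would sit in kvm_mmu_page_fault() (the same spot
as the hunk in this patch):

	if (vcpu->arch.mmu.direct_map && error_code == PFERR_NESTED_GUEST_PAGE) {
		/*
		 * direct_map means we are not shadowing a guest page table
		 * (i.e. not running L2 through shadow NPT), so the
		 * unprotect-and-resume shortcut is safe.  Otherwise fall
		 * through to emulation so the fault can be reflected into
		 * L1 as a nested page fault.
		 */
		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
		return 1;
	}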

> IIRC, the main place where this check was valuable was when L1 guest had
> a fault (when coming out of the L2 guest) and emulation was not needed.

How do I measure the effect?  I tried counting the number of emulations,
and any difference from the patch was lost in noise.

Paolo
Brijesh Singh Aug. 4, 2017, 12:30 a.m. UTC | #8
On 8/2/17 5:42 AM, Paolo Bonzini wrote:
> On 01/08/2017 15:36, Brijesh Singh wrote:
>>> The flow is:
>>>
>>>    hardware walks page table; L2 page table points to read only memory
>>>    -> pf_interception (code =
>>>    -> kvm_handle_page_fault (need_unprotect = false)
>>>    -> kvm_mmu_page_fault
>>>    -> paging64_page_fault (for example)
>>>       -> try_async_pf
>>>          map_writable set to false
>>>       -> paging64_fetch(write_fault = true, map_writable = false,
>>> prefault = false)
>>>          -> mmu_set_spte(speculative = false, host_writable = false,
>>> write_fault = true)
>>>             -> set_spte
>>>                mmu_need_write_protect returns true
>>>                return true
>>>             write_fault == true -> set emulate = true
>>>             return true
>>>          return true
>>>       return true
>>>    emulate
>>>
>>> Without this patch, emulation would have called
>>>
>>>    ..._gva_to_gpa_nested
>>>    -> translate_nested_gpa
>>>    -> paging64_gva_to_gpa
>>>    -> paging64_walk_addr
>>>    -> paging64_walk_addr_generic
>>>       set fault (nested_page_fault=true)
>>>
>>> and then:
>>>
>>>     kvm_propagate_fault
>>>     -> nested_svm_inject_npf_exit
>>>
>> maybe then safer thing would be to qualify the new error_code check with
>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>> hitting this case. Are you okay with this change ?
> Or check "vcpu->arch.mmu.direct_map"?  That would be true when not using
> shadow pages.

Yes that can be used.

>> IIRC, the main place where this check was valuable was when L1 guest had
>> a fault (when coming out of the L2 guest) and emulation was not needed.
> How do I measure the effect?  I tried counting the number of emulations,
> and any difference from the patch was lost in noise.

I think this patch is necessary for functional reasons (not just perf), because we added the other patch to look at the GPA and stop walking the guest page tables on an NPF.

The issue, I think, was that hardware took an NPF because the page table was marked RO, and it saved the GPA in the VMCB. KVM then went to emulate the instruction and saw that a GPA was available. But that GPA was not the GPA of the instruction it was emulating; it was the GPA of the table-walk page that had the fault. We debugged that at the time and realized that emulating the instruction was unnecessary, so we added this new code, which fixed the functional issue and helps perf.

I don't have any data on how much it helps perf; as I recall it was most effective when the L1 guest page tables and L2 nested page tables were exactly the same. In that case, it avoided emulations for code that L1 executes, which I think could be as much as one emulation per 4KB code page.
Paolo Bonzini Aug. 4, 2017, 2:05 p.m. UTC | #9
On 04/08/2017 02:30, Brijesh Singh wrote:
> 
> 
> On 8/2/17 5:42 AM, Paolo Bonzini wrote:
>> On 01/08/2017 15:36, Brijesh Singh wrote:
>>>> The flow is:
>>>>
>>>>    hardware walks page table; L2 page table points to read only memory
>>>>    -> pf_interception (code =
>>>>    -> kvm_handle_page_fault (need_unprotect = false)
>>>>    -> kvm_mmu_page_fault
>>>>    -> paging64_page_fault (for example)
>>>>       -> try_async_pf
>>>>          map_writable set to false
>>>>       -> paging64_fetch(write_fault = true, map_writable = false,
>>>> prefault = false)
>>>>          -> mmu_set_spte(speculative = false, host_writable = false,
>>>> write_fault = true)
>>>>             -> set_spte
>>>>                mmu_need_write_protect returns true
>>>>                return true
>>>>             write_fault == true -> set emulate = true
>>>>             return true
>>>>          return true
>>>>       return true
>>>>    emulate
>>>>
>>>> Without this patch, emulation would have called
>>>>
>>>>    ..._gva_to_gpa_nested
>>>>    -> translate_nested_gpa
>>>>    -> paging64_gva_to_gpa
>>>>    -> paging64_walk_addr
>>>>    -> paging64_walk_addr_generic
>>>>       set fault (nested_page_fault=true)
>>>>
>>>> and then:
>>>>
>>>>     kvm_propagate_fault
>>>>     -> nested_svm_inject_npf_exit
>>>>
>>> maybe then safer thing would be to qualify the new error_code check with
>>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>>> hitting this case. Are you okay with this change ?
>> Or check "vcpu->arch.mmu.direct_map"?  That would be true when not using
>> shadow pages.
> 
> Yes that can be used.

Are you going to send a patch for this?

Paolo

>>> IIRC, the main place where this check was valuable was when L1 guest had
>>> a fault (when coming out of the L2 guest) and emulation was not needed.
>> How do I measure the effect?  I tried counting the number of emulations,
>> and any difference from the patch was lost in noise.
> 
> I think this patch is necessary for functional reasons (not just
> perf), because we added the other patch to look at the GPA and stop
> walking the guest page tables on a NPF.
> 
> The issue I think was that hardware has taken an NPF because the page
> table is marked RO, and it saves the GPA in the VMCB. KVM was then going
> and emulating the instruction and it saw that a GPA was available. But
> that GPA was not the GPA of the instruction it is emulating, since it
> was the GPA of the tablewalk page that had the fault. It was debugged
> that at the time and realized that emulating the instruction was
> unnecessary so we added this new code in there which fixed the
> functional issue and helps perf.
> 
> I don't have any data on how much perf, as I recall it was most
> effective when the L1 guest page tables and L2 nested page tables were
> exactly the same. In that case, it avoided emulations for code that L1
> executes which I think could be as much as one emulation per 4kb code page.
>
Brijesh Singh Aug. 4, 2017, 2:23 p.m. UTC | #10
Hi Paolo,

On 08/04/2017 09:05 AM, Paolo Bonzini wrote:
> On 04/08/2017 02:30, Brijesh Singh wrote:
>>
>>
>> On 8/2/17 5:42 AM, Paolo Bonzini wrote:
>>> On 01/08/2017 15:36, Brijesh Singh wrote:
>>>>> The flow is:
>>>>>
>>>>>     hardware walks page table; L2 page table points to read only memory
>>>>>     -> pf_interception (code =
>>>>>     -> kvm_handle_page_fault (need_unprotect = false)
>>>>>     -> kvm_mmu_page_fault
>>>>>     -> paging64_page_fault (for example)
>>>>>        -> try_async_pf
>>>>>           map_writable set to false
>>>>>        -> paging64_fetch(write_fault = true, map_writable = false,
>>>>> prefault = false)
>>>>>           -> mmu_set_spte(speculative = false, host_writable = false,
>>>>> write_fault = true)
>>>>>              -> set_spte
>>>>>                 mmu_need_write_protect returns true
>>>>>                 return true
>>>>>              write_fault == true -> set emulate = true
>>>>>              return true
>>>>>           return true
>>>>>        return true
>>>>>     emulate
>>>>>
>>>>> Without this patch, emulation would have called
>>>>>
>>>>>     ..._gva_to_gpa_nested
>>>>>     -> translate_nested_gpa
>>>>>     -> paging64_gva_to_gpa
>>>>>     -> paging64_walk_addr
>>>>>     -> paging64_walk_addr_generic
>>>>>        set fault (nested_page_fault=true)
>>>>>
>>>>> and then:
>>>>>
>>>>>      kvm_propagate_fault
>>>>>      -> nested_svm_inject_npf_exit
>>>>>
>>>> maybe then safer thing would be to qualify the new error_code check with
>>>> !mmu_is_nested(vcpu) or something like that. So that way it would run on
>>>> L1 guest, and not the L2 guest. I believe that would restrict it avoid
>>>> hitting this case. Are you okay with this change ?
>>> Or check "vcpu->arch.mmu.direct_map"?  That would be true when not using
>>> shadow pages.
>>
>> Yes that can be used.
> 
> Are you going to send a patch for this?
> 

Yes. I should be posting it by Monday or Tuesday - I need some time to verify it.

-Brijesh

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bdde807..da07e17 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -191,6 +191,8 @@  enum {
 #define PFERR_RSVD_BIT 3
 #define PFERR_FETCH_BIT 4
 #define PFERR_PK_BIT 5
+#define PFERR_GUEST_FINAL_BIT 32
+#define PFERR_GUEST_PAGE_BIT 33
 
 #define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
 #define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
@@ -198,6 +200,13 @@  enum {
 #define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
 #define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
 #define PFERR_PK_MASK (1U << PFERR_PK_BIT)
+#define PFERR_GUEST_FINAL_MASK (1ULL << PFERR_GUEST_FINAL_BIT)
+#define PFERR_GUEST_PAGE_MASK (1ULL << PFERR_GUEST_PAGE_BIT)
+
+#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK |	\
+				 PFERR_USER_MASK |		\
+				 PFERR_WRITE_MASK |		\
+				 PFERR_PRESENT_MASK)
 
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
@@ -1203,7 +1212,7 @@  void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u64 error_code,
 		       void *insn, int insn_len);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..f633d29 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4508,7 +4508,7 @@  static void make_mmu_pages_available(struct kvm_vcpu *vcpu)
 	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 }
 
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 		       void *insn, int insn_len)
 {
 	int r, emulation_type = EMULTYPE_RETRY;
@@ -4527,12 +4527,28 @@  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
 			return r;
 	}
 
-	r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
+	r = vcpu->arch.mmu.page_fault(vcpu, cr2, lower_32_bits(error_code),
+				      false);
 	if (r < 0)
 		return r;
 	if (!r)
 		return 1;
 
+	/*
+	 * Before emulating the instruction, check if the error code
+	 * was due to a RO violation while translating the guest page.
+	 * This can occur when using nested virtualization with nested
+	 * paging in both guests. If true, we simply unprotect the page
+	 * and resume the guest.
+	 *
+	 * Note: AMD only (since it supports the PFERR_GUEST_PAGE_MASK used
+	 *       in PFERR_NESTED_GUEST_PAGE)
+	 */
+	if (error_code == PFERR_NESTED_GUEST_PAGE) {
+		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
+		return 1;
+	}
+
 	if (mmio_info_in_cache(vcpu, cr2, direct))
 		emulation_type = 0;
 emulate:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 8ca1eca..4e462bb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2074,7 +2074,7 @@  static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
 static int pf_interception(struct vcpu_svm *svm)
 {
 	u64 fault_address = svm->vmcb->control.exit_info_2;
-	u32 error_code;
+	u64 error_code;
 	int r = 1;
 
 	switch (svm->apf_reason) {