diff mbox series

[20/24] KVM: x86/mmu: Use a dedicated bit to track shadow/MMU-present SPTEs

Message ID 20210225204749.1512652-21-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: Introduce MMU_PRESENT and fix bugs | expand

Commit Message

Sean Christopherson Feb. 25, 2021, 8:47 p.m. UTC
Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
the MMU's perspective.  Checking for shadow-present SPTEs is a very
common operation for the MMU, particularly in hot paths such as page
faults.  With the addition of "removed" SPTEs for the TDP MMU,
identifying shadow-present SPTEs is quite costly especially since it
requires checking multiple 64-bit values.

On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
On 32-bit KVM, this increases the footprint by ~200 bytes, but only
because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c |  8 ++++----
 arch/x86/kvm/mmu/spte.h | 11 ++++++++++-
 2 files changed, 14 insertions(+), 5 deletions(-)

Comments

Tom Lendacky March 8, 2021, 6:52 p.m. UTC | #1
On 2/25/21 2:47 PM, Sean Christopherson wrote:
> Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
> the MMU's perspective.  Checking for shadow-present SPTEs is a very
> common operation for the MMU, particularly in hot paths such as page
> faults.  With the addition of "removed" SPTEs for the TDP MMU,
> identifying shadow-present SPTEs is quite costly especially since it
> requires checking multiple 64-bit values.
> 
> On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
> On 32-bit KVM, this increases the footprint by ~200 bytes, but only
> because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/spte.c |  8 ++++----
>   arch/x86/kvm/mmu/spte.h | 11 ++++++++++-
>   2 files changed, 14 insertions(+), 5 deletions(-)

I'm trying to run a guest on my Rome system using the queue branch, but
I'm encountering an error that I bisected to this commit. In the guest
(during OVMF boot) I see:

error: kvm run failed Invalid argument
RAX=0000000000000000 RBX=00000000ffc12792 RCX=000000007f58401a RDX=000000007faaf808
RSI=0000000000000010 RDI=00000000ffc12792 RBP=00000000ffc12792 RSP=000000007faaf740
R8 =0000000000000792 R9 =000000007faaf808 R10=00000000ffc12793 R11=00000000000003f8
R12=0000000000000010 R13=0000000000000000 R14=000000007faaf808 R15=0000000000000012
RIP=000000007f6e9a90 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT=     000000007f5ee698 00000047
IDT=     000000007f186018 00000fff
CR0=80010033 CR2=0000000000000000 CR3=000000007f801000 CR4=00000668
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d00
Code=22 00 00 e8 c0 e6 ff ff 48 83 c4 20 45 84 ed 74 07 fb eb 04 <44> 88 65 00 58 5b 5d 41 5c 41 5d c3 55 48 0f af 3d 1b 37 00 00 be 20 00 00 00 48 03 3d 17

On the hypervisor, I see the following:

[   55.886136] get_mmio_spte: detect reserved bits on spte, addr 0xffc12792, dump hierarchy:
[   55.895284] ------ spte 0x1344a0827 level 4.
[   55.900059] ------ spte 0x134499827 level 3.
[   55.904877] ------ spte 0x165bf0827 level 2.
[   55.909651] ------ spte 0xff800ffc12817 level 1.

When I kill the guest, I get a kernel panic:

[   95.539683] __pte_list_remove: 0000000040567a6a 0->BUG
[   95.545481] kernel BUG at arch/x86/kvm/mmu/mmu.c:896!
[   95.551133] invalid opcode: 0000 [#1] SMP NOPTI
[   95.556192] CPU: 142 PID: 5054 Comm: qemu-system-x86 Tainted: G        W         5.11.0-rc4-sos-sev-es #1
[   95.566872] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS REX1006G 01/25/2020
[   95.575900] RIP: 0010:__pte_list_remove.cold+0x2e/0x48 [kvm]
[   95.582312] Code: c7 c6 40 6f f3 c0 48 c7 c7 aa da f3 c0 e8 79 3d a7 cd 0f 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 87 da f3 c0 e8 61 3d a7 cd <0f> 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 98 da f3 c0 e8 49 3d
[   95.603271] RSP: 0018:ffffc900143e7c78 EFLAGS: 00010246
[   95.609093] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000000
[   95.617058] RDX: 0000000000000000 RSI: ffff88900e598950 RDI: ffff88900e598950
[   95.625019] RBP: ffff888165bf0090 R08: ffff88900e598950 R09: ffffc900143e7a98
[   95.632980] R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000ff29000
[   95.640944] R13: ffffc900143e7d18 R14: 0000000000000098 R15: 0000000000000000
[   95.648912] FS:  0000000000000000(0000) GS:ffff88900e580000(0000) knlGS:0000000000000000
[   95.657951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.664361] CR2: 00007fb328d20c80 CR3: 00000001476d2000 CR4: 0000000000350ee0
[   95.672326] Call Trace:
[   95.675065]  mmu_page_zap_pte+0xf9/0x130 [kvm]
[   95.680103]  __kvm_mmu_prepare_zap_page+0x6d/0x380 [kvm]
[   95.686088]  kvm_mmu_zap_all+0x5e/0xe0 [kvm]
[   95.690911]  kvm_mmu_notifier_release+0x2b/0x60 [kvm]
[   95.696614]  __mmu_notifier_release+0x71/0x1e0
[   95.701585]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[   95.707512]  ? __khugepaged_exit+0x111/0x160
[   95.712289]  exit_mmap+0x15b/0x1f0
[   95.716092]  ? __khugepaged_exit+0x111/0x160
[   95.720857]  ? kmem_cache_free+0x210/0x3f0
[   95.725428]  ? kmem_cache_free+0x387/0x3f0
[   95.729998]  mmput+0x56/0x130
[   95.733312]  do_exit+0x341/0xb50
[   95.736923]  do_group_exit+0x3a/0xa0
[   95.740925]  __x64_sys_exit_group+0x14/0x20
[   95.745600]  do_syscall_64+0x33/0x40
[   95.749601]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   95.755241] RIP: 0033:0x7fb333a442c6
[   95.759231] Code: Unable to access opcode bytes at RIP 0x7fb333a4429c.
[   95.766514] RSP: 002b:00007ffdf675cad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[   95.774962] RAX: ffffffffffffffda RBX: 00007fb333b4b610 RCX: 00007fb333a442c6
[   95.782925] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[   95.790892] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffdc38
[   95.798856] R10: 00007fb32945cf80 R11: 0000000000000246 R12: 00007fb333b4b610
[   95.806825] R13: 000000000000034c R14: 00007fb333b4efc8 R15: 0000000000000000
[   95.814803] Modules linked in: tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc intel_rapl_msr wmi_bmof intel_rapl_common amd64_edac_mod edac_mce_amd fuse kvm_amd kvm irqby
pass ipmi_ssif sg ccp k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xo
r async_tx xor raid6_pq raid1 raid0 linear sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ast drm_vram_helper i2c_algo_bit drm_ttm_helper 
ttm drm_kms_helper syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahci libata drm i2c_designware_platform e1000e i2c_piix4 wmi i2c_designware_core pinctrl_amd i2c_core
[   95.893646] ---[ end trace f40aac7ee7919c14 ]---
[   95.898848] RIP: 0010:__pte_list_remove.cold+0x2e/0x48 [kvm]
[   95.905258] Code: c7 c6 40 6f f3 c0 48 c7 c7 aa da f3 c0 e8 79 3d a7 cd 0f 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 87 da f3 c0 e8 61 3d a7 cd <0f> 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 98 da f3 c0 e8 49 3d
[   95.926234] RSP: 0018:ffffc900143e7c78 EFLAGS: 00010246
[   95.932109] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000000
[   95.940087] RDX: 0000000000000000 RSI: ffff88900e598950 RDI: ffff88900e598950
[   95.948086] RBP: ffff888165bf0090 R08: ffff88900e598950 R09: ffffc900143e7a98
[   95.956068] R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000ff29000
[   95.964051] R13: ffffc900143e7d18 R14: 0000000000000098 R15: 0000000000000000
[   95.972031] FS:  0000000000000000(0000) GS:ffff88900e580000(0000) knlGS:0000000000000000
[   95.981081] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.987510] CR2: 00007fb328d20c80 CR3: 00000001476d2000 CR4: 0000000000350ee0
[   95.995492] Kernel panic - not syncing: Fatal exception
[   96.008273] Kernel Offset: disabled
[   96.012249] ---[ end Kernel panic - not syncing: Fatal exception ]---

Let me know if there's anything you want me to try.

Thanks,
Tom

> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index d12acf5eb871..e07aabb23b8a 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -94,7 +94,7 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
>   		     bool can_unsync, bool host_writable, bool ad_disabled,
>   		     u64 *new_spte)
>   {
> -	u64 spte = 0;
> +	u64 spte = SPTE_MMU_PRESENT_MASK;
>   	int ret = 0;
>   
>   	if (ad_disabled)
> @@ -183,10 +183,10 @@ int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
>   
>   u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>   {
> -	u64 spte;
> +	u64 spte = SPTE_MMU_PRESENT_MASK;
>   
> -	spte = __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
> -	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
> +	spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
> +		shadow_user_mask | shadow_x_mask | shadow_me_mask;
>   
>   	if (ad_disabled)
>   		spte |= SPTE_TDP_AD_DISABLED_MASK;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 8996baa8da15..645e9bc2d4a2 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -5,6 +5,15 @@
>   
>   #include "mmu_internal.h"
>   
> +/*
> + * A MMU present SPTE is backed by actual memory and may or may not be present
> + * in hardware.  E.g. MMIO SPTEs are not considered present.  Use bit 11, as it
> + * is ignored by all flavors of SPTEs and checking a low bit often generates
> + * better code than for a high bit, e.g. 56+.  MMU present checks are pervasive
> + * enough that the improved code generation is noticeable in KVM's footprint.
> + */
> +#define SPTE_MMU_PRESENT_MASK		BIT_ULL(11)
> +
>   /*
>    * TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
>    * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
> @@ -241,7 +250,7 @@ static inline bool is_access_track_spte(u64 spte)
>   
>   static inline bool is_shadow_present_pte(u64 pte)
>   {
> -	return (pte != 0) && !is_mmio_spte(pte) && !is_removed_spte(pte);
> +	return !!(pte & SPTE_MMU_PRESENT_MASK);
>   }
>   
>   static inline bool is_large_pte(u64 pte)
>
Paolo Bonzini March 8, 2021, 7:48 p.m. UTC | #2
On 08/03/21 19:52, Tom Lendacky wrote:
> On 2/25/21 2:47 PM, Sean Christopherson wrote:
>> Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
>> the MMU's perspective.  Checking for shadow-present SPTEs is a very
>> common operation for the MMU, particularly in hot paths such as page
>> faults.  With the addition of "removed" SPTEs for the TDP MMU,
>> identifying shadow-present SPTEs is quite costly especially since it
>> requires checking multiple 64-bit values.
>>
>> On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
>> On 32-bit KVM, this increases the footprint by ~200 bytes, but only
>> because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> ---
>>    arch/x86/kvm/mmu/spte.c |  8 ++++----
>>    arch/x86/kvm/mmu/spte.h | 11 ++++++++++-
>>    2 files changed, 14 insertions(+), 5 deletions(-)
> 
> I'm trying to run a guest on my Rome system using the queue branch, but
> I'm encountering an error that I bisected to this commit. In the guest
> (during OVMF boot) I see:
> 
> error: kvm run failed Invalid argument
> RAX=0000000000000000 RBX=00000000ffc12792 RCX=000000007f58401a RDX=000000007faaf808
> RSI=0000000000000010 RDI=00000000ffc12792 RBP=00000000ffc12792 RSP=000000007faaf740
> R8 =0000000000000792 R9 =000000007faaf808 R10=00000000ffc12793 R11=00000000000003f8
> R12=0000000000000010 R13=0000000000000000 R14=000000007faaf808 R15=0000000000000012
> RIP=000000007f6e9a90 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT=     000000007f5ee698 00000047
> IDT=     000000007f186018 00000fff
> CR0=80010033 CR2=0000000000000000 CR3=000000007f801000 CR4=00000668
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000d00
> Code=22 00 00 e8 c0 e6 ff ff 48 83 c4 20 45 84 ed 74 07 fb eb 04 <44> 88 65 00 58 5b 5d 41 5c 41 5d c3 55 48 0f af 3d 1b 37 00 00 be 20 00 00 00 48 03 3d 17
> 
> On the hypervisor, I see the following:
> 
> [   55.886136] get_mmio_spte: detect reserved bits on spte, addr 0xffc12792, dump hierarchy:
> [   55.895284] ------ spte 0x1344a0827 level 4.
> [   55.900059] ------ spte 0x134499827 level 3.
> [   55.904877] ------ spte 0x165bf0827 level 2.
> [   55.909651] ------ spte 0xff800ffc12817 level 1.
> 
> When I kill the guest, I get a kernel panic:
> 
> [   95.539683] __pte_list_remove: 0000000040567a6a 0->BUG
> [   95.545481] kernel BUG at arch/x86/kvm/mmu/mmu.c:896!
> [   95.551133] invalid opcode: 0000 [#1] SMP NOPTI
> [   95.556192] CPU: 142 PID: 5054 Comm: qemu-system-x86 Tainted: G        W         5.11.0-rc4-sos-sev-es #1
> [   95.566872] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS REX1006G 01/25/2020
> [   95.575900] RIP: 0010:__pte_list_remove.cold+0x2e/0x48 [kvm]
> [   95.582312] Code: c7 c6 40 6f f3 c0 48 c7 c7 aa da f3 c0 e8 79 3d a7 cd 0f 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 87 da f3 c0 e8 61 3d a7 cd <0f> 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 98 da f3 c0 e8 49 3d
> [   95.603271] RSP: 0018:ffffc900143e7c78 EFLAGS: 00010246
> [   95.609093] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000000
> [   95.617058] RDX: 0000000000000000 RSI: ffff88900e598950 RDI: ffff88900e598950
> [   95.625019] RBP: ffff888165bf0090 R08: ffff88900e598950 R09: ffffc900143e7a98
> [   95.632980] R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000ff29000
> [   95.640944] R13: ffffc900143e7d18 R14: 0000000000000098 R15: 0000000000000000
> [   95.648912] FS:  0000000000000000(0000) GS:ffff88900e580000(0000) knlGS:0000000000000000
> [   95.657951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   95.664361] CR2: 00007fb328d20c80 CR3: 00000001476d2000 CR4: 0000000000350ee0
> [   95.672326] Call Trace:
> [   95.675065]  mmu_page_zap_pte+0xf9/0x130 [kvm]
> [   95.680103]  __kvm_mmu_prepare_zap_page+0x6d/0x380 [kvm]
> [   95.686088]  kvm_mmu_zap_all+0x5e/0xe0 [kvm]
> [   95.690911]  kvm_mmu_notifier_release+0x2b/0x60 [kvm]
> [   95.696614]  __mmu_notifier_release+0x71/0x1e0
> [   95.701585]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
> [   95.707512]  ? __khugepaged_exit+0x111/0x160
> [   95.712289]  exit_mmap+0x15b/0x1f0
> [   95.716092]  ? __khugepaged_exit+0x111/0x160
> [   95.720857]  ? kmem_cache_free+0x210/0x3f0
> [   95.725428]  ? kmem_cache_free+0x387/0x3f0
> [   95.729998]  mmput+0x56/0x130
> [   95.733312]  do_exit+0x341/0xb50
> [   95.736923]  do_group_exit+0x3a/0xa0
> [   95.740925]  __x64_sys_exit_group+0x14/0x20
> [   95.745600]  do_syscall_64+0x33/0x40
> [   95.749601]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   95.755241] RIP: 0033:0x7fb333a442c6
> [   95.759231] Code: Unable to access opcode bytes at RIP 0x7fb333a4429c.
> [   95.766514] RSP: 002b:00007ffdf675cad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [   95.774962] RAX: ffffffffffffffda RBX: 00007fb333b4b610 RCX: 00007fb333a442c6
> [   95.782925] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [   95.790892] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffdc38
> [   95.798856] R10: 00007fb32945cf80 R11: 0000000000000246 R12: 00007fb333b4b610
> [   95.806825] R13: 000000000000034c R14: 00007fb333b4efc8 R15: 0000000000000000
> [   95.814803] Modules linked in: tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc intel_rapl_msr wmi_bmof intel_rapl_common amd64_edac_mod edac_mce_amd fuse kvm_amd kvm irqby
> pass ipmi_ssif sg ccp k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xo
> r async_tx xor raid6_pq raid1 raid0 linear sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ast drm_vram_helper i2c_algo_bit drm_ttm_helper
> ttm drm_kms_helper syscopyarea sysfillrect sysimgblt ahci fb_sys_fops libahci libata drm i2c_designware_platform e1000e i2c_piix4 wmi i2c_designware_core pinctrl_amd i2c_core
> [   95.893646] ---[ end trace f40aac7ee7919c14 ]---
> [   95.898848] RIP: 0010:__pte_list_remove.cold+0x2e/0x48 [kvm]
> [   95.905258] Code: c7 c6 40 6f f3 c0 48 c7 c7 aa da f3 c0 e8 79 3d a7 cd 0f 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 87 da f3 c0 e8 61 3d a7 cd <0f> 0b 48 89 fa 48 c7 c6 40 6f f3 c0 48 c7 c7 98 da f3 c0 e8 49 3d
> [   95.926234] RSP: 0018:ffffc900143e7c78 EFLAGS: 00010246
> [   95.932109] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000000
> [   95.940087] RDX: 0000000000000000 RSI: ffff88900e598950 RDI: ffff88900e598950
> [   95.948086] RBP: ffff888165bf0090 R08: ffff88900e598950 R09: ffffc900143e7a98
> [   95.956068] R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000ff29000
> [   95.964051] R13: ffffc900143e7d18 R14: 0000000000000098 R15: 0000000000000000
> [   95.972031] FS:  0000000000000000(0000) GS:ffff88900e580000(0000) knlGS:0000000000000000
> [   95.981081] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   95.987510] CR2: 00007fb328d20c80 CR3: 00000001476d2000 CR4: 0000000000350ee0
> [   95.995492] Kernel panic - not syncing: Fatal exception
> [   96.008273] Kernel Offset: disabled
> [   96.012249] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> Let me know if there's anything you want me to try.

I can confirm that we're seeing the same failure, though we hadn't 
bisected it yet.

Paolo
Sean Christopherson March 8, 2021, 8:11 p.m. UTC | #3
On Mon, Mar 08, 2021, Tom Lendacky wrote:
> On 2/25/21 2:47 PM, Sean Christopherson wrote:
> > Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
> > the MMU's perspective.  Checking for shadow-present SPTEs is a very
> > common operation for the MMU, particularly in hot paths such as page
> > faults.  With the addition of "removed" SPTEs for the TDP MMU,
> > identifying shadow-present SPTEs is quite costly especially since it
> > requires checking multiple 64-bit values.
> > 
> > On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
> > On 32-bit KVM, this increases the footprint by ~200 bytes, but only
> > because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/mmu/spte.c |  8 ++++----
> >   arch/x86/kvm/mmu/spte.h | 11 ++++++++++-
> >   2 files changed, 14 insertions(+), 5 deletions(-)
> 
> I'm trying to run a guest on my Rome system using the queue branch, but
> I'm encountering an error that I bisected to this commit. In the guest
> (during OVMF boot) I see:
> 
> error: kvm run failed Invalid argument
> RAX=0000000000000000 RBX=00000000ffc12792 RCX=000000007f58401a RDX=000000007faaf808
> RSI=0000000000000010 RDI=00000000ffc12792 RBP=00000000ffc12792 RSP=000000007faaf740
> R8 =0000000000000792 R9 =000000007faaf808 R10=00000000ffc12793 R11=00000000000003f8
> R12=0000000000000010 R13=0000000000000000 R14=000000007faaf808 R15=0000000000000012
> RIP=000000007f6e9a90 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT=     000000007f5ee698 00000047
> IDT=     000000007f186018 00000fff
> CR0=80010033 CR2=0000000000000000 CR3=000000007f801000 CR4=00000668
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000d00
> Code=22 00 00 e8 c0 e6 ff ff 48 83 c4 20 45 84 ed 74 07 fb eb 04 <44> 88 65 00 58 5b 5d 41 5c 41 5d c3 55 48 0f af 3d 1b 37 00 00 be 20 00 00 00 48 03 3d 17
> 
> On the hypervisor, I see the following:
> 
> [   55.886136] get_mmio_spte: detect reserved bits on spte, addr 0xffc12792, dump hierarchy:
> [   55.895284] ------ spte 0x1344a0827 level 4.
> [   55.900059] ------ spte 0x134499827 level 3.
> [   55.904877] ------ spte 0x165bf0827 level 2.
> [   55.909651] ------ spte 0xff800ffc12817 level 1.

Ah fudge.  I know what's wrong.  The MMIO generation uses bit 11, which means
is_shadow_present_pte() can get a false positive on high MMIO generations.  This
particular error can be squashed by explicitly checking for MMIO sptes in
get_mmio_spte(), but I'm guessing there are other flows where a false positive
is fatal (probably the __pte_list_remove bug below...).  The safe thing to do is
to steal bit 11 from MMIO SPTEs so that shadow present PTEs are the only thing
that sets that bit.

I'll reproduce by stuffing the MMIO generation and get a patch posted.  Sorry :-/

> When I kill the guest, I get a kernel panic:
> 
> [   95.539683] __pte_list_remove: 0000000040567a6a 0->BUG
> [   95.545481] kernel BUG at arch/x86/kvm/mmu/mmu.c:896!
Sean Christopherson March 8, 2021, 9:49 p.m. UTC | #4
On Mon, Mar 08, 2021, Sean Christopherson wrote:
> On Mon, Mar 08, 2021, Tom Lendacky wrote:
> > On the hypervisor, I see the following:
> > 
> > [   55.886136] get_mmio_spte: detect reserved bits on spte, addr 0xffc12792, dump hierarchy:
> > [   55.895284] ------ spte 0x1344a0827 level 4.
> > [   55.900059] ------ spte 0x134499827 level 3.
> > [   55.904877] ------ spte 0x165bf0827 level 2.
> > [   55.909651] ------ spte 0xff800ffc12817 level 1.
> 
> Ah fudge.  I know what's wrong.  The MMIO generation uses bit 11, which means
> is_shadow_present_pte() can get a false positive on high MMIO generations.  This
> particular error can be squashed by explicitly checking for MMIO sptes in
> get_mmio_spte(), but I'm guessing there are other flows where a false positive
> is fatal (probably the __pte_list_remove bug below...).  The safe thing to do is
> to steal bit 11 from MMIO SPTEs so that shadow present PTEs are the only thing
> that sets that bit.
> 
> I'll reproduce by stuffing the MMIO generation and get a patch posted.  Sorry :-/
> 
> > When I kill the guest, I get a kernel panic:
> > 
> > [   95.539683] __pte_list_remove: 0000000040567a6a 0->BUG
> > [   95.545481] kernel BUG at arch/x86/kvm/mmu/mmu.c:896!

Fudging get_mmio_spte() did allow the guest to boot, but as expected KVM panicked
during guest shutdown.  Initial testing on the below changes look good, I'll
test more thoroughly and (hopefully) post later today.

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index b53036d9ddf3..e6e683e0fdcd 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -101,11 +101,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 #undef SHADOW_ACC_TRACK_SAVED_MASK

 /*
- * Due to limited space in PTEs, the MMIO generation is a 20 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
  * the memslots generation and is derived as follows:
  *
- * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-19 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
+ * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -116,7 +116,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
  */

 #define MMIO_SPTE_GEN_LOW_START                3
-#define MMIO_SPTE_GEN_LOW_END          11
+#define MMIO_SPTE_GEN_LOW_END          10

 #define MMIO_SPTE_GEN_HIGH_START       52
 #define MMIO_SPTE_GEN_HIGH_END         62
@@ -125,12 +125,14 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
                                                    MMIO_SPTE_GEN_LOW_START)
 #define MMIO_SPTE_GEN_HIGH_MASK                GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
                                                    MMIO_SPTE_GEN_HIGH_START)
+static_assert(!(SPTE_MMU_PRESENT_MASK &
+               (MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));

 #define MMIO_SPTE_GEN_LOW_BITS         (MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
 #define MMIO_SPTE_GEN_HIGH_BITS                (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)

 /* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 9 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);

 #define MMIO_SPTE_GEN_LOW_SHIFT                (MMIO_SPTE_GEN_LOW_START - 0)
 #define MMIO_SPTE_GEN_HIGH_SHIFT       (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d12acf5eb871..e07aabb23b8a 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -94,7 +94,7 @@  int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 		     bool can_unsync, bool host_writable, bool ad_disabled,
 		     u64 *new_spte)
 {
-	u64 spte = 0;
+	u64 spte = SPTE_MMU_PRESENT_MASK;
 	int ret = 0;
 
 	if (ad_disabled)
@@ -183,10 +183,10 @@  int make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
 
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
-	u64 spte;
+	u64 spte = SPTE_MMU_PRESENT_MASK;
 
-	spte = __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
-	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
+	spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
+		shadow_user_mask | shadow_x_mask | shadow_me_mask;
 
 	if (ad_disabled)
 		spte |= SPTE_TDP_AD_DISABLED_MASK;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 8996baa8da15..645e9bc2d4a2 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -5,6 +5,15 @@ 
 
 #include "mmu_internal.h"
 
+/*
+ * A MMU present SPTE is backed by actual memory and may or may not be present
+ * in hardware.  E.g. MMIO SPTEs are not considered present.  Use bit 11, as it
+ * is ignored by all flavors of SPTEs and checking a low bit often generates
+ * better code than for a high bit, e.g. 56+.  MMU present checks are pervasive
+ * enough that the improved code generation is noticeable in KVM's footprint.
+ */
+#define SPTE_MMU_PRESENT_MASK		BIT_ULL(11)
+
 /*
  * TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
  * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -241,7 +250,7 @@  static inline bool is_access_track_spte(u64 spte)
 
 static inline bool is_shadow_present_pte(u64 pte)
 {
-	return (pte != 0) && !is_mmio_spte(pte) && !is_removed_spte(pte);
+	return !!(pte & SPTE_MMU_PRESENT_MASK);
 }
 
 static inline bool is_large_pte(u64 pte)