
[5.4,1/1] KVM: SEV: add cache flush to solve SEV cache incoherency issues

Message ID 20220926145247.3688090-1-ovidiu.panait@windriver.com (mailing list archive)
State New, archived
Series [5.4,1/1] KVM: SEV: add cache flush to solve SEV cache incoherency issues

Commit Message

Ovidiu Panait Sept. 26, 2022, 2:52 p.m. UTC
From: Mingwei Zhang <mizhang@google.com>

commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

Flush the CPU caches when memory is reclaimed from an SEV guest (where
reclaim also includes it being unmapped from KVM's memslots).  Due to lack
of coherency for SEV encrypted memory, failure to flush results in silent
data corruption if userspace is malicious/broken and doesn't ensure SEV
guest memory is properly pinned and unpinned.

Cache coherency is not enforced across the VM boundary in SEV (AMD APM
vol.2 Section 15.34.7). Confidential cachelines generated by confidential
VM guests have to be explicitly flushed on the host side. If a memory page
containing dirty confidential cachelines was released by the VM and
reallocated to another user, the cachelines may corrupt the new user at a
later time.

KVM takes a shortcut by assuming all confidential memory remains pinned
until the end of the VM's lifetime. Therefore, KVM does not flush the
cache at mmu_notifier invalidation events. Because of this incorrect
assumption and the lack of cache flushing, malicious userspace can crash
the host kernel by creating a malicious VM and continuously
allocating/releasing unpinned confidential memory pages while the VM is
running.

Add cache flush operations to the mmu_notifier operations to ensure that
any physical memory leaving the guest VM gets flushed. In particular, hook
the mmu_notifier_invalidate_range_start and mmu_notifier_release events
and flush the cache accordingly. The hooks are invoked after releasing the
mmu lock to avoid contention with other vCPUs.

Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220421031407.2516575-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: applied kvm_arch_guest_memory_reclaimed() calls in
__kvm_set_memory_region() and kvm_mmu_notifier_invalidate_range_start();
OP: adjusted kvm_arch_guest_memory_reclaimed() to not use static_call_cond()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
---
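For reference, the upstream commit wires the x86 hook through a static
call, which is not available in 5.4, hence the open-coded NULL check in
this backport. The upstream form is roughly the following (the static-call
name follows KVM's kvm_x86_<op> convention; shown only for comparison):

void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
{
	/* Upstream uses static_call_cond(); 5.4 has no static calls. */
	static_call_cond(kvm_x86_guest_memory_reclaimed)(kvm);
}
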
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              |  9 +++++++++
 arch/x86/kvm/x86.c              |  6 ++++++
 include/linux/kvm_host.h        |  2 ++
 virt/kvm/kvm_main.c             | 16 ++++++++++++++--
 5 files changed, 32 insertions(+), 2 deletions(-)

Comments

Liam Merwick Sept. 26, 2022, 7 p.m. UTC | #1
On 26/09/2022 15:52, Ovidiu Panait wrote:
> From: Mingwei Zhang <mizhang@google.com>
> 
> commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>

Kalra, Ashish Sept. 27, 2022, 12:07 a.m. UTC | #2
With this patch applied, we are observing soft lockup and RCU stall issues on SNP guests with
128 vCPUs assigned and >=10GB guest memory allocations.

From the call stack dumps, it looks like migrate_pages() gets invoked for hugepages and triggers the
invalidate_range_start() MMU notifiers; correspondingly, kvm_mmu_notifier_invalidate_range_start()
invokes sev_guest_memory_reclaimed(), which internally does wbinvd_on_all_cpus(). This can potentially
cause long delays, especially on systems with a large physical CPU count (the one we are testing on
has 500 CPUs), and thus delay guest re-entry, causing soft lockups and RCU stalls in the guest.

Here are the kstack dumps of vCPU thread(s) invoking the invalidate_range_start() MMU notifiers: 

#1: 
[  913.377780] CPU: 79 PID: 6538 Comm: qemu-system-x86 Not tainted 5.19.0-rc5-next-20220706-sev-es-snp+ #380
[  913.377783] Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ1002C 09/15/2022
[  913.377785] Call Trace:
[  913.377788]  <TASK>
[  913.304300]  sev_guest_memory_reclaimed.cold+0x18/0x22
[  913.304303]  kvm_arch_guest_memory_reclaimed+0x12/0x20
[  913.304309]  kvm_mmu_notifier_invalidate_range_start+0x2af/0x2e0
[  913.304312]  ? kvm_mmu_notifier_invalidate_range_end+0x101/0x1c0
[  913.304314]  __mmu_notifier_invalidate_range_start+0x83/0x190
[  913.304320]  try_to_migrate_one+0xba9/0xd80
[  913.304326]  rmap_walk_anon+0x166/0x360
[  913.304329]  rmap_walk+0x28/0x40
[  913.304331]  try_to_migrate+0x92/0xd0
[  913.304334]  ? try_to_unmap_one+0xe60/0xe60
[  913.304336]  ? anon_vma_ctor+0x50/0x50
[  913.304339]  ? page_get_anon_vma+0x80/0x80
[  913.304341]  ? invalid_mkclean_vma+0x20/0x20
[  913.304343]  migrate_pages+0x1276/0x1720
[  913.304346]  ? do_pages_stat+0x310/0x310
[  913.304348]  migrate_misplaced_page+0x5d0/0x820
[  913.304351]  do_huge_pmd_numa_page+0x1f7/0x4b0
[  913.304354]  __handle_mm_fault+0x66a/0x1040
[  913.304358]  handle_mm_fault+0xe4/0x2d0
[  913.304361]  __get_user_pages+0x1ea/0x710
[  913.304363]  get_user_pages_unlocked+0xd0/0x340
[  913.304365]  hva_to_pfn+0xf7/0x440
[  913.304367]  __gfn_to_pfn_memslot+0x7f/0xc0
[  913.304369]  kvm_faultin_pfn+0x95/0x280
[  913.304373]  direct_page_fault+0x201/0x800
[  913.304375]  kvm_tdp_page_fault+0x72/0x80
[  913.304377]  kvm_mmu_page_fault+0x136/0x710
[  913.304379]  ? kvm_complete_insn_gp+0x37/0x40
[  913.304382]  ? svm_complete_emulated_msr+0x52/0x60
[  913.304384]  ? kvm_emulate_wrmsr+0x6c/0x160
[  913.304387]  ? sev_handle_vmgexit+0x115a/0x1600
[  913.304390]  npf_interception+0x50/0xd0
[  913.304391]  svm_invoke_exit_handler+0xf5/0x130 
[  913.304394]  svm_handle_exit+0x11c/0x230
[  913.304396]  vcpu_enter_guest+0x832/0x12e0
[  913.304396]  ? kvm_apic_local_deliver+0x6a/0x70
[  913.304401]  ? kvm_inject_apic_timer_irqs+0x2c/0x70
[  913.304403]  kvm_arch_vcpu_ioctl_run+0x105/0x680

#2: 
[  913.378680] CPU: 79 PID: 6538 Comm: qemu-system-x86 Not tainted 5.19.0-rc5-next-20220706-sev-es-snp+ #380
[  913.378683] Hardware name: AMD Corporation QUARTZ/QUARTZ, BIOS RQZ1002C 09/15/2022
[  913.378685] Call Trace:
[  913.378687]  <TASK>
[  913.378699]  sev_guest_memory_reclaimed.cold+0x18/0x22
[  913.378702]  kvm_arch_guest_memory_reclaimed+0x12/0x20
[  913.378707]  kvm_mmu_notifier_invalidate_range_start+0x2af/0x2e0
[  913.378711]  __mmu_notifier_invalidate_range_start+0x83/0x190
[  913.378715]  change_protection+0x11ec/0x1420
[  913.378720]  ? kvm_release_pfn_clean+0x2f/0x40
[  913.378722]  change_prot_numa+0x66/0xb0
[  913.378724]  task_numa_work+0x22c/0x3b0
[  913.378729]  task_work_run+0x72/0xb0
[  913.378732]  xfer_to_guest_mode_handle_work+0xfc/0x100
[  913.378738]  kvm_arch_vcpu_ioctl_run+0x422/0x680

Additionally, it causes other vCPU threads handling #NPF to block, as the above code path(s) are holding
mm->mmap_lock. Following are the kstack dumps of the blocked vCPU threads:

[  316.969254] task:qemu-system-x86 state:D stack:    0 pid: 6939 ppid:  6908 flags:0x00000000
[  316.969256] Call Trace:
[  316.969257]  <TASK>
[  316.969258]  __schedule+0x350/0x900
[  316.969262]  schedule+0x52/0xb0
[  316.969265]  rwsem_down_read_slowpath+0x271/0x4b0
[  316.969267]  down_read+0x47/0xa0
[  316.969269]  get_user_pages_unlocked+0x6b/0x340
[  316.969273]  hva_to_pfn+0xf7/0x440
[  316.969277]  __gfn_to_pfn_memslot+0x7f/0xc0
[  316.969279]  kvm_faultin_pfn+0x95/0x280
[  316.969283]  ? kvm_apic_send_ipi+0x9c/0x100
[  316.969287]  direct_page_fault+0x201/0x800
[  316.969290]  kvm_tdp_page_fault+0x72/0x80
[  316.969293]  kvm_mmu_page_fault+0x136/0x710
[  316.969296]  ? xas_load+0x35/0x40
[  316.969299]  ? xas_find+0x187/0x1c0
[  316.969301]  ? xa_find_after+0xf1/0x110
[  316.969304]  ? kvm_pmu_trigger_event+0x5e/0x1e0
[  316.969307]  ? sysvec_call_function+0x52/0x90
[  316.969310]  npf_interception+0x50/0xd0

The invocation of migrate_pages() as in the following code path
does not seem right: 
    
    do_huge_pmd_numa_page
      migrate_misplaced_page
        migrate_pages
    
as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is the page migration code path getting invoked at all ?

Thanks,
Ashish
Sean Christopherson Sept. 27, 2022, 12:37 a.m. UTC | #3
On Tue, Sep 27, 2022, Ashish Kalra wrote:
> With this patch applied, we are observing soft lockup and RCU stall issues on
> SNP guests with 128 vCPUs assigned and >=10GB guest memory allocations.

...

> From the call stack dumps, it looks like migrate_pages() ...
> The invocation of migrate_pages() as in the following code path does not seem right:
>     
>     do_huge_pmd_numa_page
>       migrate_misplaced_page
>         migrate_pages
>     
> as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is the
> page migration code path getting invoked at all ?

LOL, I feel your pain.  It's the wonderful NUMA autobalancing code.  It's been a
while since I looked at the code, but IIRC, it "works" by zapping PTEs for pages that
aren't allocated on the "right" node without checking if page migration is actually
possible.

The actual migration is done on the subsequent page fault.  In this case, the
balancer detects that the page can't be migrated and reinstalls the original PTE.

I don't know if using FOLL_LONGTERM would help?  Again, been a while.  The workaround
I've used in the past is to simply disable the balancer, e.g.

  CONFIG_NUMA_BALANCING=n

or
  
  numa_balancing=disable

on the kernel command line.
Ovidiu Panait Sept. 27, 2022, 8:03 a.m. UTC | #4
Hi Greg,

On 9/26/22 17:52, Ovidiu Panait wrote:
> From: Mingwei Zhang <mizhang@google.com>
>
> commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.
Please ignore this 5.4 backport, as it introduces soft lockups in 
certain scenarios:
https://lore.kernel.org/kvm/YzJFvWPb1syXcVQm@google.com/T/#mb79712b3d141cabb166b504984f6058b01e30c63 



Ovidiu

Kalra, Ashish Oct. 6, 2022, 5:36 p.m. UTC | #5
Hello All,

Here is a summary of all the discussions and the options we have
considered off-list:

For SNP guests we don't need to invoke the MMU invalidation notifiers 
and the cache flush should be done at the point of RMP ownership change 
instead of mmu_notifier, which will be when the unregister_enc_region 
ioctl is called, but as we don't trust the userspace (which can bypass 
this ioctl), therefore we continue to use the MMU invalidation 
notifiers. With UPM support added, we will be avoiding the RMP #PF split 
code path to split the host pagetable to be in sync with the RMP table 
entries, and therefore, mmu_notifier invoked from __split_huge_pmd() 
won’t be of concern.

For the MMU invalidation notifiers we are going to make two changes 
currently:

1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 
2MB.

But this is not that straightforward: on SME_COHERENT platforms (Milan
and beyond), clflush/clflushopt will flush guest-tagged cache entries,
but before Milan (!SME_COHERENT) we will need to use either the
VM_PAGE_FLUSH MSR or wbinvd to flush guest-tagged cache entries. So for
non-SME_COHERENT platforms there is no change and effectively no
optimization (a rough flush sketch follows after this list).

2). We also add filtering in the mmu_notifier path (from Sean's patch),
which invokes the MMU invalidation notifiers depending on the flag passed
to the notifier.
This will help reduce the overhead with NUMA balancing and especially
eliminates the mmu_notifier invocations for the change_protection case
(a filtering sketch follows after this list).
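
To illustrate 1), here is a rough sketch of what such a range-limited
flush could look like, modeled loosely on KVM's existing per-page
sev_clflush_pages() helper (the helper name and the 2MB cutoff below are
illustrative only, not taken from an actual patch):

static void sev_flush_pinned_pages(struct page **pages, unsigned long npages)
{
	unsigned long i;
	u8 *vaddr;

	/*
	 * Pre-Milan parts (!SME_COHERENT) cannot flush guest-tagged
	 * cachelines with CLFLUSH, and larger ranges are handled with a
	 * full WBINVD, so fall back in both cases.
	 */
	if (!boot_cpu_has(X86_FEATURE_SME_COHERENT) ||
	    npages > (SZ_2M >> PAGE_SHIFT)) {
		wbinvd_on_all_cpus();
		return;
	}

	for (i = 0; i < npages; i++) {
		vaddr = kmap_atomic(pages[i]);
		clflush_cache_range(vaddr, PAGE_SIZE);
		kunmap_atomic(vaddr);
	}
}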

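To illustrate 2), here is a minimal sketch of event-based filtering in the
notifier path (the event values below are the existing mmu_notifier_range
events; whether Sean's patch filters on exactly these is an assumption on
my part):

static bool kvm_mmu_range_reclaims_memory(const struct mmu_notifier_range *range)
{
	switch (range->event) {
	case MMU_NOTIFY_PROTECTION_VMA:
	case MMU_NOTIFY_PROTECTION_PAGE:
	case MMU_NOTIFY_SOFT_DIRTY:
		/* Protection-only changes don't free guest pages. */
		return false;
	default:
		return true;
	}
}

kvm_mmu_notifier_invalidate_range_start() would then call
kvm_arch_guest_memory_reclaimed() only when this returns true, which
avoids the wbinvd for the change_protection case mentioned above.
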
Thanks,
Ashish
Sean Christopherson Oct. 7, 2022, 1:15 a.m. UTC | #6
On Thu, Oct 06, 2022, Kalra, Ashish wrote:
> For the MMU invalidation notifiers we are going to make two changes
> currently:
> 
> 1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 2MB.

IMO, this isn't worth pursuing, to the point where I might object to this code
being added upstream.  Avoiding WBINVD for the mmu_notifiers doesn't prevent a
malicious userspace from using SEV-induced WBINVD to effectively DoS the host,
e.g. userspace can simply ADD+DELETE memslots, or mprotect() chunks > 2mb.

Using clflushopt also effectively puts a requirement on mm/ that the notifiers
be invoked _before_ PTEs are modified in the primary MMU, otherwise KVM may not
be able to resolve the VA=>PFN, or even worse, resolve the wrong PFN.

And no sane VMM should be modifying userspace mappings that cover SEV guest memory
at any reasonable rate.

In other words, switching to CLFLUSHOPT for SEV+SEV-ES VMs is effectively a
band-aid for the NUMA balancing issue.  A far better solution for NUMA balancing
would be to pursue a fix for the underlying problem, e.g. disable NUMA balancing
entirely for SEV/SEV-ES VMs.  That might already be doable from userspace by
manipulating memory policy, and if not there's a WIP patch[*] that would make it
trivial for the userspace VMM to disable NUMA balancing.

As for guarding against DoS, /dev/sev should really be locked down so that only
sufficiently privileged users can create SEV VMs.

[*] https://lore.kernel.org/all/20220929064359.46932-1-ligang.bdlg@bytedance.com
Mingwei Zhang Oct. 7, 2022, 6:03 a.m. UTC | #7
On Thu, Oct 6, 2022 at 6:15 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Oct 06, 2022, Kalra, Ashish wrote:
> > For the MMU invalidation notifiers we are going to make two changes
> > currently:
> >
> > 1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 2MB.
>
> IMO, this isn't worth pursuing, to the point where I might object to this code
> being added upstream.  Avoiding WBINVD for the mmu_notifiers doesn't prevent a
> malicious userspace from using SEV-induced WBINVD to effectively DoS the host,
> e.g. userspace can simply ADD+DELETE memslots, or mprotect() chunks > 2mb.
>

I think using clflush/clflushopt is a tactical workaround for SNP VMs.
As mentioned earlier by Ashish:

"For SNP guests we don't need to invoke the MMU invalidation notifiers
and the cache flush should be done at the point of RMP ownership change
instead of mmu_notifier, which will be when the unregister_enc_region
ioctl is called, but as we don't trust the userspace (which can bypass
this ioctl), therefore we continue to use the MMU invalidation
notifiers."

So if that is true: SNP VMs also have to use mmu_notifiers for
splitting the PMDs, then I think using clflush/clflushopt might be the
only workaround that I know of.

> > Using clflushopt also effectively puts a requirement on mm/ that the notifiers
> be invoked _before_ PTEs are modified in the primary MMU, otherwise KVM may not
> be able to resolve the VA=>PFN, or even worse, resolve the wrong PFN.

I don't understand this. Isn't it always true that MM should fire
mmu_notifiers before modifying PTEs in host MMU? This should be a
strict rule as in my knowledge, no?

>
> And no sane VMM should be modifying userspace mappings that cover SEV guest memory
> at any reasonable rate.
>
> > In other words, switching to CLFLUSHOPT for SEV+SEV-ES VMs is effectively a
> band-aid for the NUMA balancing issue.

That's not true. KSM might also use the same mechanism. For NUMA
balancing and KSM, there seems to be a pattern: blindly flushing
mmu_notifier first, then try to do the actual work.

I have a limited knowledge on MM, but from my observations, it looks
like the property of a page being "PINNED" is very unreliable (or
expensive), i.e., anyone can jump in and pin the page. So it is hard
to see whether a page is truly "PINNED" or maybe just someone is
"working" on it without holding the lock. Holding the refcount of a
struct page requires a spinlock. I suspect that might be the reason
why NUMA balancing and KSM is just aggressively firing mmu_notifiers
first. I don't know if there is other stuff in MM following the same
pattern.

Concretely, my deep worry is the mmu_notifier in try_to_unmap_one(). I
cannot enumerate all of the callers. But if there is someone who calls
into this, it might be a disaster level (4K) performance lock. Hope we
can prove that won't happen.
Sean Christopherson Oct. 7, 2022, 3:51 p.m. UTC | #8
On Thu, Oct 06, 2022, Mingwei Zhang wrote:
> On Thu, Oct 6, 2022 at 6:15 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Oct 06, 2022, Kalra, Ashish wrote:
> > > For the MMU invalidation notifiers we are going to make two changes
> > > currently:
> > >
> > > 1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 2MB.
> >
> > IMO, this isn't worth pursuing, to the point where I might object to this code
> > being added upstream.  Avoiding WBINVD for the mmu_notifiers doesn't prevent a
> > malicious userspace from using SEV-induced WBINVD to effectively DoS the host,
> > e.g. userspace can simply ADD+DELETE memslots, or mprotect() chunks > 2mb.
> >
> 
> I think using clflush/clflushopt is a tactical workaround for SNP VMs.
> As mentioned earlier by Ashish:
> 
> "For SNP guests we don't need to invoke the MMU invalidation notifiers
> and the cache flush should be done at the point of RMP ownership change
> instead of mmu_notifier, which will be when the unregister_enc_region
> ioctl is called, but as we don't trust the userspace (which can bypass
> this ioctl), therefore we continue to use the MMU invalidation
> notifiers."
> 
> So if that is true: SNP VMs also have to use mmu_notifiers for
> splitting the PMDs

SNP private memory will be backed by UPM, and thus won't have host PMDs to split.
And splitting guest PMDs to convert between shared and private doesn't initiate
an RMP ownership change of the underlying physical page.  The ownership of a page
will change back to "hypervisor" only when the page is truly freed from the private
backing store.

> then I think using clflush/clflushopt might be the
> only workaround that I know of.
> 
> > > Using clflushopt also effectively puts a requirement on mm/ that the notifiers
> > be invoked _before_ PTEs are modified in the primary MMU, otherwise KVM may not
> > be able to resolve the VA=>PFN, or even worse, resolve the wrong PFN.
> 
> I don't understand this. Isn't it always true that MM should fire
> mmu_notifiers before modifying PTEs in host MMU? This should be a
> strict rule as in my knowledge, no?

It will hold true for all flavors of invalidate_range, but not for change_pte().
The original intent of change_pte() is that it would be called without an
invalidation, _after_ the primary MMU changed its PTE, so that secondary MMUs
could pick up the new PTE without causing a VM-Exit.

E.g. snippets from the original changelog and code:

    The set_pte_at_notify() macro allows setting a pte in the shadow page
    table directly, instead of flushing the shadow page table entry and then
    getting vmexit to set it.  It uses a new change_pte() callback to do so.
    
    set_pte_at_notify() is an optimization for kvm, and other users of
    mmu_notifiers, for COW pages.  It is useful for kvm when ksm is used,
    because it allows kvm not to have to receive vmexit and only then map the
    ksm page into the shadow page table, but instead map it directly at the
    same time as Linux maps the page into the host page table.
    

                /*
                 * Clear the pte entry and flush it first, before updating the
                 * pte with the new entry. This will avoid a race condition
                 * seen in the presence of one thread doing SMC and another
                 * thread doing COW.
                 */
                ptep_clear_flush(vma, address, page_table);
                page_add_new_anon_rmap(new_page, vma, address);
                /*
                 * We call the notify macro here because, when using secondary
                 * mmu page tables (such as kvm shadow page tables), we want the
                 * new page to be mapped directly into the secondary page table.
                 */
                set_pte_at_notify(mm, address, page_table, entry);


#define set_pte_at_notify(__mm, __address, __ptep, __pte)               \
({                                                                      \
        struct mm_struct *___mm = __mm;                                 \
        unsigned long ___address = __address;                           \
        pte_t ___pte = __pte;                                           \
                                                                        \
        set_pte_at(___mm, ___address, __ptep, ___pte);                  \
        mmu_notifier_change_pte(___mm, ___address, ___pte);             \
})


That behavior got smushed by commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
with invalidate_range_start and invalidate_range_end"), which despite clearly
intending to wrap change_pte() with an invalidation, completely overlooked the original
intent of the change_pte() notifier.

Whether or not anyone cares enough about KSM to fix that remains to be seen, but
I don't think KVM should put restrictions on change_pte() just so that KVM can
put a band-aid on a wart for a very specific type of VM.

> > And no sane VMM should be modifying userspace mappings that cover SEV guest memory
> > at any reasonable rate.
> >
> > > In other words, switching to CLFLUSHOPT for SEV+SEV-ES VMs is effectively a
> > band-aid for the NUMA balancing issue.
> 
> That's not true. KSM might also use the same mechanism. For NUMA
> balancing and KSM, there seems to be a pattern: blindly flushing
> mmu_notifier first, then try to do the actual work.

See above for KSM.

> I have a limited knowledge on MM, but from my observations, it looks
> like the property of a page being "PINNED" is very unreliable (or
> expensive), i.e., anyone can jump in and pin the page. So it is hard
> to see whether a page is truly "PINNED" or maybe just someone is
> "working" on it without holding the lock.

mm/ differentiates between various types of pins, e.g. elevated refcount vs. pin
vs. longterm pin.  See the comments in include/linux/mm.h for FOLL_PIN.

> Holding the refcount of a struct page requires a spinlock. I suspect that
> might be the reason why NUMA balancing and KSM is just aggressively firing
> mmu_notifiers first. I don't know if there is other stuff in MM following the
> same pattern.
> 
> Concretely, my deep worry is the mmu_notifier in try_to_unmap_one(). I
> cannot enumerate all of the callers. But if there is someone who calls
> into this, it might be a disaster level (4K) performance lock.

Like NUMA balancing, anything that triggers unmapping of SEV memory is a problem
that needs to be fixed.  I'm arguing that we should fix the actual problems, not
put a band-aid on things to make the problems less painful.

>  Hope we can prove that won't happen.

Truly proving it can't happen is likely infeasible.  But I also think it's
unnecessary and would be a poor use of time/resources to try and hunt down every
possible edge case.  Userspace is already highly motivated to ensure SEV guest memory
is as static as possible, and so edge cases that are theoretically possible will
likely not be problematic in practice.

And if there are spurious unmappings, they likely affect non-SEV VMs as well, just
to a much lesser degree.  I.e. fixing the underlying problems would benefit all
KVM VMs, not just SEV VMs.
Sean Christopherson Oct. 7, 2022, 5 p.m. UTC | #9
On Fri, Oct 07, 2022, Sean Christopherson wrote:
> On Thu, Oct 06, 2022, Mingwei Zhang wrote:
> > I have a limited knowledge on MM, but from my observations, it looks
> > like the property of a page being "PINNED" is very unreliable (or
> > expensive), i.e., anyone can jump in and pin the page. So it is hard
> > to see whether a page is truly "PINNED" or maybe just someone is
> > "working" on it without holding the lock.
> 
> mm/ differentiates between various types of pins, e.g. elevated refcount vs. pin
> vs. longterm pin.  See the comments in include/linux/mm.h for FOLL_PIN.

Ah, after catching up on the off-list thread, I suspect you're referring to the
fact that even longterm pins don't prevent zapping[*].

NUMA balancing - already discussed

PMD splitting - If necessary, solvable by introducing MMU_NOTIFY_SPLIT, as a true
                split doesn't reclaim memory, i.e. KVM doesn't need to do WBINVD.
                Unfortunately, not straightforward as __split_huge_pmd_locked() may
                zap instead of split, but it is solvable.

KSM - absolutely should not be turned on for SEV guests

[*] https://lore.kernel.org/linux-arm-kernel/YuEMkKY2RU%2F2KiZW@monolith.localdoman

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4bc476d7fa6c..7167f94ed250 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1204,6 +1204,7 @@  struct kvm_x86_ops {
 	int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
 	int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
+	void (*guest_memory_reclaimed)(struct kvm *kvm);
 
 	int (*get_msr_feature)(struct kvm_msr_entry *entry);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1efcc7d4bc88..95f1293babae 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5072,6 +5072,14 @@  static void reload_tss(struct kvm_vcpu *vcpu)
 	load_TR_desc();
 }
 
+static void sev_guest_memory_reclaimed(struct kvm *kvm)
+{
+	if (!sev_guest(kvm))
+		return;
+
+	wbinvd_on_all_cpus();
+}
+
 static void pre_sev_run(struct vcpu_svm *svm, int cpu)
 {
 	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
@@ -7385,6 +7393,7 @@  static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.mem_enc_op = svm_mem_enc_op,
 	.mem_enc_reg_region = svm_register_enc_region,
 	.mem_enc_unreg_region = svm_unregister_enc_region,
+	.guest_memory_reclaimed = sev_guest_memory_reclaimed,
 
 	.nested_enable_evmcs = NULL,
 	.nested_get_evmcs_version = NULL,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0b297583df8..bb391ff7a901 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8046,6 +8046,12 @@  void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
+void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
+{
+	if (kvm_x86_ops->guest_memory_reclaimed)
+		kvm_x86_ops->guest_memory_reclaimed(kvm);
+}
+
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	struct page *page = NULL;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index dd4cdad76b18..9a35585271d8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1408,6 +1408,8 @@  static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 					    unsigned long start, unsigned long end);
 
+void kvm_arch_guest_memory_reclaimed(struct kvm *kvm);
+
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
 #else
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0008fc49528a..b1cb2ef209ca 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -164,6 +164,10 @@  __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 {
 }
 
+__weak void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
+{
+}
+
 bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
 {
 	/*
@@ -324,6 +328,12 @@  void kvm_reload_remote_mmus(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
 }
 
+static void kvm_flush_shadow_all(struct kvm *kvm)
+{
+	kvm_arch_flush_shadow_all(kvm);
+	kvm_arch_guest_memory_reclaimed(kvm);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
@@ -435,6 +445,7 @@  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
+	kvm_arch_guest_memory_reclaimed(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return 0;
@@ -538,7 +549,7 @@  static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	kvm_arch_flush_shadow_all(kvm);
+	kvm_flush_shadow_all(kvm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -844,7 +855,7 @@  static void kvm_destroy_vm(struct kvm *kvm)
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
 	mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
 #else
-	kvm_arch_flush_shadow_all(kvm);
+	kvm_flush_shadow_all(kvm);
 #endif
 	kvm_arch_destroy_vm(kvm);
 	kvm_destroy_devices(kvm);
@@ -1143,6 +1154,7 @@  int __kvm_set_memory_region(struct kvm *kvm,
 		 *	- kvm_is_visible_gfn (mmu_check_roots)
 		 */
 		kvm_arch_flush_shadow_memslot(kvm, slot);
+		kvm_arch_guest_memory_reclaimed(kvm);
 
 		/*
 		 * We can re-use the old_memslots from above, the only difference