Message ID | b6bc54ed6c8ae4444f3acf1ed4386010783ad386.1606782580.git.ashish.kalra@amd.com
---|---
State | New, archived
Series | Add AMD SEV page encryption bitmap support.
On Tue, Dec 01, 2020, Ashish Kalra wrote: > From: Brijesh Singh <brijesh.singh@amd.com> > > KVM hypercall framework relies on alternative framework to patch the > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > apply_alternative() is called then it defaults to VMCALL. The approach > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > will be able to decode the instruction and do the right things. But > when SEV is active, guest memory is encrypted with guest key and > hypervisor will not be able to decode the instruction bytes. > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > will be used by the SEV guest to notify encrypted pages to the hypervisor. What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL and opt into VMCALL? It's a synthetic feature flag either way, and I don't think there are any existing KVM hypercalls that happen before alternatives are patched, i.e. it'll be a nop for sane kernel builds. I'm also skeptical that a KVM specific hypercall is the right approach for the encryption behavior, but I'll take that up in the patches later in the series. > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: "H. Peter Anvin" <hpa@zytor.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: "Radim Krčmář" <rkrcmar@redhat.com> > Cc: Joerg Roedel <joro@8bytes.org> > Cc: Borislav Petkov <bp@suse.de> > Cc: Tom Lendacky <thomas.lendacky@amd.com> > Cc: x86@kernel.org > Cc: kvm@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Reviewed-by: Steve Rutherford <srutherford@google.com> > Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com> > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> > --- > arch/x86/include/asm/kvm_para.h | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h > index 338119852512..bc1b11d057fc 100644 > --- a/arch/x86/include/asm/kvm_para.h > +++ b/arch/x86/include/asm/kvm_para.h > @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, > return ret; > } > > +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1, > + unsigned long p2, unsigned long p3) > +{ > + long ret; > + > + asm volatile("vmmcall" > + : "=a"(ret) > + : "a"(nr), "b"(p1), "c"(p2), "d"(p3) > + : "memory"); > + return ret; > +} > + > #ifdef CONFIG_KVM_GUEST > bool kvm_para_available(void); > unsigned int kvm_arch_para_features(void); > -- > 2.17.1 >
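A hedged sketch of what Sean's inversion could look like (not the code as merged): flip the existing KVM_HYPERCALL alternative so that VMMCALL is the build-time default, and opt back into VMCALL via a hypothetical synthetic X86_FEATURE_VMCALL flag set on Intel/VIA CPUs, mirroring today's X86_FEATURE_VMMCALL. A hypercall issued before apply_alternatives() then still works in an SEV guest:

/*
 * Sketch only. Assumption: X86_FEATURE_VMCALL is a new synthetic flag,
 * the mirror image of the existing X86_FEATURE_VMMCALL on AMD. The
 * current upstream macro is ALTERNATIVE("vmcall", "vmmcall",
 * X86_FEATURE_VMMCALL); this inverts it so the pre-patching default
 * is VMMCALL. Both instructions are 3 bytes, so patching is 1:1.
 */
#define KVM_HYPERCALL \
        ALTERNATIVE("vmmcall", "vmcall", X86_FEATURE_VMCALL)

static inline long kvm_hypercall3(unsigned int nr, unsigned long p1,
                                  unsigned long p2, unsigned long p3)
{
        long ret;

        asm volatile(KVM_HYPERCALL
                     : "=a"(ret)
                     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
                     : "memory");
        return ret;
}

With this inversion, the kvm_sev_hypercall3() helper added by the patch would become unnecessary, since the generic kvm_hypercall3() would already be safe before alternatives are applied.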
On 12/2/20 6:34 PM, Sean Christopherson wrote: > On Tue, Dec 01, 2020, Ashish Kalra wrote: >> From: Brijesh Singh <brijesh.singh@amd.com> >> >> KVM hypercall framework relies on alternative framework to patch the >> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before >> apply_alternative() is called then it defaults to VMCALL. The approach >> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor >> will be able to decode the instruction and do the right things. But >> when SEV is active, guest memory is encrypted with guest key and >> hypervisor will not be able to decode the instruction bytes. >> >> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall >> will be used by the SEV guest to notify encrypted pages to the hypervisor. > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > think there are any existing KVM hypercalls that happen before alternatives are > patched, i.e. it'll be a nop for sane kernel builds.

If we invert X86_FEATURE_VMMCALL to default to VMMCALL, then it should work fine without this patch. So far no hypercall has been made before the alternative patching takes place, but since page state changes can occur much earlier than the alternative patching, we need to default to VMMCALL when SEV is active.

> I'm also skeptical that a KVM specific hypercall is the right approach for the > encryption behavior, but I'll take that up in the patches later in the series.

Great, I am open to exploring other alternative approaches.

> >> Cc: Thomas Gleixner <tglx@linutronix.de> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: "H. Peter Anvin" <hpa@zytor.com> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Cc: "Radim Krčmář" <rkrcmar@redhat.com> >> Cc: Joerg Roedel <joro@8bytes.org> >> Cc: Borislav Petkov <bp@suse.de> >> Cc: Tom Lendacky <thomas.lendacky@amd.com> >> Cc: x86@kernel.org >> Cc: kvm@vger.kernel.org >> Cc: linux-kernel@vger.kernel.org >> Reviewed-by: Steve Rutherford <srutherford@google.com> >> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com> >> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> >> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> >> --- >> arch/x86/include/asm/kvm_para.h | 12 ++++++++++++ >> 1 file changed, 12 insertions(+) >> >> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h >> index 338119852512..bc1b11d057fc 100644 >> --- a/arch/x86/include/asm/kvm_para.h >> +++ b/arch/x86/include/asm/kvm_para.h >> @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, >> return ret; >> } >> >> +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1, >> + unsigned long p2, unsigned long p3) >> +{ >> + long ret; >> + >> + asm volatile("vmmcall" >> + : "=a"(ret) >> + : "a"(nr), "b"(p1), "c"(p2), "d"(p3) >> + : "memory"); >> + return ret; >> +} >> + >> #ifdef CONFIG_KVM_GUEST >> bool kvm_para_available(void); >> unsigned int kvm_arch_para_features(void); >> -- >> 2.17.1 >>
On 03/12/20 01:34, Sean Christopherson wrote: > On Tue, Dec 01, 2020, Ashish Kalra wrote: >> From: Brijesh Singh <brijesh.singh@amd.com> >> >> KVM hypercall framework relies on alternative framework to patch the >> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before >> apply_alternative() is called then it defaults to VMCALL. The approach >> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor >> will be able to decode the instruction and do the right things. But >> when SEV is active, guest memory is encrypted with guest key and >> hypervisor will not be able to decode the instruction bytes. >> >> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall >> will be used by the SEV guest to notify encrypted pages to the hypervisor. > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > think there are any existing KVM hypercalls that happen before alternatives are > patched, i.e. it'll be a nop for sane kernel builds. > > I'm also skeptical that a KVM specific hypercall is the right approach for the > encryption behavior, but I'll take that up in the patches later in the series. Do you think that it's the guest that should "donate" memory for the bitmap instead? Paolo > >> Cc: Thomas Gleixner <tglx@linutronix.de> >> Cc: Ingo Molnar <mingo@redhat.com> >> Cc: "H. Peter Anvin" <hpa@zytor.com> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Cc: "Radim Krčmář" <rkrcmar@redhat.com> >> Cc: Joerg Roedel <joro@8bytes.org> >> Cc: Borislav Petkov <bp@suse.de> >> Cc: Tom Lendacky <thomas.lendacky@amd.com> >> Cc: x86@kernel.org >> Cc: kvm@vger.kernel.org >> Cc: linux-kernel@vger.kernel.org >> Reviewed-by: Steve Rutherford <srutherford@google.com> >> Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com> >> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> >> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> >> --- >> arch/x86/include/asm/kvm_para.h | 12 ++++++++++++ >> 1 file changed, 12 insertions(+) >> >> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h >> index 338119852512..bc1b11d057fc 100644 >> --- a/arch/x86/include/asm/kvm_para.h >> +++ b/arch/x86/include/asm/kvm_para.h >> @@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, >> return ret; >> } >> >> +static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1, >> + unsigned long p2, unsigned long p3) >> +{ >> + long ret; >> + >> + asm volatile("vmmcall" >> + : "=a"(ret) >> + : "a"(nr), "b"(p1), "c"(p2), "d"(p3) >> + : "memory"); >> + return ret; >> +} >> + >> #ifdef CONFIG_KVM_GUEST >> bool kvm_para_available(void); >> unsigned int kvm_arch_para_features(void); >> -- >> 2.17.1 >> >
On Sun, Dec 06, 2020, Paolo Bonzini wrote: > On 03/12/20 01:34, Sean Christopherson wrote: > > On Tue, Dec 01, 2020, Ashish Kalra wrote: > > > From: Brijesh Singh <brijesh.singh@amd.com> > > > > > > KVM hypercall framework relies on alternative framework to patch the > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > > apply_alternative() is called then it defaults to VMCALL. The approach > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > > will be able to decode the instruction and do the right things. But > > > when SEV is active, guest memory is encrypted with guest key and > > > hypervisor will not be able to decode the instruction bytes. > > > > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > > will be used by the SEV guest to notify encrypted pages to the hypervisor. > > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > > think there are any existing KVM hypercalls that happen before alternatives are > > patched, i.e. it'll be a nop for sane kernel builds. > > > > I'm also skeptical that a KVM specific hypercall is the right approach for the > > encryption behavior, but I'll take that up in the patches later in the series. > > Do you think that it's the guest that should "donate" memory for the bitmap > instead?

No. Two things I'd like to explore:

1. Making the hypercall to announce/request private vs. shared common across hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). I'm concerned that we'll end up with multiple hypercalls that do more or less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a pipe dream, but I'd like to at least explore options before shoving in KVM-only hypercalls.

2. Tracking shared memory via a list of ranges instead of using a bitmap to track all of guest memory. For most use cases, the vast majority of guest memory will be private, most ranges will be 2mb+, and conversions between private and shared will be uncommon events, i.e. the overhead to walk and split/merge list entries is hopefully not a big concern. I suspect a list would consume far less memory, hopefully without impacting performance.
On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > On Sun, Dec 06, 2020, Paolo Bonzini wrote: > > On 03/12/20 01:34, Sean Christopherson wrote: > > > On Tue, Dec 01, 2020, Ashish Kalra wrote: > > > > From: Brijesh Singh <brijesh.singh@amd.com> > > > > > > > > KVM hypercall framework relies on alternative framework to patch the > > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > > > apply_alternative() is called then it defaults to VMCALL. The approach > > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > > > will be able to decode the instruction and do the right things. But > > > > when SEV is active, guest memory is encrypted with guest key and > > > > hypervisor will not be able to decode the instruction bytes. > > > > > > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > > > will be used by the SEV guest to notify encrypted pages to the hypervisor. > > > > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > > > think there are any existing KVM hypercalls that happen before alternatives are > > > patched, i.e. it'll be a nop for sane kernel builds. > > > > > > I'm also skeptical that a KVM specific hypercall is the right approach for the > > > encryption behavior, but I'll take that up in the patches later in the series. > > > > Do you think that it's the guest that should "donate" memory for the bitmap > > instead? > > No. Two things I'd like to explore: > > 1. Making the hypercall to announce/request private vs. shared common across > hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > I'm concerned that we'll end up with multiple hypercalls that do more or > less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > pipe dream, but I'd like to at least explore options before shoving in KVM- > only hypercalls. > > > 2. Tracking shared memory via a list of ranges instead of a using bitmap to > track all of guest memory. For most use cases, the vast majority of guest > memory will be private, most ranges will be 2mb+, and conversions between > private and shared will be uncommon events, i.e. the overhead to walk and > split/merge list entries is hopefully not a big concern. I suspect a list > would consume far less memory, hopefully without impacting performance. For a fancier data structure, I'd suggest an interval tree. Linux already has an rbtree-based interval tree implementation, which would likely work, and would probably assuage any performance concerns. Something like this would not be worth doing unless most of the shared pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had 60ish discontiguous shared regions. This is by no means a thorough search, but it's suggestive. If this is typical, then the bitmap would be far less efficient than most any interval-based data structure. You'd have to allow userspace to upper bound the number of intervals (similar to the maximum bitmap size), to prevent host OOMs due to malicious guests. There's something nice about the guest donating memory for this, since that would eliminate the OOM risk.
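The rbtree-based interval tree Steve refers to is Linux's lib/interval_tree.c. A minimal sketch of how a per-VM tree of unencrypted (shared) GFN ranges might be built on it is shown below; the names are illustrative, locking is omitted, and splitting of partially overlapping nodes is elided:

#include <linux/interval_tree.h>
#include <linux/slab.h>

/* Root of a hypothetical per-VM tree of unencrypted (shared) GFN ranges. */
static struct rb_root_cached shared_ranges = RB_ROOT_CACHED;

/* Record [start, last] (inclusive GFNs) as shared. */
static int shared_range_add(unsigned long start, unsigned long last)
{
        struct interval_tree_node *n = kzalloc(sizeof(*n), GFP_KERNEL);

        if (!n)
                return -ENOMEM;
        n->start = start;
        n->last = last;
        interval_tree_insert(n, &shared_ranges);
        return 0;
}

/* Drop all overlap with [start, last] when pages become private again. */
static void shared_range_del(unsigned long start, unsigned long last)
{
        struct interval_tree_node *n, *next;

        for (n = interval_tree_iter_first(&shared_ranges, start, last);
             n; n = next) {
                /* Find the successor before unlinking the current node. */
                next = interval_tree_iter_next(n, start, last);
                interval_tree_remove(n, &shared_ranges);
                kfree(n);       /* real code would split partial overlaps */
        }
}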
I don’t think that the bitmap by itself is really a performance bottleneck here. Thanks, Ashish > On Dec 7, 2020, at 9:10 PM, Steve Rutherford <srutherford@google.com> wrote: > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: >> >>> On Sun, Dec 06, 2020, Paolo Bonzini wrote: >>> On 03/12/20 01:34, Sean Christopherson wrote: >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote: >>>>> From: Brijesh Singh <brijesh.singh@amd.com> >>>>> >>>>> KVM hypercall framework relies on alternative framework to patch the >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before >>>>> apply_alternative() is called then it defaults to VMCALL. The approach >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor >>>>> will be able to decode the instruction and do the right things. But >>>>> when SEV is active, guest memory is encrypted with guest key and >>>>> hypervisor will not be able to decode the instruction bytes. >>>>> >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor. >>>> >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL >>>> and opt into VMCALL? It's a synthetic feature flag either way, and I don't >>>> think there are any existing KVM hypercalls that happen before alternatives are >>>> patched, i.e. it'll be a nop for sane kernel builds. >>>> >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the >>>> encryption behavior, but I'll take that up in the patches later in the series. >>> >>> Do you think that it's the guest that should "donate" memory for the bitmap >>> instead? >> >> No. Two things I'd like to explore: >> >> 1. Making the hypercall to announce/request private vs. shared common across >> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). >> I'm concerned that we'll end up with multiple hypercalls that do more or >> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a >> pipe dream, but I'd like to at least explore options before shoving in KVM- >> only hypercalls. >> >> >> 2. Tracking shared memory via a list of ranges instead of a using bitmap to >> track all of guest memory. For most use cases, the vast majority of guest >> memory will be private, most ranges will be 2mb+, and conversions between >> private and shared will be uncommon events, i.e. the overhead to walk and >> split/merge list entries is hopefully not a big concern. I suspect a list >> would consume far less memory, hopefully without impacting performance. > > For a fancier data structure, I'd suggest an interval tree. Linux > already has an rbtree-based interval tree implementation, which would > likely work, and would probably assuage any performance concerns. > > Something like this would not be worth doing unless most of the shared > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > 60ish discontiguous shared regions. This is by no means a thorough > search, but it's suggestive. If this is typical, then the bitmap would > be far less efficient than most any interval-based data structure. > > You'd have to allow userspace to upper bound the number of intervals > (similar to the maximum bitmap size), to prevent host OOMs due to > malicious guests. There's something nice about the guest donating > memory for this, since that would eliminate the OOM risk.
On 12/7/20 9:09 PM, Steve Rutherford wrote: > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: >> On Sun, Dec 06, 2020, Paolo Bonzini wrote: >>> On 03/12/20 01:34, Sean Christopherson wrote: >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote: >>>>> From: Brijesh Singh <brijesh.singh@amd.com> >>>>> >>>>> KVM hypercall framework relies on alternative framework to patch the >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before >>>>> apply_alternative() is called then it defaults to VMCALL. The approach >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor >>>>> will be able to decode the instruction and do the right things. But >>>>> when SEV is active, guest memory is encrypted with guest key and >>>>> hypervisor will not be able to decode the instruction bytes. >>>>> >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor. >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL >>>> and opt into VMCALL? It's a synthetic feature flag either way, and I don't >>>> think there are any existing KVM hypercalls that happen before alternatives are >>>> patched, i.e. it'll be a nop for sane kernel builds. >>>> >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the >>>> encryption behavior, but I'll take that up in the patches later in the series. >>> Do you think that it's the guest that should "donate" memory for the bitmap >>> instead? >> No. Two things I'd like to explore: >> >> 1. Making the hypercall to announce/request private vs. shared common across >> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). >> I'm concerned that we'll end up with multiple hypercalls that do more or >> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a >> pipe dream, but I'd like to at least explore options before shoving in KVM- >> only hypercalls. >> >> >> 2. Tracking shared memory via a list of ranges instead of a using bitmap to >> track all of guest memory. For most use cases, the vast majority of guest >> memory will be private, most ranges will be 2mb+, and conversions between >> private and shared will be uncommon events, i.e. the overhead to walk and >> split/merge list entries is hopefully not a big concern. I suspect a list >> would consume far less memory, hopefully without impacting performance. > For a fancier data structure, I'd suggest an interval tree. Linux > already has an rbtree-based interval tree implementation, which would > likely work, and would probably assuage any performance concerns. > > Something like this would not be worth doing unless most of the shared > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > 60ish discontiguous shared regions. This is by no means a thorough > search, but it's suggestive. If this is typical, then the bitmap would > be far less efficient than most any interval-based data structure. > > You'd have to allow userspace to upper bound the number of intervals > (similar to the maximum bitmap size), to prevent host OOMs due to > malicious guests. There's something nice about the guest donating > memory for this, since that would eliminate the OOM risk.

Tracking the list of ranges may not be a bad idea, especially if we use some kind of rbtree-based data structure to update the ranges. It will certainly be better than a bitmap, which grows with the guest memory size, and as you can see in practice most of the pages will be guest private. I am not sure that a guest donating memory will cover all the cases, though; e.g., if we do a memory hotplug (increase the guest RAM from 2GB to 64GB), will the donated memory range be enough to store the metadata?
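For scale, assuming the series' granularity of one bitmap bit per 4K guest page (and taking the ~60 discontiguous shared regions from Steve's GCP sample as typical), Brijesh's hotplug example works out roughly as follows:

        bitmap:  2 GiB guest -> 512K pages ->  64 KiB of bitmap
                64 GiB guest ->  16M pages ->   2 MiB of bitmap (32x growth on hotplug)
        ranges: ~60 entries * 16 bytes (u64 start + u64 end) ~= 1 KiB, independent of RAM size

So a donated buffer sized for the small guest would indeed be too small after hotplug, while a range list's footprint depends only on fragmentation, not on guest RAM size.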
Hello All, On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > >> On Sun, Dec 06, 2020, Paolo Bonzini wrote: > >>> On 03/12/20 01:34, Sean Christopherson wrote: > >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote: > >>>>> From: Brijesh Singh <brijesh.singh@amd.com> > >>>>> > >>>>> KVM hypercall framework relies on alternative framework to patch the > >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > >>>>> apply_alternative() is called then it defaults to VMCALL. The approach > >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > >>>>> will be able to decode the instruction and do the right things. But > >>>>> when SEV is active, guest memory is encrypted with guest key and > >>>>> hypervisor will not be able to decode the instruction bytes. > >>>>> > >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor. > >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > >>>> and opt into VMCALL? It's a synthetic feature flag either way, and I don't > >>>> think there are any existing KVM hypercalls that happen before alternatives are > >>>> patched, i.e. it'll be a nop for sane kernel builds. > >>>> > >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the > >>>> encryption behavior, but I'll take that up in the patches later in the series. > >>> Do you think that it's the guest that should "donate" memory for the bitmap > >>> instead? > >> No. Two things I'd like to explore: > >> > >> 1. Making the hypercall to announce/request private vs. shared common across > >> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > >> I'm concerned that we'll end up with multiple hypercalls that do more or > >> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > >> pipe dream, but I'd like to at least explore options before shoving in KVM- > >> only hypercalls. > >> > >> > >> 2. Tracking shared memory via a list of ranges instead of a using bitmap to > >> track all of guest memory. For most use cases, the vast majority of guest > >> memory will be private, most ranges will be 2mb+, and conversions between > >> private and shared will be uncommon events, i.e. the overhead to walk and > >> split/merge list entries is hopefully not a big concern. I suspect a list > >> would consume far less memory, hopefully without impacting performance. > > For a fancier data structure, I'd suggest an interval tree. Linux > > already has an rbtree-based interval tree implementation, which would > > likely work, and would probably assuage any performance concerns. > > > > Something like this would not be worth doing unless most of the shared > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > > 60ish discontiguous shared regions. This is by no means a thorough > > search, but it's suggestive. If this is typical, then the bitmap would > > be far less efficient than most any interval-based data structure. > > > > You'd have to allow userspace to upper bound the number of intervals > > (similar to the maximum bitmap size), to prevent host OOMs due to > > malicious guests. There's something nice about the guest donating > > memory for this, since that would eliminate the OOM risk. 
> > > Tracking the list of ranges may not be bad idea, especially if we use > the some kind of rbtree-based data structure to update the ranges. It > will certainly be better than bitmap which grows based on the guest > memory size and as you guys see in the practice most of the pages will > be guest private. I am not sure if guest donating a memory will cover > all the cases, e.g what if we do a memory hotplug (increase the guest > ram from 2GB to 64GB), will donated memory range will be enough to store > the metadata. > >.

With reference to internal discussions regarding the above, I am going to look into the specific items listed below:

1) "hypercall" related:
a) Explore the SEV-SNP page change request structure (included in the GHCB), and see if there is something common there that can be re-used for the SEV/SEV-ES page encryption status hypercalls.
b) Explore whether there is any common hypercall framework I can use in Linux/KVM.

2) Related to the "backing" data structure: explore using a range-based list or something like an rbtree-based interval tree data structure (as mentioned by Steve above) to replace the current bitmap-based implementation.

Thanks,
Ashish
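For reference on item 1a), the GHCB specification's Page State Change request is roughly the following shape. The layout is paraphrased from the spec and the field names are illustrative (treat this as a sketch, not an authoritative definition); u64/u16/u32 are from <linux/types.h>:

/* One page transition: "operation" selects private <-> shared, etc. */
struct psc_entry {
        u64 cur_page   : 12,    /* sub-page progress for 2M entries */
            gfn        : 40,    /* guest frame number */
            operation  :  4,
            pagesize   :  1,    /* 0 = 4K, 1 = 2M */
            reserved   :  7;
};

struct psc_hdr {
        u16 cur_entry;
        u16 end_entry;
        u32 reserved;
};

struct snp_psc_desc {
        struct psc_hdr hdr;
        struct psc_entry entries[];     /* as many as fit the shared buffer */
};

Notably, this is already a batched, range-friendly format (one entry per page or 2M region), which is why it is a plausible template for a common SEV/SEV-ES encryption status hypercall.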
* Ashish Kalra (ashish.kalra@amd.com) wrote: > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote: > > Hello All, > > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > > > > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > > >> On Sun, Dec 06, 2020, Paolo Bonzini wrote: > > > >>> On 03/12/20 01:34, Sean Christopherson wrote: > > > >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote: > > > >>>>> From: Brijesh Singh <brijesh.singh@amd.com> > > > >>>>> > > > >>>>> KVM hypercall framework relies on alternative framework to patch the > > > >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > > >>>>> apply_alternative() is called then it defaults to VMCALL. The approach > > > >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > > >>>>> will be able to decode the instruction and do the right things. But > > > >>>>> when SEV is active, guest memory is encrypted with guest key and > > > >>>>> hypervisor will not be able to decode the instruction bytes. > > > >>>>> > > > >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > > >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor. > > > >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > > >>>> and opt into VMCALL? It's a synthetic feature flag either way, and I don't > > > >>>> think there are any existing KVM hypercalls that happen before alternatives are > > > >>>> patched, i.e. it'll be a nop for sane kernel builds. > > > >>>> > > > >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the > > > >>>> encryption behavior, but I'll take that up in the patches later in the series. > > > >>> Do you think that it's the guest that should "donate" memory for the bitmap > > > >>> instead? > > > >> No. Two things I'd like to explore: > > > >> > > > >> 1. Making the hypercall to announce/request private vs. shared common across > > > >> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > > > >> I'm concerned that we'll end up with multiple hypercalls that do more or > > > >> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > > > >> pipe dream, but I'd like to at least explore options before shoving in KVM- > > > >> only hypercalls. > > > >> > > > >> > > > >> 2. Tracking shared memory via a list of ranges instead of a using bitmap to > > > >> track all of guest memory. For most use cases, the vast majority of guest > > > >> memory will be private, most ranges will be 2mb+, and conversions between > > > >> private and shared will be uncommon events, i.e. the overhead to walk and > > > >> split/merge list entries is hopefully not a big concern. I suspect a list > > > >> would consume far less memory, hopefully without impacting performance. > > > > For a fancier data structure, I'd suggest an interval tree. Linux > > > > already has an rbtree-based interval tree implementation, which would > > > > likely work, and would probably assuage any performance concerns. > > > > > > > > Something like this would not be worth doing unless most of the shared > > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > > > > 60ish discontiguous shared regions. This is by no means a thorough > > > > search, but it's suggestive. 
If this is typical, then the bitmap would > > > > be far less efficient than most any interval-based data structure. > > > > > > > > You'd have to allow userspace to upper bound the number of intervals > > > > (similar to the maximum bitmap size), to prevent host OOMs due to > > > > malicious guests. There's something nice about the guest donating > > > > memory for this, since that would eliminate the OOM risk. > > > > > > > > > Tracking the list of ranges may not be bad idea, especially if we use > > > the some kind of rbtree-based data structure to update the ranges. It > > > will certainly be better than bitmap which grows based on the guest > > > memory size and as you guys see in the practice most of the pages will > > > be guest private. I am not sure if guest donating a memory will cover > > > all the cases, e.g what if we do a memory hotplug (increase the guest > > > ram from 2GB to 64GB), will donated memory range will be enough to store > > > the metadata. > > > > > >. > > > > With reference to internal discussions regarding the above, i am going > > to look into specific items as listed below : > > > > 1). "hypercall" related : > > a). Explore the SEV-SNP page change request structure (included in GHCB), > > see if there is something common there than can be re-used for SEV/SEV-ES > > page encryption status hypercalls. > > b). Explore if there is any common hypercall framework i can use in > > Linux/KVM. > > > > 2). related to the "backing" data structure - explore using a range-based > > list or something like rbtree-based interval tree data structure > > (as mentioned by Steve above) to replace the current bitmap based > > implementation. > > > > > > I do agree that a range-based list or an interval tree data structure is a > really good "logical" fit for the guest page encryption status tracking. > > We can only keep track of the guest unencrypted shared pages in the > range(s) list (which will keep the data structure quite compact) and all > the guest private/encrypted memory does not really need any tracking in > the list, anything not in the list will be encrypted/private. > > Also looking at a more "practical" use case, here is the current log of > page encryption status hypercalls when booting a linux guest : > > ... <snip> > [ 56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 1 > [ 56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 0 .... > [ 56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 0 > [ 56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 0 > [ 56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 1 > [ 56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 1 .... > [ 56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 0 > [ 56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 0 > [ 56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 1 > [ 56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 1 .... 
> [ 56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 0 > [ 56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 0 > [ 56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 1 > [ 56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 1 .... > [ 56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 0 > [ 56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 0 > [ 56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 1 > [ 56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 1 .... > [ 56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > [ 56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 1 > [ 56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 0 > [ 56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 .... > As can be observed here, all guest MMIO ranges are initially setup as > shared, and those are all contigious guest page ranges. > > After that the encryption status hypercalls are invoked when DMA gets > triggered during disk i/o while booting the guest ... here again the > guest page ranges are contigious, though mostly single page is touched > and a lot of page re-use is observed. > > So a range-based list/structure will be a "good" fit for such usage > scenarios. It seems surprisingly common to flick the same pages back and forth between encrypted and clear for quite a while; why is this? Dave > Thanks, > Ashish >
* Kalra, Ashish (Ashish.Kalra@amd.com) wrote: > Hello Dave, > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Ashish Kalra (ashish.kalra@amd.com) wrote: > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote: > Hello All, > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > On Sun, Dec 06, 2020, Paolo Bonzini wrote: > On 03/12/20 01:34, Sean Christopherson wrote: > On Tue, Dec 01, 2020, Ashish Kalra wrote: > From: Brijesh Singh <brijesh.singh@amd.com> > > KVM hypercall framework relies on alternative framework to patch the > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > apply_alternative() is called then it defaults to VMCALL. The approach > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > will be able to decode the instruction and do the right things. But > when SEV is active, guest memory is encrypted with guest key and > hypervisor will not be able to decode the instruction bytes. > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > will be used by the SEV guest to notify encrypted pages to the hypervisor. > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > think there are any existing KVM hypercalls that happen before alternatives are > patched, i.e. it'll be a nop for sane kernel builds. > > I'm also skeptical that a KVM specific hypercall is the right approach for the > encryption behavior, but I'll take that up in the patches later in the series. > Do you think that it's the guest that should "donate" memory for the bitmap > instead? > No. Two things I'd like to explore: > > 1. Making the hypercall to announce/request private vs. shared common across > hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > I'm concerned that we'll end up with multiple hypercalls that do more or > less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > pipe dream, but I'd like to at least explore options before shoving in KVM- > only hypercalls. > > > 2. Tracking shared memory via a list of ranges instead of a using bitmap to > track all of guest memory. For most use cases, the vast majority of guest > memory will be private, most ranges will be 2mb+, and conversions between > private and shared will be uncommon events, i.e. the overhead to walk and > split/merge list entries is hopefully not a big concern. I suspect a list > would consume far less memory, hopefully without impacting performance. > For a fancier data structure, I'd suggest an interval tree. Linux > already has an rbtree-based interval tree implementation, which would > likely work, and would probably assuage any performance concerns. > > Something like this would not be worth doing unless most of the shared > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > 60ish discontiguous shared regions. This is by no means a thorough > search, but it's suggestive. If this is typical, then the bitmap would > be far less efficient than most any interval-based data structure. > > You'd have to allow userspace to upper bound the number of intervals > (similar to the maximum bitmap size), to prevent host OOMs due to > malicious guests. 
There's something nice about the guest donating > memory for this, since that would eliminate the OOM risk. > > > Tracking the list of ranges may not be bad idea, especially if we use > the some kind of rbtree-based data structure to update the ranges. It > will certainly be better than bitmap which grows based on the guest > memory size and as you guys see in the practice most of the pages will > be guest private. I am not sure if guest donating a memory will cover > all the cases, e.g what if we do a memory hotplug (increase the guest > ram from 2GB to 64GB), will donated memory range will be enough to store > the metadata. > > . > > With reference to internal discussions regarding the above, i am going > to look into specific items as listed below : > > 1). "hypercall" related : > a). Explore the SEV-SNP page change request structure (included in GHCB), > see if there is something common there than can be re-used for SEV/SEV-ES > page encryption status hypercalls. > b). Explore if there is any common hypercall framework i can use in > Linux/KVM. > > 2). related to the "backing" data structure - explore using a range-based > list or something like rbtree-based interval tree data structure > (as mentioned by Steve above) to replace the current bitmap based > implementation. > > > > I do agree that a range-based list or an interval tree data structure is a > really good "logical" fit for the guest page encryption status tracking. > > We can only keep track of the guest unencrypted shared pages in the > range(s) list (which will keep the data structure quite compact) and all > the guest private/encrypted memory does not really need any tracking in > the list, anything not in the list will be encrypted/private. > > Also looking at a more "practical" use case, here is the current log of > page encryption status hypercalls when booting a linux guest : > > ... > > <snip> > > [ 56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 1 > [ 56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 0 > .... > > [ 56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 0 > [ 56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 0 > [ 56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 1 > [ 56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 1 > > .... > [ 56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 0 > [ 56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 0 > [ 56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 1 > [ 56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 1 > .... > > [ 56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 0 > [ 56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 0 > [ 56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 1 > [ 56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 1 > .... > > [ 56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 0 > [ 56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 0 > [ 56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 1 > [ 56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 1 > .... 
> [ 56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > [ 56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 1 > [ 56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 0 > [ 56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > .... > > As can be observed here, all guest MMIO ranges are initially setup as > shared, and those are all contigious guest page ranges. > > After that the encryption status hypercalls are invoked when DMA gets > triggered during disk i/o while booting the guest ... here again the > guest page ranges are contigious, though mostly single page is touched > and a lot of page re-use is observed. > > So a range-based list/structure will be a "good" fit for such usage > scenarios. > > It seems surprisingly common to flick the same pages back and forth between > encrypted and clear for quite a while; why is this? > > > dma_alloc_coherent()'s will allocate pages and then call > set_decrypted() on them and then at dma_free_coherent(), set_encrypted() > is called on the pages to be freed. So these observations in the logs > where a lot of single 4K pages are seeing C-bit transitions and > corresponding hypercalls are the ones associated with > dma_alloc_coherent(). It makes me wonder if it might be worth teaching it to hold onto those DMA pages somewhere until it needs them for something else and avoid the extra hypercalls; just something to think about. Dave > Thanks, > Ashish > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
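Ashish's explanation, restated as code: a sketch of the guest-side pattern that produces the churn in the log above. Under SEV, the DMA layer flips the C-bit on each coherent allocation and free, and every flip fires the page encryption status hypercall (error handling omitted; the helper name is illustrative):

#include <linux/dma-mapping.h>

static void sev_dma_roundtrip(struct device *dev)
{
        dma_addr_t dma;
        void *buf;

        /* Under SEV: set_memory_decrypted() on the pages -> enc = 0 */
        buf = dma_alloc_coherent(dev, PAGE_SIZE, &dma, GFP_KERNEL);
        if (!buf)
                return;

        /* ... program the device to DMA into buf ... */

        /* Under SEV: set_memory_encrypted() on free -> enc = 1 */
        dma_free_coherent(dev, PAGE_SIZE, buf, dma);
}

Dave's suggestion amounts to caching such buffers instead of freeing them immediately, much as the unencrypted atomic DMA pool already does for GFP_ATOMIC allocations, so the C-bit flips and hypercalls are amortized.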
On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote: > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote: > > Hello Dave, > > > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > > > * Ashish Kalra (ashish.kalra@amd.com) wrote: > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote: > > Hello All, > > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > On Sun, Dec 06, 2020, Paolo Bonzini wrote: > > On 03/12/20 01:34, Sean Christopherson wrote: > > On Tue, Dec 01, 2020, Ashish Kalra wrote: > > From: Brijesh Singh <brijesh.singh@amd.com> > > > > KVM hypercall framework relies on alternative framework to patch the > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > apply_alternative() is called then it defaults to VMCALL. The approach > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > will be able to decode the instruction and do the right things. But > > when SEV is active, guest memory is encrypted with guest key and > > hypervisor will not be able to decode the instruction bytes. > > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > will be used by the SEV guest to notify encrypted pages to the hypervisor. > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > and opt into VMCALL? It's a synthetic feature flag either way, and I don't > > think there are any existing KVM hypercalls that happen before alternatives are > > patched, i.e. it'll be a nop for sane kernel builds. > > > > I'm also skeptical that a KVM specific hypercall is the right approach for the > > encryption behavior, but I'll take that up in the patches later in the series. > > Do you think that it's the guest that should "donate" memory for the bitmap > > instead? > > No. Two things I'd like to explore: > > > > 1. Making the hypercall to announce/request private vs. shared common across > > hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > > I'm concerned that we'll end up with multiple hypercalls that do more or > > less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > > pipe dream, but I'd like to at least explore options before shoving in KVM- > > only hypercalls. > > > > > > 2. Tracking shared memory via a list of ranges instead of a using bitmap to > > track all of guest memory. For most use cases, the vast majority of guest > > memory will be private, most ranges will be 2mb+, and conversions between > > private and shared will be uncommon events, i.e. the overhead to walk and > > split/merge list entries is hopefully not a big concern. I suspect a list > > would consume far less memory, hopefully without impacting performance. > > For a fancier data structure, I'd suggest an interval tree. Linux > > already has an rbtree-based interval tree implementation, which would > > likely work, and would probably assuage any performance concerns. > > > > Something like this would not be worth doing unless most of the shared > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > > 60ish discontiguous shared regions. This is by no means a thorough > > search, but it's suggestive. If this is typical, then the bitmap would > > be far less efficient than most any interval-based data structure. 
> > > > You'd have to allow userspace to upper bound the number of intervals > > (similar to the maximum bitmap size), to prevent host OOMs due to > > malicious guests. There's something nice about the guest donating > > memory for this, since that would eliminate the OOM risk. > > > > > > Tracking the list of ranges may not be bad idea, especially if we use > > the some kind of rbtree-based data structure to update the ranges. It > > will certainly be better than bitmap which grows based on the guest > > memory size and as you guys see in the practice most of the pages will > > be guest private. I am not sure if guest donating a memory will cover > > all the cases, e.g what if we do a memory hotplug (increase the guest > > ram from 2GB to 64GB), will donated memory range will be enough to store > > the metadata. > > > > . > > > > With reference to internal discussions regarding the above, i am going > > to look into specific items as listed below : > > > > 1). "hypercall" related : > > a). Explore the SEV-SNP page change request structure (included in GHCB), > > see if there is something common there than can be re-used for SEV/SEV-ES > > page encryption status hypercalls. > > b). Explore if there is any common hypercall framework i can use in > > Linux/KVM. > > > > 2). related to the "backing" data structure - explore using a range-based > > list or something like rbtree-based interval tree data structure > > (as mentioned by Steve above) to replace the current bitmap based > > implementation. > > > > > > > > I do agree that a range-based list or an interval tree data structure is a > > really good "logical" fit for the guest page encryption status tracking. > > > > We can only keep track of the guest unencrypted shared pages in the > > range(s) list (which will keep the data structure quite compact) and all > > the guest private/encrypted memory does not really need any tracking in > > the list, anything not in the list will be encrypted/private. > > > > Also looking at a more "practical" use case, here is the current log of > > page encryption status hypercalls when booting a linux guest : > > > > ... > > > > <snip> > > > > [ 56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 1 > > [ 56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > > [ 56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > > [ 56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 0 > > .... > > > > [ 56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 0 > > [ 56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 0 > > [ 56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 1 > > [ 56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 1 > > > > .... > > [ 56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 0 > > [ 56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 0 > > [ 56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 1 > > [ 56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 1 > > .... > > > > [ 56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 0 > > [ 56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 0 > > [ 56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 1 > > [ 56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 1 > > .... 
> > > > [ 56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 0 > > [ 56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 0 > > [ 56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 1 > > [ 56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 1 > > .... > > [ 56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > > [ 56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > > [ 56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 1 > > [ 56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 0 > > [ 56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > > [ 56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > > .... > > > > As can be observed here, all guest MMIO ranges are initially setup as > > shared, and those are all contigious guest page ranges. > > > > After that the encryption status hypercalls are invoked when DMA gets > > triggered during disk i/o while booting the guest ... here again the > > guest page ranges are contigious, though mostly single page is touched > > and a lot of page re-use is observed. > > > > So a range-based list/structure will be a "good" fit for such usage > > scenarios. > > > > It seems surprisingly common to flick the same pages back and forth between > > encrypted and clear for quite a while; why is this? > > > > > > dma_alloc_coherent()'s will allocate pages and then call > > set_decrypted() on them and then at dma_free_coherent(), set_encrypted() > > is called on the pages to be freed. So these observations in the logs > > where a lot of single 4K pages are seeing C-bit transitions and > > corresponding hypercalls are the ones associated with > > dma_alloc_coherent(). > > It makes me wonder if it might be worth teaching it to hold onto those > DMA pages somewhere until it needs them for something else and avoid the > extra hypercalls; just something to think about. > > Dave

Following up on this discussion and looking at the hypercall logs and DMA usage scenarios on SEV, I have the following additional observations and comments:

It is mostly the guest MMIO regions, set up as un-encrypted by uefi/edk2 initially, that will be the "static" nodes in the backing data structure for page encryption status; these will be something like 15-20 nodes/entries. Drivers doing DMA allocations with GFP_ATOMIC will fetch DMA buffers from the pre-allocated unencrypted atomic pool, so that will be another "static" node added at kernel startup.

As we see from the logs, almost all runtime C-bit transitions and corresponding hypercalls come from DMA I/O, i.e., dma_alloc_coherent()/dma_free_coherent() calls. These mostly touch single 4K pages in fragmented ranges, so if we use an rbtree-based interval tree there will be a lot of tree insertions and deletions (each dma_alloc_coherent() followed by a dma_free_coherent()), leading to a lot of expensive tree rotations and re-balancing, compared to the much simpler and faster node insertions and deletions of a list-based structure representing the same interval ranges.

Also, as the static nodes in the structure will be quite limited (the DMA I/O ranges above are simply inserted and then removed again), a linked list lookup won't be too expensive compared to a tree lookup. In other words, this will effectively be a fixed-size list.

Looking at the above, I am now more inclined to use a list-based structure to represent the page encryption status.

Looking forward to any comments/feedback/thoughts on the above.

Thanks,
Ashish
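A minimal sketch of what such a list could look like, under the assumptions above: names are illustrative, locking is omitted, and only the coalescing insert for marking a range unencrypted is shown (removal with range splitting would be the dual operation):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/slab.h>

/* One unencrypted (shared) GFN range; anything not listed is private. */
struct shared_region {
        struct list_head list;
        unsigned long gfn_start, gfn_end;       /* inclusive */
};

static LIST_HEAD(shared_regions);

/* Mark [start, end] shared, merging with an adjacent/overlapping node. */
static int shared_region_add(unsigned long start, unsigned long end)
{
        struct shared_region *r;

        list_for_each_entry(r, &shared_regions, list) {
                if (start <= r->gfn_end + 1 && end + 1 >= r->gfn_start) {
                        r->gfn_start = min(r->gfn_start, start);
                        r->gfn_end = max(r->gfn_end, end);
                        return 0;
                }
        }
        r = kzalloc(sizeof(*r), GFP_KERNEL);
        if (!r)
                return -ENOMEM;
        r->gfn_start = start;
        r->gfn_end = end;
        list_add_tail(&r->list, &shared_regions);
        return 0;
}

Steve's earlier point about upper-bounding the structure against malicious guests then reduces to a simple node-count check in the allocation path.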
Avoiding an rbtree for such a small (but unstable) list seems correct. For the unencrypted region list strategy, the only questions that I have are fairly secondary:

- How should the kernel upper bound the size of the list in the face of malicious guests, but still support large guests? (Something similar to the size provided in the bitmap API would work.)
- What serialization format should be used for the ioctl API? (Usermode could send down a pointer to a user region and a size. The kernel could then populate that with an array of structs containing bases and limits for unencrypted regions.)
- How will the kernel tag a guest as having exceeded its maximum list size, in order to indicate that the list is now incomplete? (Track a poison bit, and send it up when getting the serialized list of regions.)

In my view, there are two main competitors to this strategy:

- (Existing) Bitmap API
- A guest memory donation based model

The existing bitmap API avoids any issues with growing too large, since its size is predictable.

To elaborate on the memory donation based model, the guest could put an encryption status data structure into unencrypted guest memory, and then use a hypercall to inform the host where the base of that structure is located. The main advantage of this is that it sidesteps any issues around malicious guests causing large allocations.

The unencrypted region list seems very practical. Its biggest advantage over the bitmap is how cheap it will be to pass the structure up from the kernel. A memory donation based model could achieve similar performance, but with some additional complexity.

Does anyone view the memory donation model as worth the complexity? Does anyone think the simplicity of the bitmap is a better tradeoff compared to an unencrypted region list? Or have other ideas that are not mentioned here?

On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote: > > On Fri, Dec 18, 2020 at 07:56:41PM +0000, Dr. David Alan Gilbert wrote: > > * Kalra, Ashish (Ashish.Kalra@amd.com) wrote: > > > Hello Dave, > > > > > > On Dec 18, 2020, at 1:40 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > > > > > * Ashish Kalra (ashish.kalra@amd.com) wrote: > > > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote: > > > Hello All, > > > > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > > > > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson <seanjc@google.com> wrote: > > > On Sun, Dec 06, 2020, Paolo Bonzini wrote: > > > On 03/12/20 01:34, Sean Christopherson wrote: > > > On Tue, Dec 01, 2020, Ashish Kalra wrote: > > > From: Brijesh Singh <brijesh.singh@amd.com> > > > > > > KVM hypercall framework relies on alternative framework to patch the > > > VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > > apply_alternative() is called then it defaults to VMCALL. The approach > > > works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > > will be able to decode the instruction and do the right things. But > > > when SEV is active, guest memory is encrypted with guest key and > > > hypervisor will not be able to decode the instruction bytes. > > > > > > Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > > will be used by the SEV guest to notify encrypted pages to the hypervisor. > > > What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > > and opt into VMCALL?
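For what it's worth, the array-of-structs serialization described above could
look roughly like the following uAPI sketch; all names and the poison-flag
encoding are hypothetical:

	/*
	 * Hypothetical uAPI sketch for fetching the unencrypted region
	 * list: userspace supplies a buffer and its capacity, the kernel
	 * fills in base/limit pairs and flags an overflowed (poisoned)
	 * list. Not from the actual series.
	 */
	#include <linux/types.h>

	struct kvm_unenc_region {
		__u64 base_gfn;		/* first guest frame of the region */
		__u64 npages;		/* region size in 4K pages */
	};

	struct kvm_unenc_region_list {
		__u32 nr_regions;	/* in: capacity of @regions; out: entries used */
		__u32 flags;		/* out: e.g. KVM_UNENC_LIST_POISONED */
		__u64 regions;		/* userspace pointer to kvm_unenc_region[] */
	};

	/* Set when the guest exceeded the maximum list size. */
	#define KVM_UNENC_LIST_POISONED	(1u << 0)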
Hello Steve,

My thoughts here ...

On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> Avoiding an rbtree for such a small (but unstable) list seems correct.

I agree.

> For the unencrypted region list strategy, the only questions that I
> have are fairly secondary.
> - How should the kernel upper bound the size of the list in the face
> of malicious guests, but still support large guests? (Something
> similar to the size provided in the bitmap API would work.)

Please note that in our current implementation, we don't do any bitmap resize
based on the guest page encryption status hypercalls; the maximum bitmap size
is computed dynamically by walking the KVM memslots and finding the maximum
guest physical address size.

So, malicious guests cannot do large bitmap allocations using the hypercalls.

> - What serialization format should be used for the ioctl API?
> (Usermode could send down a pointer to a user region and a size. The
> kernel could then populate that with an array of structs containing
> bases and limits for unencrypted regions.)
> - How will the kernel tag a guest as having exceeded its maximum list
> size, in order to indicate that the list is now incomplete? (Track a
> poison bit, and send it up when getting the serialized list of
> regions.)

With reference to the serialization concerns with active live migration and
simultaneous page encryption bitmap updates, please note that in our current
VMM implementation, after each memory region migration cycle we pause live
migration, re-sync the page encryption bitmap, XOR it against the last synced
bitmap, and then re-transfer any modified pages accordingly (see the sketch
after this message).

I have a prototype implementation for this in Qemu, which seems to work fine.

> In my view, there are two main competitors to this strategy:
> - (Existing) Bitmap API
> - A guest memory donation based model
>
> The existing bitmap API avoids any issues with growing too large,
> since its size is predictable.

Yes, as I mentioned above, its size is predictable and cannot grow too large.

> To elaborate on the memory donation based model, the guest could put
> an encryption status data structure into unencrypted guest memory, and
> then use a hypercall to inform the host where the base of that
> structure is located. The main advantage of this is that it sidesteps
> any issues around malicious guests causing large allocations.
>
> The unencrypted region list seems very practical. Its biggest
> advantage over the bitmap is how cheap it will be to pass the
> structure up from the kernel. A memory donation based model could
> achieve similar performance, but with some additional complexity.
>
> Does anyone view the memory donation model as worth the complexity?
> Does anyone think the simplicity of the bitmap is a better tradeoff
> compared to an unencrypted region list?

One advantage in sticking with the bitmap is that it maps very nicely to the
dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh designed and
implemented it is very similar to dirty page bitmap tracking and syncing
between KVM and Qemu. The same logic is re-used for the page encryption
bitmap, which means quite minimal changes and code reuse in Qemu.

Any changes to the backing data structure will require additional mapping
logic to be added to Qemu.

This is one advantage in keeping the bitmap logic.

Thanks,
Ashish

> Or have other ideas that are not mentioned here?
>
> On Wed, Jan 6, 2021 at 3:06 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
> > [snip]
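A rough sketch of the XOR-based re-sync step described above, on the VMM
side; the retransfer helper is hypothetical, and this is not the actual Qemu
prototype:

	#include <stddef.h>
	#include <string.h>

	/* Hypothetical VMM helper: queue guest page @gfn for re-transfer. */
	void queue_page_for_retransfer(unsigned long gfn);

	/*
	 * XOR the freshly synced encryption bitmap against the last synced
	 * copy; every set bit in the difference is a page whose encryption
	 * status flipped during the last migration cycle and must be sent
	 * to the destination again.
	 */
	static void resync_enc_bitmap(const unsigned long *cur,
				      unsigned long *last, size_t nwords)
	{
		const size_t bits_per_word = 8 * sizeof(unsigned long);

		for (size_t w = 0; w < nwords; w++) {
			unsigned long diff = cur[w] ^ last[w];

			while (diff) {
				unsigned long bit = __builtin_ctzl(diff);

				queue_page_for_retransfer(w * bits_per_word + bit);
				diff &= diff - 1;	/* clear the lowest set bit */
			}
		}
		memcpy(last, cur, nwords * sizeof(unsigned long));
	}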
Hello Steve,

Sorry, I realized later that I replied to this email with regard to the
current bitmap implementation and not the unencrypted region list strategy.

I am now looking at your thoughts/questions with regard to the unencrypted
region list strategy and will reply to them accordingly.

Thanks,
Ashish

On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> [snip]
Hello Steve,

On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> Avoiding an rbtree for such a small (but unstable) list seems correct.
>
> For the unencrypted region list strategy, the only questions that I
> have are fairly secondary.
> - How should the kernel upper bound the size of the list in the face
> of malicious guests, but still support large guests? (Something
> similar to the size provided in the bitmap API would work.)

I am thinking of another scenario, where a malicious guest can make
infinite/repetitive hypercalls and DoS attack the host.

But probably this is a more generic issue; this can be done by any guest and
under any hypervisor. I don't know what kind of mitigations exist for such a
scenario?

Potentially, the guest memory donation model can handle such an attack,
because in this model the hypervisor will expect only one hypercall; any
repetitive hypercalls could make the hypervisor disable the guest? (See the
sketch following this message.)

Thanks,
Ashish

> - What serialization format should be used for the ioctl API?
> [snip]
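For comparison, a guest-side sketch of the one-shot donation model could look
like this; the hypercall number and descriptor layout are invented purely for
illustration, with kvm_sev_hypercall3() standing in for whatever hypercall
helper is ultimately used:

	/*
	 * Illustrative guest-side sketch of the memory donation model: the
	 * guest places the encryption status structure in decrypted
	 * (shared) memory and registers its base with the host exactly
	 * once. KVM_HC_ENC_STATUS_BASE and enc_status_desc are made up.
	 */
	#include <linux/gfp.h>
	#include <linux/types.h>
	#include <asm/kvm_para.h>
	#include <asm/page.h>
	#include <asm/set_memory.h>

	#define KVM_HC_ENC_STATUS_BASE	0xdead	/* hypothetical number */

	struct enc_status_desc {
		u64 nr_entries;
		struct { u64 gfn; u64 npages; } entries[];
	};

	static long donate_enc_status_page(void)
	{
		struct enc_status_desc *desc;

		desc = (void *)get_zeroed_page(GFP_KERNEL);
		if (!desc)
			return -ENOMEM;

		/* The host must be able to read it, so mark it shared. */
		set_memory_decrypted((unsigned long)desc, 1);

		/*
		 * One-time registration; the host could treat any repeated
		 * registration attempt as malicious and kill the guest.
		 */
		return kvm_sev_hypercall3(KVM_HC_ENC_STATUS_BASE,
					  __pa(desc), PAGE_SIZE, 0);
	}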
On Thu, Jan 07, 2021, Ashish Kalra wrote:
> On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > - How should the kernel upper bound the size of the list in the face
> > of malicious guests, but still support large guests?
>
> I am thinking of another scenario, where a malicious guest can make
> infinite/repetitive hypercalls and DoS attack the host.
>
> But probably this is a more generic issue; this can be done by any guest
> and under any hypervisor. I don't know what kind of mitigations exist
> for such a scenario?
>
> Potentially, the guest memory donation model can handle such an attack,
> because in this model the hypervisor will expect only one hypercall; any
> repetitive hypercalls could make the hypervisor disable the guest?

KVM doesn't need to explicitly bound its tracking structures, it just needs
to use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so
that the memory will be accounted to the task/process/VM. Shadow MMU pages
are the only exception that comes to mind; they're still accounted properly,
but KVM also explicitly limits them for a variety of reasons.

The size of the list will naturally be bounded by the size of the guest; and
assuming KVM proactively merges adjacent regions, that upper bound is
probably reasonably low compared to other allocations, e.g. the
aforementioned MMU pages.

And, using a list means a malicious guest will get automatically throttled,
as the latency of walking the list (to merge/delete existing entries) will
increase with the size of the list.
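Concretely, the accounting Sean describes is a one-flag change to the earlier
hypothetical list sketch:

	/*
	 * GFP_KERNEL_ACCOUNT charges the allocation to the allocating
	 * task's memory cgroup, i.e. to the VM itself, so a guest that
	 * churns encryption status only burns its own memory budget.
	 * struct enc_region is the hypothetical node sketched earlier.
	 */
	static int enc_region_add_accounted(struct list_head *head,
					    u64 start_gfn,
					    unsigned long npages)
	{
		struct enc_region *r = kzalloc(sizeof(*r), GFP_KERNEL_ACCOUNT);

		if (!r)
			return -ENOMEM;
		r->start_gfn = start_gfn;
		r->npages = npages;
		list_add_tail(&r->node, head);
		return 0;
	}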
On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> KVM doesn't need to explicitly bound its tracking structures, it just needs
> to use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures
> so that the memory will be accounted to the task/process/VM.
> [snip]
>
> And, using a list means a malicious guest will get automatically throttled,
> as the latency of walking the list (to merge/delete existing entries) will
> increase with the size of the list.

Just to add here, potentially there won't be any proactive merging/deletion
of existing entries, as the only static entries will be the initial guest
MMIO regions, which are contiguous guest PA ranges but not necessarily
adjacent.

After that, as discussed before, almost all entries will be due to DMA I/O
via dma_alloc_coherent/dma_free_coherent, and all these entries will be
temporary, as these DMA buffers are marked un-encrypted and then marked
encrypted again as soon as the DMA I/O is completed. So it makes no sense to
merge temporary entries that will be deleted from the list immediately after
being added to it.

Thanks,
Ashish
On Thu, Jan 07, 2021, Ashish Kalra wrote:
> On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> > On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > > Hello Steve,
> > >
> > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > >
> > > > For the unencrypted region list strategy, the only questions that I
> > > > have are fairly secondary.
> > > > - How should the kernel upper bound the size of the list in the face
> > > >   of malicious guests, but still support large guests? (Something
> > > >   similar to the size provided in the bitmap API would work.)
> > >
> > > I am thinking of another scenario, where a malicious guest can make
> > > infinite/repetitive hypercalls and mount a DoS attack on the host.
> > >
> > > But this is probably a more generic issue; it can be done by any guest
> > > and under any hypervisor. I don't know what kind of mitigations exist
> > > for such a scenario.
> > >
> > > Potentially, the guest memory donation model can handle such an attack:
> > > in this model the hypervisor expects only one hypercall, so any
> > > repetitive hypercalls could make the hypervisor disable the guest?
> >
> > KVM doesn't need to explicitly bound its tracking structures; it just needs to
> > use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
> > the memory is accounted to the task/process/VM. Shadow MMU pages are the only
> > exception that comes to mind; they're still accounted properly, but KVM also
> > explicitly limits them for a variety of reasons.
> >
> > The size of the list will naturally be bounded by the size of the guest, and
> > assuming KVM proactively merges adjacent regions, that upper bound is probably
> > reasonably low compared to other allocations, e.g. the aforementioned MMU pages.
> >
> > And using a list means a malicious guest will get automatically throttled, as
> > the latency of walking the list (to merge/delete existing entries) will
> > increase with the size of the list.
>
> Just to add here: potentially there won't be any proactive
> merging/deletion of existing entries, as the only static entries will be
> the initial guest MMIO regions, which are contiguous guest PA ranges but
> not necessarily adjacent.

My point was that, if the guest is malicious, eventually there will be
adjacent entries, e.g. the worst-case scenario is that the encrypted status
changes on every 4k page. Anyway, it's not really all that important; I was
mostly thinking out loud. :-)
> On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> > Hello Steve,
> >
> > My thoughts here ...
> >
> > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> >
> > I agree.
> >
> > > For the unencrypted region list strategy, the only questions that I
> > > have are fairly secondary.
> > > - How should the kernel upper bound the size of the list in the face
> > >   of malicious guests, but still support large guests? (Something
> > >   similar to the size provided in the bitmap API would work.)
> > > - What serialization format should be used for the ioctl API?
> > >   (Usermode could send down a pointer to a user region and a size. The
> > >   kernel could then populate that with an array of structs containing
> > >   bases and limits for unencrypted regions.)
> > > - How will the kernel tag a guest as having exceeded its maximum list
> > >   size, in order to indicate that the list is now incomplete? (Track a
> > >   poison bit, and send it up when getting the serialized list of
> > >   regions.)
> > >
> > > In my view, there are two main competitors to this strategy:
> > > - (Existing) bitmap API
> > > - A guest memory donation based model
> > >
> > > The existing bitmap API avoids any issues with growing too large,
> > > since its size is predictable.
> > >
> > > To elaborate on the memory donation based model, the guest could put
> > > an encryption status data structure into unencrypted guest memory, and
> > > then use a hypercall to inform the host where the base of that
> > > structure is located. The main advantage of this is that it sidesteps
> > > any issues around malicious guests causing large allocations.
> > >
> > > The unencrypted region list seems very practical. Its biggest
> > > advantage over the bitmap is how cheap it will be to pass the
> > > structure up from the kernel. A memory donation based model could
> > > achieve similar performance, but with some additional complexity.
> > >
> > > Does anyone view the memory donation model as worth the complexity?
> > > Does anyone think the simplicity of the bitmap is a better tradeoff
> > > compared to an unencrypted region list?
> >
> > One advantage in sticking with the bitmap is that it maps very nicely to
> > the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
> > designed and implemented it is very similar to dirty page bitmap tracking
> > and syncing between KVM and Qemu. The same logic is reused for the page
> > encryption bitmap, which means quite minimal changes and code reuse in
> > Qemu.
> >
> > Any changes to the backing data structure will require additional
> > mapping logic to be added to Qemu.
> >
> > This is one advantage in keeping the bitmap logic.

So if nobody is in favor of keeping the (current) bitmap logic, we will move
to the unencrypted region list approach.

Thanks,
Ashish
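For reference, a minimal sketch of the ioctl serialization Steve outlines
(userspace passes a buffer and a capacity, the kernel fills in base/limit
pairs and a poison flag). Every name and flag value here is a hypothetical
illustration, not the series' actual UAPI:

#include <linux/types.h>

/* Hypothetical: one unencrypted guest-physical range. */
struct kvm_unenc_region {
	__u64 gpa_base;
	__u64 gpa_limit;	/* exclusive upper bound */
};

/* Hypothetical: set if the in-kernel list overflowed and is incomplete. */
#define KVM_UNENC_LIST_INCOMPLETE	(1ULL << 0)

struct kvm_unenc_region_list {
	__u64 flags;		/* out: e.g. KVM_UNENC_LIST_INCOMPLETE */
	__u32 nent;		/* in: capacity; out: entries written */
	__u32 padding;
	struct kvm_unenc_region regions[];	/* out: sorted by gpa_base */
};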
Supporting merging of consecutive entries (or not) is less important to get
right, since it doesn't change any of the APIs. If someone runs into
performance issues, they can loop back and fix this then. I'm slightly
concerned about the behavior for overlapping regions. I also have slight
concerns about how we handle re-encrypting small chunks of larger
unencrypted regions. I don't think we've seen these in practice, but nothing
precludes them AFAIK.

On Thu, Jan 7, 2021 at 11:23 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > On Thu, Jan 07, 2021 at 09:26:25AM -0800, Sean Christopherson wrote:
> > > On Thu, Jan 07, 2021, Ashish Kalra wrote:
> > > > Hello Steve,
> > > >
> > > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > > > >
> > > > > For the unencrypted region list strategy, the only questions that I
> > > > > have are fairly secondary.
> > > > > - How should the kernel upper bound the size of the list in the face
> > > > >   of malicious guests, but still support large guests? (Something
> > > > >   similar to the size provided in the bitmap API would work.)
> > > >
> > > > I am thinking of another scenario, where a malicious guest can make
> > > > infinite/repetitive hypercalls and mount a DoS attack on the host.
> > > >
> > > > But this is probably a more generic issue; it can be done by any guest
> > > > and under any hypervisor. I don't know what kind of mitigations exist
> > > > for such a scenario.
> > > >
> > > > Potentially, the guest memory donation model can handle such an attack:
> > > > in this model the hypervisor expects only one hypercall, so any
> > > > repetitive hypercalls could make the hypervisor disable the guest?
> > >
> > > KVM doesn't need to explicitly bound its tracking structures; it just needs to
> > > use GFP_KERNEL_ACCOUNT when allocating kernel memory for the structures so that
> > > the memory is accounted to the task/process/VM. Shadow MMU pages are the only
> > > exception that comes to mind; they're still accounted properly, but KVM also
> > > explicitly limits them for a variety of reasons.
> > >
> > > The size of the list will naturally be bounded by the size of the guest, and
> > > assuming KVM proactively merges adjacent regions, that upper bound is probably
> > > reasonably low compared to other allocations, e.g. the aforementioned MMU pages.
> > >
> > > And using a list means a malicious guest will get automatically throttled, as
> > > the latency of walking the list (to merge/delete existing entries) will
> > > increase with the size of the list.
> >
> > Just to add here: potentially there won't be any proactive
> > merging/deletion of existing entries, as the only static entries will be
> > the initial guest MMIO regions, which are contiguous guest PA ranges but
> > not necessarily adjacent.
>
> My point was that, if the guest is malicious, eventually there will be adjacent
> entries, e.g. the worst-case scenario is that the encrypted status changes on
> every 4k page. Anyway, it's not really all that important; I was mostly
> thinking out loud. :-)

Agreed. Tagging this with GFP_KERNEL_ACCOUNT means we don't need to upper
bound the number of pages. I now don't think there is any unusual DoS
potential here. Perhaps, if the guest tries really hard to make a massive
list, it could cause a softlockup on the host. Not sure how important that
is to fix.
On Thu, Jan 7, 2021 at 4:48 PM Ashish Kalra <ashish.kalra@amd.com> wrote:
>
> > On Thu, Jan 07, 2021 at 01:34:14AM +0000, Ashish Kalra wrote:
> > > Hello Steve,
> > >
> > > My thoughts here ...
> > >
> > > On Wed, Jan 06, 2021 at 05:01:33PM -0800, Steve Rutherford wrote:
> > > > Avoiding an rbtree for such a small (but unstable) list seems correct.
> > >
> > > I agree.
> > >
> > > > For the unencrypted region list strategy, the only questions that I
> > > > have are fairly secondary.
> > > > - How should the kernel upper bound the size of the list in the face
> > > >   of malicious guests, but still support large guests? (Something
> > > >   similar to the size provided in the bitmap API would work.)
> > > > - What serialization format should be used for the ioctl API?
> > > >   (Usermode could send down a pointer to a user region and a size. The
> > > >   kernel could then populate that with an array of structs containing
> > > >   bases and limits for unencrypted regions.)
> > > > - How will the kernel tag a guest as having exceeded its maximum list
> > > >   size, in order to indicate that the list is now incomplete? (Track a
> > > >   poison bit, and send it up when getting the serialized list of
> > > >   regions.)
> > > >
> > > > In my view, there are two main competitors to this strategy:
> > > > - (Existing) bitmap API
> > > > - A guest memory donation based model
> > > >
> > > > The existing bitmap API avoids any issues with growing too large,
> > > > since its size is predictable.
> > > >
> > > > To elaborate on the memory donation based model, the guest could put
> > > > an encryption status data structure into unencrypted guest memory, and
> > > > then use a hypercall to inform the host where the base of that
> > > > structure is located. The main advantage of this is that it sidesteps
> > > > any issues around malicious guests causing large allocations.
> > > >
> > > > The unencrypted region list seems very practical. Its biggest
> > > > advantage over the bitmap is how cheap it will be to pass the
> > > > structure up from the kernel. A memory donation based model could
> > > > achieve similar performance, but with some additional complexity.
> > > >
> > > > Does anyone view the memory donation model as worth the complexity?
> > > > Does anyone think the simplicity of the bitmap is a better tradeoff
> > > > compared to an unencrypted region list?
> > >
> > > One advantage in sticking with the bitmap is that it maps very nicely to
> > > the dirty bitmap page tracking logic in KVM/Qemu. The way Brijesh
> > > designed and implemented it is very similar to dirty page bitmap tracking
> > > and syncing between KVM and Qemu. The same logic is reused for the page
> > > encryption bitmap, which means quite minimal changes and code reuse in
> > > Qemu.
> > >
> > > Any changes to the backing data structure will require additional
> > > mapping logic to be added to Qemu.
> > >
> > > This is one advantage in keeping the bitmap logic.
>
> So if nobody is in favor of keeping the (current) bitmap logic, we will
> move to the unencrypted region list approach.

Sounds good to me.

> Thanks,
> Ashish
On Thu, Jan 07, 2021, Steve Rutherford wrote:
> Supporting merging of consecutive entries (or not) is less important to get
> right, since it doesn't change any of the APIs. If someone runs into
> performance issues, they can loop back and fix this then. I'm slightly
> concerned about the behavior for overlapping regions.

I'm assuming merging entries will be a near-trivial effort once everything
else is implemented, e.g. KVM will need to traverse the list to remove
entries when addresses are converted back to private. Piling on merging
functionality at that point should not be all that hard, especially if the
list is sorted and entries are merged on insertion.
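A rough sketch of the merge-on-insertion Sean describes, continuing the
hypothetical struct enc_region/enc_region_alloc() sketch earlier; this is an
illustration, not the series' actual implementation. The list stays sorted,
and overlapping or adjacent ranges collapse into one entry, so a guest
toggling encryption status page-by-page cannot grow the list without bound:

static int enc_region_insert(struct list_head *head, gfn_t start, gfn_t end)
{
	struct enc_region *r, *tmp, *new;

	list_for_each_entry_safe(r, tmp, head, list) {
		if (r->end + 1 < start)
			continue;	/* entirely before the new range */
		if (end + 1 < r->start)
			break;		/* entirely after; insert before it */

		/* Overlapping or adjacent: absorb the existing entry. */
		start = min(start, r->start);
		end = max(end, r->end);
		list_del(&r->list);
		kfree(r);
	}

	new = enc_region_alloc(start, end);
	if (!new)
		return -ENOMEM;

	/*
	 * &r->list is 'head' if the walk fell off the end, otherwise the
	 * first entry beyond the new range; either way, inserting before
	 * it keeps the list sorted.
	 */
	list_add_tail(&new->list, &r->list);
	return 0;
}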
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 338119852512..bc1b11d057fc 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -85,6 +85,18 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 	return ret;
 }
 
+static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
+				      unsigned long p2, unsigned long p3)
+{
+	long ret;
+
+	asm volatile("vmmcall"
+		     : "=a"(ret)
+		     : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
+		     : "memory");
+	return ret;
+}
+
 #ifdef CONFIG_KVM_GUEST
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
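For context, a hypothetical call site for the helper above: the guest would
use it to report page encryption status changes to the hypervisor. The
hypercall number and argument layout here are illustrative assumptions, not
the ABI the series itself defines:

#include <linux/types.h>
#include <asm/kvm_para.h>

/* Hypothetical hypercall number, for illustration only. */
#define KVM_HC_PAGE_ENC_STATUS	12

static void notify_page_enc_status(unsigned long gpa, unsigned long npages,
				   bool enc)
{
	/*
	 * Safe to call before alternatives are patched: VMMCALL is
	 * hardcoded, so no #UD-and-emulate is needed on AMD, which an
	 * SEV hypervisor could not perform anyway.
	 */
	kvm_sev_hypercall3(KVM_HC_PAGE_ENC_STATUS, gpa, npages, enc);
}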