Message ID: 2f1de1b7b6512280fae4ac05e77ced80a585971b.1712785629.git.isaku.yamahata@intel.com
State: New, archived
Series: KVM: Guest Memory Pre-Population API
I wouldn't call myself much of an expert on nested, but...

On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> There are several options to populate L1 GPA irrespective of vCPU mode.
> - Switch vCPU MMU only: This patch.
>   Pros: Concise implementation.
>   Cons: Heavily dependent on the KVM MMU implementation.

Is switching just the MMU enough here? Won't the MTRRs and other vCPU bits be
wrong?

> - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
>   Use __get/set_sregs2() to switch to/from SMM mode.
>   Pros: Straightforward.
>   Cons: This may cause unintended side effects.

The cons make sense.

> - Refactor the KVM page fault handler not to pass vCPU. Pass around the
>   necessary parameters and struct kvm.
>   Pros: The end result will clearly have no side effects.
>   Cons: This will require a big refactoring.

But doesn't the fault handler need the vCPU state?

> - Return an error on guest mode or SMM mode: Without this patch.
>   Pros: No additional patch.
>   Cons: Difficult to use.

Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
there shouldn't be an issue. If so, maybe this last one is not so horrible.
On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> @@ -5882,18 +5884,40 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
>  	if (!tdp_enabled)
>  		return -EOPNOTSUPP;
>  
> +	/* Force the use of L1 GPA regardless of vCPU MMU mode. */
> +	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +	if (is_smm ||
> +	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +		vcpu->arch.hflags &= ~HF_SMM_MASK;
> +		mmu = vcpu->arch.mmu;
> +		walk_mmu = vcpu->arch.walk_mmu;
> +		vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +		kvm_mmu_reset_context(vcpu);
> +	}
> +
>  	/* reload is optimized for repeated call. */

After the kvm_mmu_reset_context(), what benefit is there to the operation? And
why does it happen for every call of kvm_arch_vcpu_map_memory()?

>  	kvm_mmu_reload(vcpu);
On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> I wouldn't call myself much of an expert on nested, but...
>
> On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> > There are several options to populate L1 GPA irrespective of vCPU mode.
> > - Switch vCPU MMU only: This patch.
> >   Pros: Concise implementation.
> >   Cons: Heavily dependent on the KVM MMU implementation.

Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

> Is switching just the MMU enough here? Won't the MTRRs and other vCPU bits be
> wrong?
>
> > - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
> >   Use __get/set_sregs2() to switch to/from SMM mode.
> >   Pros: Straightforward.
> >   Cons: This may cause unintended side effects.
>
> The cons make sense.
>
> > - Refactor the KVM page fault handler not to pass vCPU. Pass around the
> >   necessary parameters and struct kvm.
> >   Pros: The end result will clearly have no side effects.
> >   Cons: This will require a big refactoring.
>
> But doesn't the fault handler need the vCPU state?

Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There are
only six possible roots if TDP is enabled:

  1. 4-level !SMM !guest_mode
  2. 4-level  SMM !guest_mode
  3. 5-level !SMM !guest_mode
  4. 5-level  SMM !guest_mode
  5. 4-level !SMM  guest_mode
  6. 5-level !SMM  guest_mode

4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU eliminates
the SMM and guest_mode issues. If there is per-vCPU state that makes its way
into the TDP page tables, then we have problems, because it means that there is
per-vCPU state in per-VM structures that isn't accounted for.

There are a few edge cases where KVM treads carefully, e.g. if the fault is to
the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
per-vCPU state.

That said, I think this option is effectively 1b, because dropping the SMM vs.
guest_mode state has the same uAPI problems as forcibly swapping the MMU; it's
just a different way of doing so.

The first question to answer is: do we want to return an error, or "silently"
install mappings for !SMM, !guest_mode? And so this option becomes relevant only
_if_ we want to unconditionally install mappings for the "base" mode.

> > - Return an error on guest mode or SMM mode: Without this patch.
> >   Pros: No additional patch.
> >   Cons: Difficult to use.
>
> Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> there shouldn't be an issue. If so, maybe this last one is not so horrible.

And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
guest_mode) basically invalidates the argument that returning an error makes the
ioctl() hard to use. I can imagine it might be hard to squeeze this ioctl() into
QEMU's existing code, but I don't buy that the ioctl() itself is hard to use.

Literally the only thing userspace needs to do is set CPUID to implicitly select
between 4-level and 5-level paging. If userspace wants to pre-map memory during
live migration, or when jump-starting the guest with pre-defined state, simply
pre-map memory before stuffing guest state. In and of itself, that doesn't seem
difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

I would describe the overall cons for this patch versus returning an error
differently. Switching MMU state puts the complexity in the kernel. Returning
an error punts any complexity to userspace. Specifically, anything that KVM can
do regarding vCPU state to get the right MMU, userspace can do too.

Add on that silently doing things that effectively ignore guest state usually
ends badly, and I don't see a good argument for this patch (or any variant
thereof).
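[Editor's note: to make the "pre-map before stuffing guest state" flow concrete,
here is a hedged userspace sketch. It assumes the uAPI proposed in this series:
a KVM_MAP_MEMORY vCPU ioctl taking a struct kvm_memory_mapping whose
base_address/size fields the kernel advances on partial progress. The ioctl
name, struct layout, and errno behavior are the series' proposal and may change.]

```c
#include <linux/kvm.h>	/* assumed to provide KVM_MAP_MEMORY and */
#include <sys/ioctl.h>	/* struct kvm_memory_mapping per this series */
#include <errno.h>

/*
 * Sketch: pre-populate a GPA range on one vCPU before any guest state is
 * stuffed, e.g. from a VMM's vCPU setup path. The kernel updates
 * mapping.base_address/mapping.size as it makes progress, so an interrupted
 * call can simply be retried with the same struct.
 */
static int premap_range(int vcpu_fd, __u64 gpa, __u64 len)
{
	struct kvm_memory_mapping mapping = {
		.base_address = gpa,
		.size = len,
	};

	while (mapping.size) {
		if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping) < 0) {
			if (errno == EINTR || errno == EAGAIN)
				continue;	/* partial progress; retry the rest */
			return -errno;		/* e.g. EOPNOTSUPP, EFAULT */
		}
	}
	return 0;
}
```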
On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > But doesn't the fault handler need the vCPU state?
>
> Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There are
> only six possible roots if TDP is enabled:
>
>   1. 4-level !SMM !guest_mode
>   2. 4-level  SMM !guest_mode
>   3. 5-level !SMM !guest_mode
>   4. 5-level  SMM !guest_mode
>   5. 4-level !SMM  guest_mode
>   6. 5-level !SMM  guest_mode
>
> 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> eliminates the SMM and guest_mode issues. If there is per-vCPU state that
> makes its way into the TDP page tables, then we have problems, because it
> means that there is per-vCPU state in per-VM structures that isn't
> accounted for.
>
> There are a few edge cases where KVM treads carefully, e.g. if the fault is
> to the vCPU's APIC-access page, but KVM manually handles those to avoid
> consuming per-vCPU state.
>
> That said, I think this option is effectively 1b, because dropping the SMM
> vs. guest_mode state has the same uAPI problems as forcibly swapping the
> MMU; it's just a different way of doing so.
>
> The first question to answer is: do we want to return an error, or
> "silently" install mappings for !SMM, !guest_mode? And so this option
> becomes relevant only _if_ we want to unconditionally install mappings for
> the "base" mode.

Ah, I thought there was some logic around CR0.CD.

> > > - Return an error on guest mode or SMM mode: Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For
> > TDX there shouldn't be an issue. If so, maybe this last one is not so
> > horrible.
>
> And the fact that there are so many variables to control (MAXPHYADDR, SMM,
> and guest_mode) basically invalidates the argument that returning an error
> makes the ioctl() hard to use. I can imagine it might be hard to squeeze
> this ioctl() into QEMU's existing code, but I don't buy that the ioctl()
> itself is hard to use.
>
> Literally the only thing userspace needs to do is set CPUID to implicitly
> select between 4-level and 5-level paging. If userspace wants to pre-map
> memory during live migration, or when jump-starting the guest with
> pre-defined state, simply pre-map memory before stuffing guest state. In
> and of itself, that doesn't seem difficult, e.g. at a quick glance, QEMU
> could add a hook somewhere in kvm_vcpu_thread_fn() without too much trouble
> (though that comes with a huge disclaimer that I only know enough about how
> QEMU manages vCPUs to be dangerous).
>
> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel.
> Returning an error punts any complexity to userspace. Specifically,
> anything that KVM can do regarding vCPU state to get the right MMU,
> userspace can do too.
>
> Add on that silently doing things that effectively ignore guest state
> usually ends badly, and I don't see a good argument for this patch (or any
> variant thereof).

Great.
On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > > But doesn't the fault handler need the vCPU state?
> >
> > Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There
> > are only six possible roots if TDP is enabled:
> >
> >   1. 4-level !SMM !guest_mode
> >   2. 4-level  SMM !guest_mode
> >   3. 5-level !SMM !guest_mode
> >   4. 5-level  SMM !guest_mode
> >   5. 4-level !SMM  guest_mode
> >   6. 5-level !SMM  guest_mode
> >
> > 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> > eliminates the SMM and guest_mode issues. If there is per-vCPU state that
> > makes its way into the TDP page tables, then we have problems, because it
> > means that there is per-vCPU state in per-VM structures that isn't
> > accounted for.
> >
> > There are a few edge cases where KVM treads carefully, e.g. if the fault is
> > to the vCPU's APIC-access page, but KVM manually handles those to avoid
> > consuming per-vCPU state.
> >
> > That said, I think this option is effectively 1b, because dropping the SMM
> > vs. guest_mode state has the same uAPI problems as forcibly swapping the
> > MMU; it's just a different way of doing so.
> >
> > The first question to answer is: do we want to return an error, or
> > "silently" install mappings for !SMM, !guest_mode? And so this option
> > becomes relevant only _if_ we want to unconditionally install mappings for
> > the "base" mode.
>
> Ah, I thought there was some logic around CR0.CD.

There is, but it's hopefully going the way of the dodo, along with full MTRR
virtualization:

https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
On Mon, Apr 15, 2024 at 02:17:02PM -0700, Sean Christopherson <seanjc@google.com> wrote:
> > > - Return an error on guest mode or SMM mode: Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
>
> And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
> guest_mode) basically invalidates the argument that returning an error makes
> the ioctl() hard to use. I can imagine it might be hard to squeeze this
> ioctl() into QEMU's existing code, but I don't buy that the ioctl() itself is
> hard to use.
>
> Literally the only thing userspace needs to do is set CPUID to implicitly
> select between 4-level and 5-level paging. If userspace wants to pre-map
> memory during live migration, or when jump-starting the guest with
> pre-defined state, simply pre-map memory before stuffing guest state. In and
> of itself, that doesn't seem difficult, e.g. at a quick glance, QEMU could
> add a hook somewhere in kvm_vcpu_thread_fn() without too much trouble (though
> that comes with a huge disclaimer that I only know enough about how QEMU
> manages vCPUs to be dangerous).
>
> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel.
> Returning an error punts any complexity to userspace. Specifically, anything
> that KVM can do regarding vCPU state to get the right MMU, userspace can do
> too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

OK, here is an experimental patch on top of 7/10 that returns an error. Is
this the right direction, or do we want to invoke the KVM page fault handler
without any check?

I can see the following options.

- Error if the vCPU is in SMM mode or guest mode: This patch.
  Defer the decision until the use cases come up. We can utilize
  KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future enhancement.
  Pro: Keeps room for future enhancement, deferring the decision for unclear
  use cases.
  Con: The userspace VMM has to check/switch the vCPU mode.

- No check of the vCPU mode, and carry on.
  Pro: It works.
  Con: It is unclear what the uAPI should be without concrete use cases.

- Always populate with L1 GPA:
  This is a bad idea.

---
 arch/x86/kvm/x86.c | 32 +++++++++----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ba9c1720ac9..2f3ceda5c225 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,10 +5871,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 			     struct kvm_memory_mapping *mapping)
 {
-	struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
 	u64 end, error_code = 0;
 	u8 level = PG_LEVEL_4K;
-	bool is_smm;
 	int r;
 
 	/*
@@ -5884,25 +5882,21 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	if (!tdp_enabled)
 		return -EOPNOTSUPP;
 
-	/* Force the use of L1 GPA regardless of vCPU MMU mode. */
-	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
-	if (is_smm ||
-	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
-	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
-		vcpu->arch.hflags &= ~HF_SMM_MASK;
-		mmu = vcpu->arch.mmu;
-		walk_mmu = vcpu->arch.walk_mmu;
-		vcpu->arch.mmu = &vcpu->arch.root_mmu;
-		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
+	/*
+	 * SMM mode results in populating SMM memory space with memslot id = 1.
+	 * Guest mode results in populating with L2 GPA.
+	 * Don't support those cases for now and punt them to future
+	 * discussion.
+	 */
+	if (is_smm(vcpu) || is_guest_mode(vcpu))
+		return -EOPNOTSUPP;
 
 	/* reload is optimized for repeated call. */
 	kvm_mmu_reload(vcpu);
 
 	r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
 	if (r)
-		goto out;
+		return r;
 
 	/* mapping->base_address is not necessarily aligned to level-hugepage. */
 	end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
 	      KVM_HPAGE_SIZE(level);
@@ -5910,14 +5904,6 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	mapping->size -= end - mapping->base_address;
 	mapping->base_address = end;
 
-out:
-	/* Restore MMU state. */
-	if (is_smm || mmu) {
-		vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
-		vcpu->arch.mmu = mmu;
-		vcpu->arch.walk_mmu = walk_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
 	return r;
 }
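[Editor's note: as a hedged illustration of the "userspace VMM has to
check/switch the vCPU mode" con above: a VMM that treats pre-population as a
pure optimization could simply swallow the error and fall back to on-demand
faulting. This reuses the hypothetical premap_range() helper sketched earlier
and is not part of the series.]

```c
/*
 * Hypothetical policy wrapper: with the experimental patch above, a vCPU in
 * SMM or guest mode fails with EOPNOTSUPP, and for non-TDX guests
 * pre-population is only an optimization, so just skip the fast path.
 */
static int try_premap(int vcpu_fd, __u64 gpa, __u64 len)
{
	int r = premap_range(vcpu_fd, gpa, len);

	if (r == -EOPNOTSUPP)
		return 0;	/* map on demand via page faults instead */
	return r;
}
```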
On Mon, Apr 15, 2024, Isaku Yamahata wrote:
> On Mon, Apr 15, 2024 at 02:17:02PM -0700, Sean Christopherson <seanjc@google.com> wrote:
> > > > - Return an error on guest mode or SMM mode: Without this patch.
> > > >   Pros: No additional patch.
> > > >   Cons: Difficult to use.
> > >
> > > Hmm... For the non-TDX use cases this is just an optimization, right? For
> > > TDX there shouldn't be an issue. If so, maybe this last one is not so
> > > horrible.
> >
> > And the fact that there are so many variables to control (MAXPHYADDR, SMM,
> > and guest_mode) basically invalidates the argument that returning an error
> > makes the ioctl() hard to use. I can imagine it might be hard to squeeze
> > this ioctl() into QEMU's existing code, but I don't buy that the ioctl()
> > itself is hard to use.
> >
> > Literally the only thing userspace needs to do is set CPUID to implicitly
> > select between 4-level and 5-level paging. If userspace wants to pre-map
> > memory during live migration, or when jump-starting the guest with
> > pre-defined state, simply pre-map memory before stuffing guest state. In
> > and of itself, that doesn't seem difficult, e.g. at a quick glance, QEMU
> > could add a hook somewhere in kvm_vcpu_thread_fn() without too much trouble
> > (though that comes with a huge disclaimer that I only know enough about how
> > QEMU manages vCPUs to be dangerous).
> >
> > I would describe the overall cons for this patch versus returning an error
> > differently. Switching MMU state puts the complexity in the kernel.
> > Returning an error punts any complexity to userspace. Specifically,
> > anything that KVM can do regarding vCPU state to get the right MMU,
> > userspace can do too.
> >
> > Add on that silently doing things that effectively ignore guest state
> > usually ends badly, and I don't see a good argument for this patch (or any
> > variant thereof).
>
> OK, here is an experimental patch on top of 7/10 that returns an error. Is
> this the right direction, or do we want to invoke the KVM page fault handler
> without any check?
>
> I can see the following options.
>
> - Error if the vCPU is in SMM mode or guest mode: This patch.
>   Defer the decision until the use cases come up. We can utilize
>   KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future enhancement.
>   Pro: Keeps room for future enhancement, deferring the decision for unclear
>   use cases.
>   Con: The userspace VMM has to check/switch the vCPU mode.
>
> - No check of the vCPU mode, and carry on.
>   Pro: It works.
>   Con: It is unclear what the uAPI should be without concrete use cases.
>
> - Always populate with L1 GPA:
>   This is a bad idea.

Not always. If L1 is using shadow paging, then L1 and L2 GPAs are in the same
domain from KVM's perspective.

As I said in v1 (I think it was v1), whether or not mapping an L1 GPA is
supported should be a property of the MMU, not an explicit check. As far as
the TDP MMU is concerned, there's nothing special about SMM nor is there
anything special about guest_mode, except for the fact that they use different
roots than "normal" mode. But that part Just Works.

And if L1 is using TDP, i.e. KVM is shadowing L1's TDP page tables, and L2 is
active, then vcpu->arch.mmu will point at a variation of the shadow MMU, e.g.
the PTTYPE_EPT MMU on Intel CPUs. The shadow MMU won't support pre-mapping
GPAs because it's nonsensical (legacy shadow paging needs a GVA, nested TDP
needs an L2 GPA), and so the ioctl() fails because mmu->map_gpa or whatever is
NULL.

In other words, unless I'm forgetting something, explicit checks for
"unsupported" modes should be unnecessary, because the ioctl() will naturally
fail when the active MMU has no way to map an L1 GPA.
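[Editor's note: a minimal sketch of what "a property of the MMU" could look
like in code. The map_gpa hook is Sean's own placeholder name ("or whatever"),
not an existing field of struct kvm_mmu; this is an assumption, not the
series' implementation.]

```c
/*
 * Hypothetical: let the ioctl() fail naturally for MMUs that can't map an
 * L1 GPA. Only a direct TDP MMU would install ->map_gpa; the shadow MMU
 * flavors (legacy shadow paging, nested TDP) would leave it NULL because
 * they translate GVAs or L2 GPAs, not L1 GPAs.
 */
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
			     struct kvm_memory_mapping *mapping)
{
	if (!vcpu->arch.mmu->map_gpa)
		return -EOPNOTSUPP;

	return vcpu->arch.mmu->map_gpa(vcpu, mapping);
}
```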
On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> +	/* Force the use of L1 GPA regardless of vCPU MMU mode. */
> +	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +	if (is_smm ||
> +	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +		vcpu->arch.hflags &= ~HF_SMM_MASK;

0-day informs me that the definition of HF_SMM_MASK depends on CONFIG_KVM_SMM.

> +		mmu = vcpu->arch.mmu;
> +		walk_mmu = vcpu->arch.walk_mmu;
> +		vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +		kvm_mmu_reset_context(vcpu);
> +	}
> +
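[Editor's note: a sketch of the build fix the 0-day report implies. The helper
shape below is an assumption about how the fix would look; note that the
experimental patch earlier in the thread sidesteps the problem by using the
existing is_smm() helper instead of HF_SMM_MASK directly.]

```c
/*
 * Sketch: HF_SMM_MASK is only defined when CONFIG_KVM_SMM=y, so an
 * open-coded check needs an ifdef guard along these lines...
 */
static bool vcpu_in_smm(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_KVM_SMM
	return !!(vcpu->arch.hflags & HF_SMM_MASK);
#else
	return false;
#endif
}
/*
 * ...which is essentially what is_smm() in arch/x86/kvm/smm.h already
 * provides, hence the switch to it in the follow-up patch.
 */
```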
On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <seanjc@google.com> wrote:
> On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> > I wouldn't call myself much of an expert on nested, but...
> >
> > On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> > > There are several options to populate L1 GPA irrespective of vCPU mode.
> > > - Switch vCPU MMU only: This patch.
> > >   Pros: Concise implementation.
> > >   Cons: Heavily dependent on the KVM MMU implementation.
>
> Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

+1.

> The first question to answer is: do we want to return an error, or "silently"
> install mappings for !SMM, !guest_mode? And so this option becomes relevant
> only _if_ we want to unconditionally install mappings for the "base" mode.
>
> > > - Return an error on guest mode or SMM mode: Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For
> > TDX there shouldn't be an issue. If so, maybe this last one is not so
> > horrible.

It doesn't even have to be ABI that it gives an error. As you say, this ioctl
can just be advisory only for !confidential machines. Even if it were
implemented, the shadow MMU can drop roots at any moment and/or kill the
mapping via the shrinker.

That said, I can't fully shake the feeling that this ioctl should be an error
for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The implementation was
ugly but the API was fine. Sorry about this; patches 3-5 can still be included
in kvm-coco-queue sooner rather than later.

> And the fact that there are so many variables to control (MAXPHYADDR, SMM,
> and guest_mode) basically invalidates the argument that returning an error
> makes the ioctl() hard to use. I can imagine it might be hard to squeeze this
> ioctl() into QEMU's existing code, but I don't buy that the ioctl() itself is
> hard to use.

Nah, I don't think so. With TDX it's just MAXPHYADDR; just invoke it after
KVM_SET_CPUID2 or TDX_INIT_VCPU, which is very early.

> Literally the only thing userspace needs to do is set CPUID to implicitly
> select between 4-level and 5-level paging. If userspace wants to pre-map
> memory during live migration, or when jump-starting the guest with
> pre-defined state, simply pre-map memory before stuffing guest state. In and
> of itself, that doesn't seem difficult, e.g. at a quick glance, QEMU could
> add a hook somewhere in kvm_vcpu_thread_fn() without too much trouble (though
> that comes with a huge disclaimer that I only know enough about how QEMU
> manages vCPUs to be dangerous).

Hehe :) the machine_done_notifier is probably a better place.

/me checks... yes, it's exactly where Xiaoyao did it (tdx_finalize_vm is the
notifier; it calls KVM_TDX_INIT_VCPU from tdx_post_init_vcpus, and then
KVM_MEMORY_MAPPING).

> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel.
> Returning an error punts any complexity to userspace. Specifically, anything
> that KVM can do regarding vCPU state to get the right MMU, userspace can do
> too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Agreed.

Paolo
On Tue, Apr 16, 2024, Paolo Bonzini wrote:
> On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > The first question to answer is: do we want to return an error, or
> > "silently" install mappings for !SMM, !guest_mode? And so this option
> > becomes relevant only _if_ we want to unconditionally install mappings for
> > the "base" mode.
> >
> > > > - Return an error on guest mode or SMM mode: Without this patch.
> > > >   Pros: No additional patch.
> > > >   Cons: Difficult to use.
> > >
> > > Hmm... For the non-TDX use cases this is just an optimization, right?
> > > For TDX there shouldn't be an issue. If so, maybe this last one is not
> > > so horrible.
>
> It doesn't even have to be ABI that it gives an error. As you say,
> this ioctl can just be advisory only for !confidential machines. Even
> if it were implemented, the shadow MMU can drop roots at any moment

Sure, but there's a difference between KVM _potentially_ dropping roots and
guaranteed failure because userspace is trying to do something that's
unsupported. But I think this is a non-issue, because it should really just be
as simple as:

	if (!mmu->pre_map_memory)
		return -EOPNOTSUPP;

Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:

	if (!tdp_mmu_enabled || !mmu->root_role.direct)
		return -EOPNOTSUPP;

> and/or kill the mapping via the shrinker.

Ugh, we really need to kill that code.

> That said, I can't fully shake the feeling that this ioctl should be
> an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> implementation was ugly but the API was fine.

Hmm, but IMO the implementation was ugly in no small part because of the
constraints put on KVM by the API. Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD
in the same ioctl() forced KVM to operate on vcpu0, and necessitated shoving
temporary data into a per-VM structure in order to get the source contents
into TDH.MEM.PAGE.ADD.

We could eliminate the vcpu0 grossness, but it would require a massive
refactor, which is also not a problem per se, but it's obviously not free.

Eliminating kvm_tdx.source_page is also doable, but it's not clear to me that
the end result would be a net positive. If userspace pre-maps the S-EPT
entries ahead of time, then KVM should have a straight shot to PAGE.ADD, i.e.
it doesn't need to "pass" the source page via a scratch field in kvm_tdx, and
I think/hope it would avoid the need to grab vcpu0 in order to get at an MMU
to build the S-EPT.

And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping
memory, which is generally useful, and can be especially beneficial for
confidential VMs (and TDX in particular) due to the added cost of a page fault
VM-Exit.

I'm not dead set on this generic ioctl(), but unless it ends up being a train
wreck for userspace, I think it will allow for cleaner and more reusable code
in KVM.
On Wed, Apr 17, 2024 at 1:00 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Hmm... For the non-TDX use cases this is just an optimization, right?
> > > > For TDX there shouldn't be an issue. If so, maybe this last one is not
> > > > so horrible.
> >
> > It doesn't even have to be ABI that it gives an error. As you say,
> > this ioctl can just be advisory only for !confidential machines. Even
> > if it were implemented, the shadow MMU can drop roots at any moment
>
> Sure, but there's a difference between KVM _potentially_ dropping roots and
> guaranteed failure because userspace is trying to do something that's
> unsupported. But I think this is a non-issue, because it should really just
> be as simple as:
>
> 	if (!mmu->pre_map_memory)
> 		return -EOPNOTSUPP;
>
> Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:
>
> 	if (!tdp_mmu_enabled || !mmu->root_role.direct)
> 		return -EOPNOTSUPP;
>
> > and/or kill the mapping via the shrinker.
>
> Ugh, we really need to kill that code.

OK, so let's add a KVM_CHECK_EXTENSION so that people can check whether it's
supported.

> > That said, I can't fully shake the feeling that this ioctl should be
> > an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> > implementation was ugly but the API was fine.
>
> Hmm, but IMO the implementation was ugly in no small part because of the
> constraints put on KVM by the API. Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD
> in the same ioctl() forced KVM to operate on vcpu0, and necessitated shoving
> temporary data into a per-VM structure in order to get the source contents
> into TDH.MEM.PAGE.ADD.

That's because it was trying to do two things with a single loop. It's not
needed, and in fact KVM_CAP_MEMORY_MAPPING forces userspace to do it in two
passes.

> And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping
> memory, which is generally useful, and can be especially beneficial for
> confidential VMs (and TDX in particular) due to the added cost of a page
> fault VM-Exit.
>
> I'm not dead set on this generic ioctl(), but unless it ends up being a train
> wreck for userspace, I think it will allow for cleaner and more reusable code
> in KVM.

Yes, this ioctl() can stay. Forcing it before adding memory to TDX is ugly,
but it's not a blocker. I'll look at it closely and see how far it is from
being committable to kvm-coco-queue.

Paolo
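[Editor's note: a hedged sketch of the userspace side of that
KVM_CHECK_EXTENSION probe, assuming the capability keeps the
KVM_CAP_MAP_MEMORY name mentioned earlier in the thread; the final name could
differ.]

```c
#include <linux/kvm.h>	/* assumed to define KVM_CAP_MAP_MEMORY */
#include <sys/ioctl.h>
#include <stdbool.h>

/*
 * Probe for pre-population support before relying on it.
 * KVM_CHECK_EXTENSION returns 0 for capabilities the kernel doesn't know
 * about, so old kernels are handled for free.
 */
static bool kvm_supports_premap(int kvm_fd)
{
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAP_MEMORY) > 0;
}
```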
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c765de3531e..8ba9c1720ac9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,8 +5871,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 			     struct kvm_memory_mapping *mapping)
 {
+	struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
 	u64 end, error_code = 0;
 	u8 level = PG_LEVEL_4K;
+	bool is_smm;
 	int r;
 
 	/*
@@ -5882,18 +5884,40 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	if (!tdp_enabled)
 		return -EOPNOTSUPP;
 
+	/* Force the use of L1 GPA regardless of vCPU MMU mode. */
+	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
+	if (is_smm ||
+	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
+	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
+		vcpu->arch.hflags &= ~HF_SMM_MASK;
+		mmu = vcpu->arch.mmu;
+		walk_mmu = vcpu->arch.walk_mmu;
+		vcpu->arch.mmu = &vcpu->arch.root_mmu;
+		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+		kvm_mmu_reset_context(vcpu);
+	}
+
 	/* reload is optimized for repeated call. */
 	kvm_mmu_reload(vcpu);
 
 	r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
 	if (r)
-		return r;
+		goto out;
 
 	/* mapping->base_address is not necessarily aligned to level-hugepage. */
 	end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
 	      KVM_HPAGE_SIZE(level);
 	mapping->size -= end - mapping->base_address;
 	mapping->base_address = end;
+
+out:
+	/* Restore MMU state. */
+	if (is_smm || mmu) {
+		vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
+		vcpu->arch.mmu = mmu;
+		vcpu->arch.walk_mmu = walk_mmu;
+		kvm_mmu_reset_context(vcpu);
+	}
 	return r;
 }