Message ID: 20240904030751.117579-10-rick.p.edgecombe@intel.com (mailing list archive)
State: New, archived
Series: TDX MMU Part 2
On 4/09/2024 3:07 pm, Rick Edgecombe wrote: > From: Yuan Yao <yuan.yao@intel.com> > > TDX module internally uses locks to protect internal resources. It tries > to acquire the locks. If it fails to obtain the lock, it returns > TDX_OPERAND_BUSY error without spin because its execution time limitation. > > TDX SEAMCALL API reference describes what resources are used. It's known > which TDX SEAMCALL can cause contention with which resources. VMM can > avoid contention inside the TDX module by avoiding contentious TDX SEAMCALL > with, for example, spinlock. Because OS knows better its process > scheduling and its scalability, a lock at OS/VMM layer would work better > than simply retrying TDX SEAMCALLs. > > TDH.MEM.* API except for TDH.MEM.TRACK operates on a secure EPT tree and > the TDX module internally tries to acquire the lock of the secure EPT tree. > They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT in case of failure to > get the lock. TDX KVM allows sept callbacks to return error so that TDP > MMU layer can retry. > > Retry TDX TDH.MEM.* API on the error because the error is a rare event > caused by zero-step attack mitigation. The last paragraph seems can be improved: It seems to say the "TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT" can only be cauesd by zero-step attack detection/mitigation, which isn't true from the previous paragraph. In fact, I think this patch can be dropped: 1) The TDH_MEM_xx()s can return BUSY due to nature of TDP MMU, but all the callers of TDH_MEM_xx()s are already explicitly retrying by looking at the patch "KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table" -- they either return PF_RETRY to let the fault to happen again or explicitly loop until no BUSY is returned. So I am not sure why we need to "loo SEAMCALL_RETRY_MAX (16) times" in the common code. 2) TDH_VP_ENTER explicitly retries immediately for such case: /* See the comment of tdx_seamcall_sept(). */ if (unlikely(vp_enter_ret == TDX_ERROR_SEPT_BUSY)) return EXIT_FASTPATH_REENTER_GUEST; 3) That means the _ONLY_ reason to retry in the common code for TDH_MEM_xx()s is to mitigate zero-step attack by reducing the times of letting guest to fault on the same instruction. I don't think we need to handle zero-step attack mitigation in the first TDX support submission. So I think we can just remove this patch. > > Signed-off-by: Yuan Yao <yuan.yao@intel.com> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> > --- > TDX MMU part 2 v1: > - Updates from seamcall overhaul (Kai) > > v19: > - fix typo TDG.VP.ENTER => TDH.VP.ENTER, > TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY > - drop the description on TDH.VP.ENTER as this patch doesn't touch > TDH.VP.ENTER > --- > arch/x86/kvm/vmx/tdx_ops.h | 48 ++++++++++++++++++++++++++++++++------ > 1 file changed, 41 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h > index 0363d8544f42..8ca3e252a6ed 100644 > --- a/arch/x86/kvm/vmx/tdx_ops.h > +++ b/arch/x86/kvm/vmx/tdx_ops.h > @@ -31,6 +31,40 @@ > #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8) \ > pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8) > > +/* > + * TDX module acquires its internal lock for resources. It doesn't spin to get > + * locks because of its restrictions of allowed execution time. Instead, it > + * returns TDX_OPERAND_BUSY with an operand id. > + * > + * Multiple VCPUs can operate on SEPT. 
Also with zero-step attack mitigation, > + * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step > + * attack is suspected. It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT > + * with TDH.MEM.* operation. Note: TDH.MEM.TRACK is an exception. > + * > + * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL > + * spoils TDP MMU effort. Retry several times with the assumption that SEPT > + * lock contention is rare. But don't loop forever to avoid lockup. Let TDP > + * MMU retry. > + */ > +#define TDX_ERROR_SEPT_BUSY (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT) > + > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in) > +{ > +#define SEAMCALL_RETRY_MAX 16 > + struct tdx_module_args args_in; > + int retry = SEAMCALL_RETRY_MAX; > + u64 ret; > + > + do { > + args_in = *in; > + ret = seamcall_ret(op, in); > + } while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0); > + > + *in = args_in; > + > + return ret; > +} > + > static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr) > { > struct tdx_module_args in = { > @@ -55,7 +89,7 @@ static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa, > u64 ret; > > clflush_cache_range(__va(hpa), PAGE_SIZE); > - ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in); > + ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -76,7 +110,7 @@ static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa, > > clflush_cache_range(__va(page), PAGE_SIZE); > > - ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in); > + ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -93,7 +127,7 @@ static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa, > }; > u64 ret; > > - ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in); > + ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -123,7 +157,7 @@ static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa > u64 ret; > > clflush_cache_range(__va(hpa), PAGE_SIZE); > - ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in); > + ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -140,7 +174,7 @@ static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa, > }; > u64 ret; > > - ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in); > + ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -335,7 +369,7 @@ static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa, > }; > u64 ret; > > - ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in); > + ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in); > > *rcx = in.rcx; > *rdx = in.rdx; > @@ -361,7 +395,7 @@ static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa, > }; > u64 ret; > > - ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in); > + ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in); > > *rcx = in.rcx; > *rdx = in.rdx;
On 9/4/24 05:07, Rick Edgecombe wrote: > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in) > +{ > +#define SEAMCALL_RETRY_MAX 16 How is the 16 determined? Also, is the lock per-VM or global? Thanks, Paolo > + struct tdx_module_args args_in; > + int retry = SEAMCALL_RETRY_MAX; > + u64 ret; > + > + do { > + args_in = *in; > + ret = seamcall_ret(op, in); > + } while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0); > + > + *in = args_in; > + > + return ret; > +} > +
On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote: > On 9/4/24 05:07, Rick Edgecombe wrote: > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in) > > +{ > > +#define SEAMCALL_RETRY_MAX 16 > > How is the 16 determined? Also, is the lock per-VM or global? The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can share the history. In any case, there seems to be some problems with this patch or justification. Regarding the zero-step mitigation, the TDX Module has a mitigation for an attack where a malicious VMM causes repeated private EPT violations for the same GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of zero-step detection, these SEPT related SEAMCALLs will exit with the checked error code if they contend the mentioned lock. If there was some other (non- zero-step related) contention for this lock and KVM tries to re-enter the TD too many times without resolving an EPT violation, it might inadvertently trigger the zero-step mitigation. I *think* this patch is trying to say not to worry about this case, and do a simple retry loop instead to handle the contention. But why 16 retries would be sufficient, I can't find a reason for. Getting this required retry logic right is important because some failures (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s. Per the docs, in general the VMM is supposed to retry SEAMCALLs that return TDX_OPERAND_BUSY. I think we need to revisit the general question of which SEAMCALLs we should be retrying and how many times/how long. The other consideration is that KVM already has per-VM locking, that would prevent contention for some of the locks. So depending on internal details KVM may not need to do any retries in some cases.
On Fri, 2024-09-06 at 13:41 +1200, Huang, Kai wrote: > 3) That means the _ONLY_ reason to retry in the common code for > TDH_MEM_xx()s is to mitigate zero-step attack by reducing the times of > letting guest to fault on the same instruction. My read of the zero-step mitigation is that it is implemented in the TDX module (which makes sense, since it is defending against VMMs). There is some optional ability for the guest to request notification, but the host defense is always in place. Is that your understanding? > > I don't think we need to handle zero-step attack mitigation in the first > TDX support submission. So I think we can just remove this patch. Thanks for highlighting the weirdness here. I think it needs more investigation.
On Mon, Sep 09, 2024, Rick P Edgecombe wrote: > On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote: > > On 9/4/24 05:07, Rick Edgecombe wrote: > > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in) > > > +{ > > > +#define SEAMCALL_RETRY_MAX 16 > > > > How is the 16 determined? Also, is the lock per-VM or global? > > The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be > for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can > share the history. In any case, there seems to be some problems with this patch > or justification. > > Regarding the zero-step mitigation, the TDX Module has a mitigation for an > attack where a malicious VMM causes repeated private EPT violations for the same > GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of > zero-step detection, these SEPT related SEAMCALLs will exit with the checked > error code if they contend the mentioned lock. If there was some other (non- > zero-step related) contention for this lock and KVM tries to re-enter the TD too > many times without resolving an EPT violation, it might inadvertently trigger > the zero-step mitigation. I *think* this patch is trying to say not to worry > about this case, and do a simple retry loop instead to handle the contention. > > But why 16 retries would be sufficient, I can't find a reason for. Getting this > required retry logic right is important because some failures > (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s. I (somewhat indirectly) raised this as an issue in v11, and at a (very quick) glance, nothing has changed to alleviate my concerns. In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever. For its operations, I'm pretty sure the only sane approach is for KVM to ensure there will be no contention. And if the TDX module's single-step protection spuriously kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't communicate that it's mitigating single-step, e.g. so that KVM can forward the information to userspace, then that's a TDX module problem to solve. > Per the docs, in general the VMM is supposed to retry SEAMCALLs that return > TDX_OPERAND_BUSY. IMO, that's terrible advice. SGX has similar behavior, where the xucode "module" signals #GP if there's a conflict. #GP is obviously far, far worse as it lacks the precision that would help software understand exactly what went wrong, but I think one of the better decisions we made with the SGX driver was to have a "zero tolerance" policy where the driver would _never_ retry due to a potential resource conflict, i.e. that any conflict in the module would be treated as a kernel bug. > I think we need to revisit the general question of which > SEAMCALLs we should be retrying and how many times/how long. The other > consideration is that KVM already has per-VM locking, that would prevent > contention for some of the locks. So depending on internal details KVM may not > need to do any retries in some cases. Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want to know exactly why that is the case.
On Mon, Sep 09, 2024, Sean Christopherson wrote: > On Mon, Sep 09, 2024, Rick P Edgecombe wrote: > > On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote: > > > On 9/4/24 05:07, Rick Edgecombe wrote: > > > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in) > > > > +{ > > > > +#define SEAMCALL_RETRY_MAX 16 > > > > > > How is the 16 determined? Also, is the lock per-VM or global? > > > > The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be > > for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can > > share the history. In any case, there seems to be some problems with this patch > > or justification. > > > > Regarding the zero-step mitigation, the TDX Module has a mitigation for an > > attack where a malicious VMM causes repeated private EPT violations for the same > > GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of > > zero-step detection, these SEPT related SEAMCALLs will exit with the checked > > error code if they contend the mentioned lock. If there was some other (non- > > zero-step related) contention for this lock and KVM tries to re-enter the TD too > > many times without resolving an EPT violation, it might inadvertently trigger > > the zero-step mitigation. I *think* this patch is trying to say not to worry > > about this case, and do a simple retry loop instead to handle the contention. > > > > But why 16 retries would be sufficient, I can't find a reason for. Getting this > > required retry logic right is important because some failures > > (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s. > > I (somewhat indirectly) raised this as an issue in v11, and at a (very quick) > glance, nothing has changed to alleviate my concerns. Gah, went out of my way to find the thread and then forgot to post the link: https://lore.kernel.org/all/Y8m34OEVBfL7Q4Ns@google.com > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever. For > its operations, I'm pretty sure the only sane approach is for KVM to ensure there > will be no contention. And if the TDX module's single-step protection spuriously > kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't communicate > that it's mitigating single-step, e.g. so that KVM can forward the information > to userspace, then that's a TDX module problem to solve. > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that return > > TDX_OPERAND_BUSY. > > IMO, that's terrible advice. SGX has similar behavior, where the xucode "module" > signals #GP if there's a conflict. #GP is obviously far, far worse as it lacks > the precision that would help software understand exactly what went wrong, but I > think one of the better decisions we made with the SGX driver was to have a > "zero tolerance" policy where the driver would _never_ retry due to a potential > resource conflict, i.e. that any conflict in the module would be treated as a > kernel bug. > > > I think we need to revisit the general question of which > > SEAMCALLs we should be retrying and how many times/how long. The other > > consideration is that KVM already has per-VM locking, that would prevent > > contention for some of the locks. So depending on internal details KVM may not > > need to do any retries in some cases. > > Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want > to know exactly why that is the case.
On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote: > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever. > > For > > its operations, I'm pretty sure the only sane approach is for KVM to ensure > > there > > will be no contention. And if the TDX module's single-step protection > > spuriously > > kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't > > communicate > > that it's mitigating single-step, e.g. so that KVM can forward the > > information > > to userspace, then that's a TDX module problem to solve. > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that > > > return > > > TDX_OPERAND_BUSY. > > > > IMO, that's terrible advice. SGX has similar behavior, where the xucode > > "module" > > signals #GP if there's a conflict. #GP is obviously far, far worse as it > > lacks > > the precision that would help software understand exactly what went wrong, > > but I > > think one of the better decisions we made with the SGX driver was to have a > > "zero tolerance" policy where the driver would _never_ retry due to a > > potential > > resource conflict, i.e. that any conflict in the module would be treated as > > a > > kernel bug. Thanks for the analysis. The direction seems reasonable to me for this lock in particular. We need to do some analysis on how much the existing mmu_lock can protects us. Maybe sprinkle some asserts for documentation purposes. For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest side operations can take the locks too. From "Base Architecture Specification": " Host-Side (SEAMCALL) Operation ------------------------------ The host VMM is expected to retry host-side operations that fail with a TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at most after a limited time (the longest guest-side TDX module flow) there will be no contention with a guest TD attempting to acquire access to the same resource. Lock operations process the HOST_PRIORITY bit as follows: - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is the host VMM’s responsibility to re-attempt the SEAMCALL function until is succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD from acquiring the lock. - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the lock’s HOST_PRIORITY bit. Guest-Side (TDCALL) Operation ----------------------------- A TDCALL (guest-side) function that attempt to acquire a lock fails if HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest. The guest is expected to retry the operation. Guest-side TDCALL flows that acquire a host priority lock have an upper bound on the host-side latency for that lock; once a lock is acquired, the flow either releases within a fixed upper time bound, or periodically monitor the HOST_PRIORITY flag to see if the host is attempting to acquire the lock. " So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is involved in sorting out contention between the guest as well. We need to double check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the functionality we need to exercise for base support. The thing that makes me nervous about retry based solution is the potential for some kind deadlock like pattern. 
Just to gather your opinion, if there was some SEAMCALL contention that couldn't be locked around from KVM, but came with some strong well described guarantees, would a retry loop be hard NAK still?
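As a concrete illustration of the host-priority protocol quoted above, a minimal sketch of the "retry until it succeeds" behavior the spec expects from the VMM could look like the following. seamcall_ret(), struct tdx_module_args and TDX_ERROR_SEPT_BUSY are taken from the patch under discussion, the function name is made up, and an unbounded loop like this is exactly what the rest of the thread pushes back on:

        /*
         * Illustration only: keep retrying while the TDX module reports BUSY,
         * as the spec text above describes.  Each attempt runs on a scratch
         * copy so the original inputs are reused on retry and the final
         * outputs are handed back to the caller.
         */
        static u64 tdx_seamcall_retry_busy(u64 op, struct tdx_module_args *in)
        {
                struct tdx_module_args args;
                u64 ret;

                do {
                        args = *in;                     /* retry with the original inputs */
                        ret = seamcall_ret(op, &args);  /* outputs land in 'args' */
                } while (ret == TDX_ERROR_SEPT_BUSY);   /* HOST_PRIORITY stays set until success */

                *in = args;                             /* hand the final outputs back */
                return ret;
        }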
On Mon, Sep 09, 2024, Rick P Edgecombe wrote: > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote: > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, > > > ever. For its operations, I'm pretty sure the only sane approach is for > > > KVM to ensure there will be no contention. And if the TDX module's > > > single-step protection spuriously kicks in, KVM exits to userspace. If > > > the TDX module can't/doesn't/won't communicate that it's mitigating > > > single-step, e.g. so that KVM can forward the information to userspace, > > > then that's a TDX module problem to solve. > > > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that > > > > return TDX_OPERAND_BUSY. > > > > > > IMO, that's terrible advice. SGX has similar behavior, where the xucode > > > "module" signals #GP if there's a conflict. #GP is obviously far, far > > > worse as it lacks the precision that would help software understand > > > exactly what went wrong, but I think one of the better decisions we made > > > with the SGX driver was to have a "zero tolerance" policy where the > > > driver would _never_ retry due to a potential resource conflict, i.e. > > > that any conflict in the module would be treated as a kernel bug. > > Thanks for the analysis. The direction seems reasonable to me for this lock in > particular. We need to do some analysis on how much the existing mmu_lock can > protects us. I would operate under the assumption that it provides SEPT no meaningful protection. I think I would even go so far as to say that it is a _requirement_ that mmu_lock does NOT provide the ordering required by SEPT, because I do not want to take on any risk (due to SEPT constraints) that would limit KVM's ability to do things while holding mmu_lock for read. > Maybe sprinkle some asserts for documentation purposes. Not sure I understand, assert on what? > For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest > side operations can take the locks too. From "Base Architecture Specification": > " > Host-Side (SEAMCALL) Operation > ------------------------------ > The host VMM is expected to retry host-side operations that fail with a > TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at > most after a limited time (the longest guest-side TDX module flow) there will be > no contention with a guest TD attempting to acquire access to the same resource. > > Lock operations process the HOST_PRIORITY bit as follows: > - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s > HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is > the host VMM’s responsibility to re-attempt the SEAMCALL function until is > succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD > from acquiring the lock. > - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the > lock’s HOST_PRIORITY bit. *sigh* > Guest-Side (TDCALL) Operation > ----------------------------- > A TDCALL (guest-side) function that attempt to acquire a lock fails if > HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest. > The guest is expected to retry the operation. 
> > Guest-side TDCALL flows that acquire a host priority lock have an upper bound on > the host-side latency for that lock; once a lock is acquired, the flow either > releases within a fixed upper time bound, or periodically monitor the > HOST_PRIORITY flag to see if the host is attempting to acquire the lock. > " > > So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is > involved in sorting out contention between the guest as well. We need to double > check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the > functionality we need to exercise for base support. > > The thing that makes me nervous about retry based solution is the potential for > some kind deadlock like pattern. Just to gather your opinion, if there was some > SEAMCALL contention that couldn't be locked around from KVM, but came with some > strong well described guarantees, would a retry loop be hard NAK still? I don't know. It would depend on what operations can hit BUSY, and what the alternatives are. E.g. if we can narrow down the retry paths to a few select cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of deadlock, then maybe that's the least awful option. What I don't think KVM should do is blindly retry N number of times, because then there are effectively no rules whatsoever. E.g. if KVM is tearing down a VM then KVM should assert on immediate success. And if KVM is handling a fault on behalf of a vCPU, then KVM can and should resume the guest and let it retry. Ugh, but that would likely trigger the annoying "zero-step mitigation" crap. What does this actually mean in practice? What's the threshold, is the VM-Enter error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if KVM runs afoul of the zero-step mitigation? After a pre-determined number of such EPT violations occur on the same instruction, the TDX module starts tracking the GPAs that caused Secure EPT faults and fails further host VMM attempts to enter the TD VCPU unless previously faulting private GPAs are properly mapped in the Secure EPT. If HOST_PRIORITY is set, then one idea would be to resume the guest if there's SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered, kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and can't be re-locked by the guest. That would allow KVM to guarantee forward progress without an arbitrary retry loop in the TDP MMU. Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the one and only retry is guaranteed to succeed.
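A rough sketch of the "kick and retry once" idea described here, under the assumption that the lock really is host-priority so a single retry after forcing every vCPU out of the guest should succeed; the helper name is hypothetical and KVM_REQ_OUTSIDE_GUEST_MODE is just one possible kick mechanism:

        /*
         * Hypothetical sketch: if an S-EPT SEAMCALL hits the BUSY error, kick
         * every vCPU out of guest mode so any guest-side holder of the
         * (host-priority) lock drops it, then retry exactly once and treat a
         * second failure as a KVM bug.
         */
        static u64 tdx_seamcall_sept_kick_once(struct kvm *kvm, u64 op,
                                               struct tdx_module_args *in)
        {
                struct tdx_module_args args = *in;
                u64 ret;

                ret = seamcall_ret(op, &args);
                if (ret == TDX_ERROR_SEPT_BUSY) {
                        /*
                         * Wait for all vCPUs to exit; HOST_PRIORITY should keep
                         * the guest from re-taking the lock before the retry.
                         */
                        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

                        args = *in;
                        ret = seamcall_ret(op, &args);

                        /* The one retry is expected to succeed. */
                        KVM_BUG_ON(ret == TDX_ERROR_SEPT_BUSY, kvm);
                }

                *in = args;
                return ret;
        }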
On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote: > On Mon, Sep 09, 2024, Rick P Edgecombe wrote: > > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote: > > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, > > > > ever. For its operations, I'm pretty sure the only sane approach is for > > > > KVM to ensure there will be no contention. And if the TDX module's > > > > single-step protection spuriously kicks in, KVM exits to userspace. If > > > > the TDX module can't/doesn't/won't communicate that it's mitigating > > > > single-step, e.g. so that KVM can forward the information to userspace, > > > > then that's a TDX module problem to solve. > > > > > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that > > > > > return TDX_OPERAND_BUSY. > > > > > > > > IMO, that's terrible advice. SGX has similar behavior, where the xucode > > > > "module" signals #GP if there's a conflict. #GP is obviously far, far > > > > worse as it lacks the precision that would help software understand > > > > exactly what went wrong, but I think one of the better decisions we made > > > > with the SGX driver was to have a "zero tolerance" policy where the > > > > driver would _never_ retry due to a potential resource conflict, i.e. > > > > that any conflict in the module would be treated as a kernel bug. > > > > Thanks for the analysis. The direction seems reasonable to me for this lock > > in > > particular. We need to do some analysis on how much the existing mmu_lock > > can > > protects us. > > I would operate under the assumption that it provides SEPT no meaningful > protection. > I think I would even go so far as to say that it is a _requirement_ that > mmu_lock > does NOT provide the ordering required by SEPT, because I do not want to take > on > any risk (due to SEPT constraints) that would limit KVM's ability to do things > while holding mmu_lock for read. Ok. Not sure, but I think you are saying not to add any extra acquisitions of mmu_lock. > > > > Maybe sprinkle some asserts for documentation purposes. > > Not sure I understand, assert on what? Please ignore. For the asserts, I was imagining mmu_lock acquisitions in core MMU code might already protect the non-zero-step TDX_OPERAND_BUSY cases, and we could somehow explain this in code. But it seems less likely. [snip] > > I don't know. It would depend on what operations can hit BUSY, and what the > alternatives are. E.g. if we can narrow down the retry paths to a few select > cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of > deadlock, then maybe that's the least awful option. > > What I don't think KVM should do is blindly retry N number of times, because > then there are effectively no rules whatsoever. Complete agreement. > E.g. if KVM is tearing down a > VM then KVM should assert on immediate success. And if KVM is acquit ions a > fault > on behalf of a vCPU, then KVM can and should resume the guest and let it > retry. > Ugh, but that would likely trigger the annoying "zero-step mitigation" crap. > > What does this actually mean in practice? What's the threshold, is the VM- > Enter > error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if > KVM > runs afoul of the zero-step mitigation? 
> > After a pre-determined number of such EPT violations occur on the same > instruction, > the TDX module starts tracking the GPAs that caused Secure EPT faults and > fails > further host VMM attempts to enter the TD VCPU unless previously faulting > private > GPAs are properly mapped in the Secure EPT. > > If HOST_PRIORITY is set, then one idea would be to resume the guest if there's > SEPT contention on a fault, and then _if_ the zero-step mitigation is > triggered, > kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked > and > can't be re-locked by the guest. That would allow KVM to guarantee forward > progress without an arbitrary retry loop in the TDP MMU. > > Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure > the > one and only retry is guaranteed to succeed. Ok so not against retry loops, just against magic number retry loops with no explanation that can be found. Makes sense. Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard to say. We need to check some stuff on our end.
On Tue, Sep 10, 2024, Rick P Edgecombe wrote: > On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote: > > On Mon, Sep 09, 2024, Rick P Edgecombe wrote: > > > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote: > > > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, > > > > > ever. For its operations, I'm pretty sure the only sane approach is for > > > > > KVM to ensure there will be no contention. And if the TDX module's > > > > > single-step protection spuriously kicks in, KVM exits to userspace. If > > > > > the TDX module can't/doesn't/won't communicate that it's mitigating > > > > > single-step, e.g. so that KVM can forward the information to userspace, > > > > > then that's a TDX module problem to solve. > > > > > > > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that > > > > > > return TDX_OPERAND_BUSY. > > > > > > > > > > IMO, that's terrible advice. SGX has similar behavior, where the xucode > > > > > "module" signals #GP if there's a conflict. #GP is obviously far, far > > > > > worse as it lacks the precision that would help software understand > > > > > exactly what went wrong, but I think one of the better decisions we made > > > > > with the SGX driver was to have a "zero tolerance" policy where the > > > > > driver would _never_ retry due to a potential resource conflict, i.e. > > > > > that any conflict in the module would be treated as a kernel bug. > > > > > > Thanks for the analysis. The direction seems reasonable to me for this lock > > > in > > > particular. We need to do some analysis on how much the existing mmu_lock > > > can > > > protects us. > > > > I would operate under the assumption that it provides SEPT no meaningful > > protection. > > I think I would even go so far as to say that it is a _requirement_ that > > mmu_lock > > does NOT provide the ordering required by SEPT, because I do not want to take > > on > > any risk (due to SEPT constraints) that would limit KVM's ability to do things > > while holding mmu_lock for read. > > Ok. Not sure, but I think you are saying not to add any extra acquisitions of > mmu_lock. No new write_lock. If read_lock is truly needed, no worries. But SEPT needing a write_lock is likely a hard "no", as the TDP MMU's locking model depends heavily on vCPUs being readers. E.g. the TDP MMU has _much_ coarser granularity than core MM, but it works because almost everything is done while holding mmu_lock for read. > Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard > to say. We need to check some stuff on our end. Ya, agreed.
On 9/9/24 23:11, Sean Christopherson wrote: > In general, I am_very_ opposed to blindly retrying an SEPT SEAMCALL, ever. For > its operations, I'm pretty sure the only sane approach is for KVM to ensure there > will be no contention. And if the TDX module's single-step protection spuriously > kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't communicate > that it's mitigating single-step, e.g. so that KVM can forward the information > to userspace, then that's a TDX module problem to solve. In principle I agree but we also need to be pragmatic. Exiting to userspace may not be practical in all flows, for example. First of all, we can add a spinlock around affected seamcalls. This way we know that "busy" errors must come from the guest and have set HOST_PRIORITY. It is still kinda bad that guests can force the VMM to loop, but the VMM can always say enough is enough. In other words, let's assume that a limit of 16 is probably appropriate but we can also increase the limit and crash the VM if things become ridiculous. Something like this: static u32 max = 16; int retry = 0; spin_lock(&kvm->arch.seamcall_lock); for (;;) { args_in = *in; ret = seamcall_ret(op, in); if (++retry == 1) { /* protected by the same seamcall_lock */ kvm->stat.retried_seamcalls++; } else if (retry == READ_ONCE(max)) { pr_warn("Exceeded %d retries for S-EPT operation\n", max); if (KVM_BUG_ON(kvm, retry == 1024)) { pr_err("Crashing due to lock contention in the TDX module\n"); break; } cmpxchg(&max, retry, retry * 2); } } spin_unlock(&kvm->arch.seamcall_lock); This way we can do some testing and figure out a useful limit. For zero step detection, my reading is that it's TDH.VP.ENTER that fails; not any of the MEM seamcalls. For that one to be resolved, it should be enough to do take and release the mmu_lock back to back, which ensures that all pending critical sections have completed (that is, "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then loop. Adding a vCPU stat for that one is a good idea, too. Paolo
On Tue, Sep 10, 2024, Paolo Bonzini wrote: > On 9/9/24 23:11, Sean Christopherson wrote: > > In general, I am_very_ opposed to blindly retrying an SEPT SEAMCALL, ever. For > > its operations, I'm pretty sure the only sane approach is for KVM to ensure there > > will be no contention. And if the TDX module's single-step protection spuriously > > kicks in, KVM exits to userspace. If the TDX module can't/doesn't/won't communicate > > that it's mitigating single-step, e.g. so that KVM can forward the information > > to userspace, then that's a TDX module problem to solve. > > In principle I agree but we also need to be pragmatic. Exiting to userspace > may not be practical in all flows, for example. > > First of all, we can add a spinlock around affected seamcalls. No, because that defeates the purpose of having mmu_lock be a rwlock. > This way we know that "busy" errors must come from the guest and have set > HOST_PRIORITY. We should be able to achieve that without a VM-wide spinlock. My thought (from v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep it set until the SEAMCALL completes. > It is still kinda bad that guests can force the VMM to loop, but the VMM can > always say enough is enough. In other words, let's assume that a limit of > 16 is probably appropriate but we can also increase the limit and crash the > VM if things become ridiculous. > > Something like this: > > static u32 max = 16; > int retry = 0; > spin_lock(&kvm->arch.seamcall_lock); > for (;;) { > args_in = *in; > ret = seamcall_ret(op, in); > if (++retry == 1) { > /* protected by the same seamcall_lock */ > kvm->stat.retried_seamcalls++; > } else if (retry == READ_ONCE(max)) { > pr_warn("Exceeded %d retries for S-EPT operation\n", max); > if (KVM_BUG_ON(kvm, retry == 1024)) { > pr_err("Crashing due to lock contention in the TDX module\n"); > break; > } > cmpxchg(&max, retry, retry * 2); > } > } > spin_unlock(&kvm->arch.seamcall_lock); > > This way we can do some testing and figure out a useful limit. 2 :-) One try that guarantees no other host task is accessing the S-EPT entry, and a second try after blasting IPI to kick vCPUs to ensure no guest-side task has locked the S-EPT entry. My concern with an arbitrary retry loop is that we'll essentially propagate the TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so retrying even ~20 times could exceed the system's tolerances for scheduling, RCU, etc... > For zero step detection, my reading is that it's TDH.VP.ENTER that fails; > not any of the MEM seamcalls. For that one to be resolved, it should be > enough to do take and release the mmu_lock back to back, which ensures that > all pending critical sections have completed (that is, > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then > loop. Adding a vCPU stat for that one is a good idea, too. As above and in my discussion with Rick, I would prefer to kick vCPUs to force forward progress, especially for the zero-step case. If KVM gets to the point where it has retried TDH.VP.ENTER on the same fault so many times that zero-step kicks in, then it's time to kick and wait, not keep retrying blindly. There is still risk of a hang, e.g. if a CPU fails to respond to the IPI, but that's a possibility that always exists. Kicking vCPUs allows KVM to know with 100% certainty that a SEAMCALL should succeed. 
Hrm, the wrinkle is that if we want to guarantee success, the vCPU kick would need to happen when the SPTE is frozen, to ensure some other host task doesn't "steal" the lock.
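To make the "FROZEN_SPTE as a per-SPTE spinlock" idea more concrete, here is a purely conceptual sketch; the function name is made up and tdx_sept_seamcall_once() stands in for whichever TDH.MEM call is being issued. The point is only that the entry stays frozen until the SEAMCALL completes, so no other host task can operate on the same S-EPT entry concurrently:

        /*
         * Conceptual sketch (names are assumptions, not actual KVM code): the
         * task that wins the cmpxchg owns the S-EPT entry, issues the
         * SEAMCALL, and only unfreezes the SPTE once the TDX module has been
         * updated.
         */
        static int tdx_set_external_spte_frozen(struct kvm *kvm, u64 *sptep,
                                                u64 old_spte, u64 new_spte)
        {
                /* Whoever loses the cmpxchg backs off and retries the fault later. */
                if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
                        return -EBUSY;

                /*
                 * The entry is now frozen: other host tasks see FROZEN_SPTE and
                 * leave it alone, so the SEAMCALL cannot race with a host-side user.
                 */
                if (tdx_sept_seamcall_once(kvm, sptep, new_spte)) {
                        WRITE_ONCE(*sptep, old_spte);   /* undo the freeze on failure */
                        return -EBUSY;
                }

                WRITE_ONCE(*sptep, new_spte);           /* unfreeze with the final value */
                return 0;
        }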
On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote: > On Tue, Sep 10, 2024, Paolo Bonzini wrote: > No, because that defeates the purpose of having mmu_lock be a rwlock. But if this part of the TDX module is wrapped in a single big try_lock, there's no difference in spinning around busy seamcalls, or doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention in the same way. With respect to FROZEN_SPTE... > > This way we know that "busy" errors must come from the guest and have set > > HOST_PRIORITY. > > We should be able to achieve that without a VM-wide spinlock. My thought (from > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep > it set until the SEAMCALL completes. Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3, which documents that the TDX module returns TDX_OPERAND_BUSY on a CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough to prevent contention in the TDX module. If we want to be a bit more optimistic, let's do something more sophisticated, like only take the lock after the first busy reply. But the spinlock is the easiest way to completely remove host-induced TDX_OPERAND_BUSY, and only have to deal with guest-induced ones. > > It is still kinda bad that guests can force the VMM to loop, but the VMM can > > always say enough is enough. In other words, let's assume that a limit of > > 16 is probably appropriate but we can also increase the limit and crash the > > VM if things become ridiculous. > > 2 :-) > > One try that guarantees no other host task is accessing the S-EPT entry, and a > second try after blasting IPI to kick vCPUs to ensure no guest-side task has > locked the S-EPT entry. Fair enough. Though in principle it is possible to race and have the vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call. So I would make it 5 or so just to be safe. > My concern with an arbitrary retry loop is that we'll essentially propagate the > TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so > retrying even ~20 times could exceed the system's tolerances for scheduling, RCU, > etc... How slow are the failed ones? The number of retries is essentially the cost of successful seamcall / cost of busy seamcall. If HOST_PRIORITY works, even a not-small-but-not-huge number of retries would be better than the IPIs. IPIs are not cheap either. > > For zero step detection, my reading is that it's TDH.VP.ENTER that fails; > > not any of the MEM seamcalls. For that one to be resolved, it should be > > enough to do take and release the mmu_lock back to back, which ensures that > > all pending critical sections have completed (that is, > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then > > loop. Adding a vCPU stat for that one is a good idea, too. > > As above and in my discussion with Rick, I would prefer to kick vCPUs to force > forward progress, especially for the zero-step case. If KVM gets to the point > where it has retried TDH.VP.ENTER on the same fault so many times that zero-step > kicks in, then it's time to kick and wait, not keep retrying blindly. Wait, zero-step detection should _not_ affect TDH.MEM latency. Only TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we can do write_lock/write_unlock() in the vCPU entry path. My issue is that, even if we could make it a bit better by looking at the TDX module source code, we don't have enough information to make a good choice. 
For now we should start with something _easy_, even if it may not be the greatest. Paolo
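For the zero-step half of the suggestion, a sketch of the "bounce mmu_lock and re-enter" handling in the TDH.VP.ENTER exit path might look like the following, reusing the TDX_ERROR_SEPT_BUSY check and EXIT_FASTPATH_REENTER_GUEST return that already appear in the quoted patch; the function name is assumed and a vCPU stat could be bumped where the comment indicates:

        /*
         * Sketch only: if TDH.VP.ENTER failed because the S-EPT lock was
         * contended (i.e. zero-step mitigation is active), drain any in-flight
         * S-EPT operations by bouncing mmu_lock, then re-enter the guest.
         */
        static fastpath_t tdx_exit_handlers_fastpath(struct kvm_vcpu *vcpu, u64 vp_enter_ret)
        {
                if (unlikely(vp_enter_ret == TDX_ERROR_SEPT_BUSY)) {
                        /* Wait for every pending S-EPT walk/update to finish. */
                        write_lock(&vcpu->kvm->mmu_lock);
                        write_unlock(&vcpu->kvm->mmu_lock);

                        /* A vCPU stat could be incremented here, as suggested above. */
                        return EXIT_FASTPATH_REENTER_GUEST;
                }

                return EXIT_FASTPATH_NONE;
        }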
On Tue, Sep 10, 2024, Paolo Bonzini wrote: > On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Sep 10, 2024, Paolo Bonzini wrote: > > No, because that defeates the purpose of having mmu_lock be a rwlock. > > But if this part of the TDX module is wrapped in a single big > try_lock, there's no difference in spinning around busy seamcalls, or > doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention > in the same way. With respect to FROZEN_SPTE... > > > > This way we know that "busy" errors must come from the guest and have set > > > HOST_PRIORITY. > > > > We should be able to achieve that without a VM-wide spinlock. My thought (from > > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep > > it set until the SEAMCALL completes. > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3, > which documents that the TDX module returns TDX_OPERAND_BUSY on a > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough > to prevent contention in the TDX module. Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read mode. So for the operations that KVM can do in parallel, the locking should effectively be per-entry. Because KVM will never throw away an entire S-EPT root, zapping SPTEs will need to be done while holding mmu_lock for write, i.e. KVM shouldn't have problems with host tasks competing for the TDX module's VM-wide lock. > If we want to be a bit more optimistic, let's do something more > sophisticated, like only take the lock after the first busy reply. But > the spinlock is the easiest way to completely remove host-induced > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones. I am not convinced that's necessary or a good idea. I worry that doing so would just kick the can down the road, and potentially make the problems harder to solve, e.g. because we'd have to worry about regressing existing setups. > > > It is still kinda bad that guests can force the VMM to loop, but the VMM can > > > always say enough is enough. In other words, let's assume that a limit of > > > 16 is probably appropriate but we can also increase the limit and crash the > > > VM if things become ridiculous. > > > > 2 :-) > > > > One try that guarantees no other host task is accessing the S-EPT entry, and a > > second try after blasting IPI to kick vCPUs to ensure no guest-side task has > > locked the S-EPT entry. > > Fair enough. Though in principle it is possible to race and have the > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call. My limit of '2' is predicated on the lock being a "host priority" lock, i.e. that kicking vCPUs would ensure the lock has been dropped and can't be re-acquired by the guest. > So I would make it 5 or so just to be safe. > > > My concern with an arbitrary retry loop is that we'll essentially propagate the > > TDX module issues to the broader kernel. Each of those SEAMCALLs is slooow, so > > retrying even ~20 times could exceed the system's tolerances for scheduling, RCU, > > etc... > > How slow are the failed ones? The number of retries is essentially the > cost of successful seamcall / cost of busy seamcall. I haven't measured, but would be surprised if it's less than 2000 cycles. > If HOST_PRIORITY works, even a not-small-but-not-huge number of > retries would be better than the IPIs. IPIs are not cheap either. 
Agreed, but we also need to account for the operations that are conflicting. E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy waiting for the to-be-zapped S-EPT entry to be available doesn't make much sense. > > > For zero step detection, my reading is that it's TDH.VP.ENTER that fails; > > > not any of the MEM seamcalls. For that one to be resolved, it should be > > > enough to do take and release the mmu_lock back to back, which ensures that > > > all pending critical sections have completed (that is, > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then > > > loop. Adding a vCPU stat for that one is a good idea, too. > > > > As above and in my discussion with Rick, I would prefer to kick vCPUs to force > > forward progress, especially for the zero-step case. If KVM gets to the point > > where it has retried TDH.VP.ENTER on the same fault so many times that zero-step > > kicks in, then it's time to kick and wait, not keep retrying blindly. > > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only > TDH.VP.ENTER is delayed. Blocked, not delayed. Yes, it's TDH.VP.ENTER that "fails", but to get past TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to guarantee forward progress for TDH.MEM (or whatever the operations are called). Though I wonder, are there any operations guest/host operations that can conflict if the vCPU is faulting? Maybe this particular scenario is a complete non-issue. > If it is delayed to the point of failing, we can do write_lock/write_unlock() > in the vCPU entry path. I was thinking that KVM could set a flag (another synthetic error code bit?) to tell the page fault handler that it needs to kick vCPUs. But as above, it might be unnecessary. > My issue is that, even if we could make it a bit better by looking at > the TDX module source code, we don't have enough information to make a > good choice. For now we should start with something _easy_, even if > it may not be the greatest. I am not opposed to an easy/simple solution, but I am very much opposed to implementing a retry loop without understanding _exactly_ when and why it's needed.
On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote: > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3, > > which documents that the TDX module returns TDX_OPERAND_BUSY on a > > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough > > to prevent contention in the TDX module. > > Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM > lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read > mode. AUG does take other locks as exclusive: https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c I count 5 locks in total as well. I think trying to mirror the locking in KVM will be an uphill battle. > > So for the operations that KVM can do in parallel, the locking should > effectively > be per-entry. Because KVM will never throw away an entire S-EPT root, zapping > SPTEs will need to be done while holding mmu_lock for write, i.e. KVM > shouldn't > have problems with host tasks competing for the TDX module's VM-wide lock. > > > If we want to be a bit more optimistic, let's do something more > > sophisticated, like only take the lock after the first busy reply. But > > the spinlock is the easiest way to completely remove host-induced > > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones. > > I am not convinced that's necessary or a good idea. I worry that doing so > would > just kick the can down the road, and potentially make the problems harder to > solve, > e.g. because we'd have to worry about regressing existing setups. > > > > > It is still kinda bad that guests can force the VMM to loop, but the VMM > > > > can > > > > always say enough is enough. In other words, let's assume that a limit > > > > of > > > > 16 is probably appropriate but we can also increase the limit and crash > > > > the > > > > VM if things become ridiculous. > > > > > > 2 :-) > > > > > > One try that guarantees no other host task is accessing the S-EPT entry, > > > and a > > > second try after blasting IPI to kick vCPUs to ensure no guest-side task > > > has > > > locked the S-EPT entry. > > > > Fair enough. Though in principle it is possible to race and have the > > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call. > > My limit of '2' is predicated on the lock being a "host priority" lock, i.e. > that > kicking vCPUs would ensure the lock has been dropped and can't be re-acquired > by > the guest. So kicking would be to try to break loose any deadlock we encountered? It sounds like the kind of kludge that could be hard to remove. > > > So I would make it 5 or so just to be safe. > > > > > My concern with an arbitrary retry loop is that we'll essentially > > > propagate the > > > TDX module issues to the broader kernel. Each of those SEAMCALLs is > > > slooow, so > > > retrying even ~20 times could exceed the system's tolerances for > > > scheduling, RCU, > > > etc... > > > > How slow are the failed ones? The number of retries is essentially the > > cost of successful seamcall / cost of busy seamcall. > > I haven't measured, but would be surprised if it's less than 2000 cycles. > > > If HOST_PRIORITY works, even a not-small-but-not-huge number of > > retries would be better than the IPIs. IPIs are not cheap either. > > Agreed, but we also need to account for the operations that are conflicting. > E.g. 
if KVM is trying to zap a S-EPT that the guest is accessing, then busy > waiting > for the to-be-zapped S-EPT entry to be available doesn't make much sense. > > > > > For zero step detection, my reading is that it's TDH.VP.ENTER that > > > > fails; > > > > not any of the MEM seamcalls. For that one to be resolved, it should be > > > > enough to do take and release the mmu_lock back to back, which ensures > > > > that > > > > all pending critical sections have completed (that is, > > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);"). And then > > > > loop. Adding a vCPU stat for that one is a good idea, too. > > > > > > As above and in my discussion with Rick, I would prefer to kick vCPUs to > > > force > > > forward progress, especially for the zero-step case. If KVM gets to the > > > point > > > where it has retried TDH.VP.ENTER on the same fault so many times that > > > zero-step > > > kicks in, then it's time to kick and wait, not keep retrying blindly. > > > > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only > > TDH.VP.ENTER is delayed. > > Blocked, not delayed. Yes, it's TDH.VP.ENTER that "fails", but to get past > TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to > guarantee > forward progress for TDH.MEM (or whatever the operations are called). > > Though I wonder, are there any operations guest/host operations that can > conflict > if the vCPU is faulting? Maybe this particular scenario is a complete non- > issue. > > > If it is delayed to the point of failing, we can do > > write_lock/write_unlock() > > in the vCPU entry path. > > I was thinking that KVM could set a flag (another synthetic error code bit?) > to > tell the page fault handler that it needs to kick vCPUs. But as above, it > might > be unnecessary. > > > My issue is that, even if we could make it a bit better by looking at > > the TDX module source code, we don't have enough information to make a > > good choice. For now we should start with something _easy_, even if > > it may not be the greatest. > > I am not opposed to an easy/simple solution, but I am very much opposed to > implementing a retry loop without understanding _exactly_ when and why it's > needed. I'd like to explore letting KVM do the retries (i.e. EPT fault loop) a bit more. We can verify that we can survive zero-step in this case. After all, zero-step doesn't kill the TD, just generates an EPT violation exit. So we would just need to verify that the EPT violation getting generated would result in KVM eventually fixing whatever zero-step is requiring. Then we would have to handle BUSY in each SEAMCALL call chain, which currently we don't. Like the zapping case. If we ended up needing a retry loop for limited cases like that, at least it would be more limited.
On Tue, Sep 10, 2024, Rick P Edgecombe wrote: > On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote: > > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3, > > > which documents that the TDX module returns TDX_OPERAND_BUSY on a > > > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough > > > to prevent contention in the TDX module. > > > > Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM > > lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read > > mode. > > AUG does take other locks as exclusive: > https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c Only a lock on the underlying physical page. guest_memfd should prevent mapping the same HPA into multiple GPAs, and FROZEN_SPTE should prevent two vCPUs from concurrently AUGing the same GPA+HPA. > I count 5 locks in total as well. I think trying to mirror the locking in KVM > will be an uphill battle. I don't want to mirror the locking, I want to understand and document the expectations and rules. "Throw 16 noodles and hope one sticks" is not a recipe for success. > > So for the operations that KVM can do in parallel, the locking should > > effectively > > be per-entry. Because KVM will never throw away an entire S-EPT root, zapping > > SPTEs will need to be done while holding mmu_lock for write, i.e. KVM > > shouldn't > > have problems with host tasks competing for the TDX module's VM-wide lock. > > > > > If we want to be a bit more optimistic, let's do something more > > > sophisticated, like only take the lock after the first busy reply. But > > > the spinlock is the easiest way to completely remove host-induced > > > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones. > > > > I am not convinced that's necessary or a good idea. I worry that doing so > > would > > just kick the can down the road, and potentially make the problems harder to > > solve, > > e.g. because we'd have to worry about regressing existing setups. > > > > > > > It is still kinda bad that guests can force the VMM to loop, but the VMM > > > > > can > > > > > always say enough is enough. In other words, let's assume that a limit > > > > > of > > > > > 16 is probably appropriate but we can also increase the limit and crash > > > > > the > > > > > VM if things become ridiculous. > > > > > > > > 2 :-) > > > > > > > > One try that guarantees no other host task is accessing the S-EPT entry, > > > > and a > > > > second try after blasting IPI to kick vCPUs to ensure no guest-side task > > > > has > > > > locked the S-EPT entry. > > > > > > Fair enough. Though in principle it is possible to race and have the > > > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call. > > > > My limit of '2' is predicated on the lock being a "host priority" lock, > > i.e. that kicking vCPUs would ensure the lock has been dropped and can't > > be re-acquired by the guest. > > So kicking would be to try to break loose any deadlock we encountered? It sounds > like the kind of kludge that could be hard to remove. No, the intent of the kick would be to wait for vCPUs to exit, which in turn guarantees that any locks held by vCPUs have been dropped. Again, this idea is predicated on the lock being "host priority", i.e. that vCPUs can't re-take the lock before KVM.
>> Host-Side (SEAMCALL) Operation >> ------------------------------ >> The host VMM is expected to retry host-side operations that fail with a >> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at >> most after a limited time (the longest guest-side TDX module flow) there will be >> no contention with a guest TD attempting to acquire access to the same resource. >> >> Lock operations process the HOST_PRIORITY bit as follows: >> - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s >> HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is >> the host VMM’s responsibility to re-attempt the SEAMCALL function until is >> succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD >> from acquiring the lock. >> - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the >> lock’s HOST_PRIORITY bit. > > *sigh* > >> Guest-Side (TDCALL) Operation >> ----------------------------- >> A TDCALL (guest-side) function that attempt to acquire a lock fails if >> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest. >> The guest is expected to retry the operation. >> >> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on >> the host-side latency for that lock; once a lock is acquired, the flow either >> releases within a fixed upper time bound, or periodically monitor the >> HOST_PRIORITY flag to see if the host is attempting to acquire the lock. >> " >> >> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is >> involved in sorting out contention between the guest as well. We need to double >> check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the >> functionality we need to exercise for base support. >> >> The thing that makes me nervous about retry based solution is the potential for >> some kind deadlock like pattern. Just to gather your opinion, if there was some >> SEAMCALL contention that couldn't be locked around from KVM, but came with some >> strong well described guarantees, would a retry loop be hard NAK still? > > I don't know. It would depend on what operations can hit BUSY, and what the > alternatives are. E.g. if we can narrow down the retry paths to a few select > cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of > deadlock, then maybe that's the least awful option. > > What I don't think KVM should do is blindly retry N number of times, because > then there are effectively no rules whatsoever. E.g. if KVM is tearing down a > VM then KVM should assert on immediate success. And if KVM is handling a fault > on behalf of a vCPU, then KVM can and should resume the guest and let it retry. > Ugh, but that would likely trigger the annoying "zero-step mitigation" crap. > > What does this actually mean in practice? What's the threshold, FWIW, the limit in the public TDX module code is 6: #define STEPPING_EPF_THRESHOLD 6 // Threshold of confidence in detecting EPT fault-based stepping in progress We might be able to change it to a larger value though but we need to understand why it is necessary. > is the VM-Enter > error uniquely identifiable, When zero-step mitigation is active in the module, TDH.VP.ENTER tries to grab the SEPT lock thus it can fail with SEPT BUSY error. But if it does grab the lock successfully, it exits to VMM with EPT violation on that GPA immediately. 
In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step mitigation" must have been active. A normal EPT violation _COULD_ mean mitigation is already active, but AFAICT we don't have a way to tell that in the EPT violation. > and can KVM rely on HOST_PRIORITY to be set if KVM > runs afoul of the zero-step mitigation? I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY. > > After a pre-determined number of such EPT violations occur on the same instruction, > the TDX module starts tracking the GPAs that caused Secure EPT faults and fails > further host VMM attempts to enter the TD VCPU unless previously faulting private > GPAs are properly mapped in the Secure EPT. > > If HOST_PRIORITY is set, then one idea would be to resume the guest if there's > SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered, > kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and > can't be re-locked by the guest. That would allow KVM to guarantee forward > progress without an arbitrary retry loop in the TDP MMU. I think this should work. It doesn't seem we can tell whether the zero step mitigation is active in EPT violation TDEXIT, or when SEPT SEAMCALL fails with SEPT BUSY. But when any SEPT SEAMCALL fails with SEPT BUSY, if we just kick all vCPUs and make them wait until the next retry is done (which must be successful otherwise it is illegal error), then this should handle both contention from guest and the zero-step mitigation. > > Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the > one and only retry is guaranteed to succeed. Yeah seems so.
On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote: > > is the VM-Enter > > error uniquely identifiable, > > When zero-step mitigation is active in the module, TDH.VP.ENTER tries to > grab the SEPT lock thus it can fail with SEPT BUSY error. But if it > does grab the lock successfully, it exits to VMM with EPT violation on > that GPA immediately. > > In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step > mitigation" must have been active. I think this isn't true. A sept locking related busy, maybe. But there are other things going on that return BUSY. > A normal EPT violation _COULD_ mean > mitigation is already active, but AFAICT we don't have a way to tell > that in the EPT violation. > > > and can KVM rely on HOST_PRIORITY to be set if KVM > > runs afoul of the zero-step mitigation? > > I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY. What led you to think this? It seemed more limited to me.
On 11/09/2024 2:48 pm, Edgecombe, Rick P wrote: > On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote: >>> is the VM-Enter >>> error uniquely identifiable, >> >> When zero-step mitigation is active in the module, TDH.VP.ENTER tries to >> grab the SEPT lock thus it can fail with SEPT BUSY error. But if it >> does grab the lock successfully, it exits to VMM with EPT violation on >> that GPA immediately. >> >> In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step >> mitigation" must have been active. > > I think this isn't true. A sept locking related busy, maybe. But there are other > things going on that return BUSY. I thought we are talking about SEPT locking here. For BUSY in general yeah it tries to grab other locks too (e.g., share lock of TDR/TDCS/TDVPS etc) but those are impossible to contend in the current KVM TDX implementation I suppose? Perhaps we need to look more to make sure. > >> A normal EPT violation _COULD_ mean >> mitigation is already active, but AFAICT we don't have a way to tell >> that in the EPT violation. >> >>> and can KVM rely on HOST_PRIORITY to be set if KVM >>> runs afoul of the zero-step mitigation? >> >> I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY. > > What led you to think this? It seemed more limited to me. I interpreted from the spec (chapter 18.1.4 Concurrency Restrictions with Host Priority). But looking at the module public code, it seems only when the lock can be contended from the guest the HOST_PRIORITY will be set when host fails to grab the lock (see acquire_sharex_lock_hp_ex() and acquire_sharex_lock_hp_sh()), which makes sense anyway.
This is a lock status report of TDX module for current SEAMCALL retry issue based on code in TDX module public repo https://github.com/intel/tdx-module.git branch TDX_1.5.05. TL;DR: - tdh_mem_track() can contend with tdh_vp_enter(). - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. - tdg_mem_page_accept() can contend with other tdh_mem*(). Proposal: - Return -EAGAIN directly in ops link_external_spt/set_external_spte when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. - Kick off vCPUs at the beginning of page removal path, i.e. before the tdh_mem_range_block(). Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. (one possible optimization: since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, do not kick off vCPUs in normal conditions. When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow TD enter until page removal completes.) Below are detailed analysis: === Background === In TDX module, there are 4 kinds of locks: 1. sharex_lock: Normal read/write lock. (no host priority stuff) 2. sharex_hp_lock: Just like normal read/write lock, except that host can set host priority bit on failure. when guest tries to acquire the lock and sees host priority bit set, it will return "busy host priority" directly, letting host win. After host acquires the lock successfully, host priority bit is cleared. 3. sept entry lock: Lock utilizing software bits in SEPT entry. HP bit (Host priority): bit 52 EL bit (Entry lock): bit 11, used as a bit lock. - host sets HP bit when host fails to acquire EL bit lock; - host resets HP bit when host wins. - guest returns "busy host priority" if HP bit is found set when guest tries to acquire EL bit lock. 4. mutex lock: Lock with only 2 states: free, lock. (not the same as linux mutex, not re-scheduled, could pause() for debugging). 
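Purely to make the "sept entry lock" rules above concrete, here is a stand-alone toy model (not TDX module code; real entries obviously carry the mapping bits as well, and a lost CMPXCHG race is simplified to a plain busy):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SEPT_EL_BIT	11	/* EL: entry bit lock */
#define SEPT_HP_BIT	52	/* HP: host priority */

/* Host side: set HP when losing, clear HP after winning. */
static bool host_try_lock_sept_entry(_Atomic uint64_t *entry)
{
	uint64_t old = atomic_load(entry);

	if (old & (1ULL << SEPT_EL_BIT)) {
		atomic_fetch_or(entry, 1ULL << SEPT_HP_BIT);
		return false;			/* BUSY to the VMM */
	}
	if (!atomic_compare_exchange_strong(entry, &old,
					    old | (1ULL << SEPT_EL_BIT)))
		return false;			/* simplified: racing host just retries */
	atomic_fetch_and(entry, ~(1ULL << SEPT_HP_BIT));
	return true;
}

/* Guest side: HP set means the host is waiting, so back off immediately. */
static bool guest_try_lock_sept_entry(_Atomic uint64_t *entry)
{
	uint64_t old = atomic_load(entry);

	if (old & ((1ULL << SEPT_HP_BIT) | (1ULL << SEPT_EL_BIT)))
		return false;			/* busy; "host priority" if HP was set */
	return atomic_compare_exchange_strong(entry, &old,
					      old | (1ULL << SEPT_EL_BIT));
}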
===Resources & users list=== Resources SHARED users EXCLUSIVE users ------------------------------------------------------------------------ (1) TDR tdh_mng_rdwr tdh_mng_create tdh_vp_create tdh_mng_add_cx tdh_vp_addcx tdh_mng_init tdh_vp_init tdh_mng_vpflushdone tdh_vp_enter tdh_mng_key_config tdh_vp_flush tdh_mng_key_freeid tdh_vp_rd_wr tdh_mr_extend tdh_mem_sept_add tdh_mr_finalize tdh_mem_sept_remove tdh_vp_init_apicid tdh_mem_page_aug tdh_mem_page_add tdh_mem_page_remove tdh_mem_range_block tdh_mem_track tdh_mem_range_unblock tdh_phymem_page_reclaim ------------------------------------------------------------------------ (2) KOT tdh_phymem_cache_wb tdh_mng_create tdh_mng_vpflushdone tdh_mng_key_freeid ------------------------------------------------------------------------ (3) TDCS tdh_mng_rdwr tdh_vp_create tdh_vp_addcx tdh_vp_init tdh_vp_init_apicid tdh_vp_enter tdh_vp_rd_wr tdh_mem_sept_add tdh_mem_sept_remove tdh_mem_page_aug tdh_mem_page_remove tdh_mem_range_block tdh_mem_track tdh_mem_range_unblock ------------------------------------------------------------------------ (4) TDVPR tdh_vp_rd_wr tdh_vp_create tdh_vp_addcx tdh_vp_init tdh_vp_init_apicid tdh_vp_enter tdh_vp_flush ------------------------------------------------------------------------ (5) TDCS epoch tdh_vp_enter tdh_mem_track ------------------------------------------------------------------------ (6) secure_ept_lock tdh_mem_sept_add tdh_vp_enter tdh_mem_page_aug tdh_mem_sept_remove tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock ------------------------------------------------------------------------ (7) SEPT entry tdh_mem_sept_add tdh_mem_sept_remove tdh_mem_page_aug tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock tdg_mem_page_accept Current KVM interested SEAMCALLs: ------------------------------------------------------------------------ SEAMCALL Lock Name Lock Type Resource tdh_mng_create sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_add_cx sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock EXCLUSIVE page to add tdh_mng_init sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_mng_vpflushdone sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_key_config sharex_hp_lock EXCLUSIVE TDR tdh_mng_key_freeid sharex_hp_lock EXCLUSIVE TDR sharex_lock EXCLUSIVE KOT tdh_mng_rdwr sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_mr_extend sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_mr_finalize sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS tdh_vp_create sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_hp_lock EXCLUSIVE TDVPR tdh_vp_addcx sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_hp_lock EXCLUSIVE page to add tdh_vp_init sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_vp_init_apicid sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock SHARED TDCS tdh_vp_enter(*) sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS epoch_lock sharex_lock EXCLUSIVE TDCS secure_ept_lock tdh_vp_flush sharex_hp_lock EXCLUSIVE TDVPR sharex_hp_lock SHARED TDR tdh_vp_rd_wr sharex_hp_lock SHARED TDVPR sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS tdh_mem_sept_add sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to add tdh_mem_sept_remove sharex_hp_lock 
SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to remove tdh_mem_page_add sharex_hp_lock EXCLUSIVE TDR sharex_hp_lock NO_LOCK TDCS sharex_lock NO_LOCK TDCS secure_ept_lock sharex_hp_lock EXCLUSIVE page to add tdh_mem_page_aug sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to aug tdh_mem_page_remove sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock SHARED TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify sharex_hp_lock EXCLUSIVE page to remove tdh_mem_range_block sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify tdh_mem_track sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS epoch_lock tdh_mem_range_unblock sharex_hp_lock SHARED TDR sharex_hp_lock SHARED TDCS sharex_lock EXCLUSIVE TDCS secure_ept_lock sept entry lock HOST,EXCLUSIVE SEPT entry to modify tdh_phymem_page_reclaim sharex_hp_lock EXCLUSIVE page to reclaim sharex_hp_lock SHARED TDR tdh_phymem_cache_wb mutex_lock per package wbt_entries sharex_lock SHARED KOT tdh_phymem_page_wbinvd sharex_hp_lock SHARED page to be wbinvd Current KVM interested TDCALLs: ------------------------------------------------------------------------ tdg_mem_page_accept sept entry lock GUEST SEPT entry to modify TDCALLs like tdg_mr_rtmr_extend(), tdg_servtd_rd_wr(), tdg_mem_page_attr_wr() tdg_mem_page_attr_rd() are not included. *:(a) tdh_vp_enter() holds shared TDR lock and exclusive TDVPR lock, the two locks are released when exiting to VMM. (b) tdh_vp_enter() holds shared TDCS lock and shared TDCS epoch_lock lock, releases them before entering non-root mode. (c) tdh_vp_enter() holds shared epoch lock, contending with tdh_mem_track(). (d) tdh_vp_enter() only holds EXCLUSIVE secure_ept_lock when 0-stepping is suspected, i.e. when last_epf_gpa_list is not empty. When a EPT violation happens, TDX module checks if the guest RIP equals to the guest RIP of last TD entry. Only when this is true for 6 continuous times, the gpa will be recorded in last_epf_gpa_list. The list will be reset once guest RIP of a EPT violation and last TD enter RIP are unequal. === Summary === For the 8 kinds of common resources protected in TDX module: (1) TDR: There are only shared accesses to TDR during runtime (i.e. after TD is finalized and before TD tearing down), if we don't need to support calling tdh_vp_init_apicid() at runtime (e.g. for vCPU hotplug). tdh_vp_enter() holds shared TDR lock until exiting to VMM. TDCALLs do not acquire the TDR lock. (2) KOT (Key Ownership Table) Current KVM code should have avoided contention to this resource. (3) TDCS: Shared accessed or access with no lock when TDR is exclusively locked. Seamcalls in runtime (after TD finalized and before TD tearing down) do not contend with each other on TDCS. tdh_vp_enter() holds shared TDCS lock and releases it before entering non-root mode. Current TDCALLs for basic TDX do not acquire this lock. (4) TDVPR: Per-vCPU exclusive accessed except for tdh_vp_rd_wr(). tdh_vp_enter() holds exclusive TDVPR lock until exiting to VMM. TDCALLs do not acquire the TDVPR lock. (5) TDCS epoch: tdh_mem_track() requests exclusive access, and tdh_vp_enter() requests shared access. 
tdh_mem_track() can contend with tdh_vp_enter(). (6) SEPT tree: Protected by secure_ept_lock (sharex_lock). tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() take the lock shared; tdh_mem_sept_remove()/tdh_mem_range_block()/tdh_mem_range_unblock() take it exclusive. tdh_vp_enter() requests exclusive access when 0-stepping is suspected, contending with all other tdh_mem*(). The guest does not acquire this lock. So, KVM's mmu_lock already prevents contention between tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() and tdh_mem_sept_remove()/tdh_mem_range_block(). Though tdh_mem_sept_add()/tdh_mem_page_aug() can race with tdh_vp_enter(), returning -EAGAIN directly is fine for them. The remaining issue is the contention between tdh_vp_enter() and tdh_mem_page_remove()/tdh_mem_sept_remove()/tdh_mem_range_block(). (7) SEPT entry: All accesses are exclusive. tdg_mem_page_accept() may contend with other tdh_mem*() on a specific SEPT entry. (8) PAMT entry for target pages (e.g. page to add/aug/remove/reclaim/wbinvd): Though they are all exclusively locked, no contention should occur as long as the pages belong to different PAMT entries. Conclusion: Current KVM code already avoids contention on resources (1)-(4),(8), while (5),(6),(7) can still hit contention. - tdh_mem_track() can contend with tdh_vp_enter() for (5) - tdh_vp_enter() contends with tdh_mem*() for (6) when 0-stepping is suspected. - tdg_mem_page_accept() can contend with other tdh_mem*() for (7).
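A sketch of how the second proposal above (kick vCPUs around the removal path) might be wired up; KVM_REQ_TDX_SEPT_REMOVAL and the tdx_sept_*() helpers are hypothetical names, and error handling is elided:

static int tdx_remove_private_page(struct kvm *kvm, gfn_t gfn, int level)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;
	int r;

	/*
	 * A pending, unhandled request keeps vCPUs from re-entering the
	 * guest (the x86 masterclock update relies on the same trick), so
	 * no TDH.VP.ENTER or TDG.MEM.PAGE.ACCEPT can contend with the
	 * BLOCK/TRACK/REMOVE sequence below.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_SEPT_REMOVAL);

	r = tdx_sept_block(kvm, gfn, level);		/* TDH.MEM.RANGE.BLOCK */
	if (!r)
		r = tdx_track(kvm);			/* TDH.MEM.TRACK */
	if (!r)
		r = tdx_sept_remove(kvm, gfn, level);	/* TDH.MEM.PAGE.REMOVE */

	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_TDX_SEPT_REMOVAL, vcpu);

	return r;
}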
On Fri, Sep 13, 2024, Yan Zhao wrote: > This is a lock status report of TDX module for current SEAMCALL retry issue > based on code in TDX module public repo https://github.com/intel/tdx-module.git > branch TDX_1.5.05. > > TL;DR: > - tdh_mem_track() can contend with tdh_vp_enter(). > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. The zero-step logic seems to be the most problematic. E.g. if KVM is trying to install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for whatever reason. Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still hits the fault? For non-TDX, resuming the guest and letting the vCPU retry the instruction is desirable because in many cases, the winning task will install a valid mapping before KVM can re-run the vCPU, i.e. the fault will be fixed before the instruction is re-executed. In the happy case, that provides optimal performance as KVM doesn't introduce any extra delay/latency. But for TDX, the math is different as the cost of a re-hitting a fault is much, much higher, especially in light of the zero-step issues. E.g. if the TDP MMU returns a unique error code for the frozen case, and kvm_mmu_page_fault() is modified to return the raw return code instead of '1', then the TDX EPT violation path can safely retry locally, similar to the do-while loop in kvm_tdp_map_page(). The only part I don't like about this idea is having two "retry" return values, which creates the potential for bugs due to checking one but not the other. Hmm, that could be avoided by passing a bool pointer as an out-param to communicate to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that option better even though the out-param is a bit gross, because it makes it more obvious that the "frozen_spte" is a special case that doesn't need attention for most paths. > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > Proposal: > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when > tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()? Also tdh_mem_sept_add() is strictly pre-finalize, correct? I.e. should never contend with tdg_mem_page_accept() because vCPUs can't yet be run. Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()? The page isn't yet mapped, so why would the guest be allowed to take a lock on the S-EPT entry? > - Kick off vCPUs at the beginning of page removal path, i.e. before the > tdh_mem_range_block(). > Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS. > (one possible optimization: > since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, > do not kick off vCPUs in normal conditions. > When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow Which SEAMCALL is this specifically? tdh_mem_range_block()? > TD enter until page removal completes.) 
Idea #1: --- diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b45258285c9c..8113c17bd2f6 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, return -EINTR; cond_resched(); r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); - } while (r == RET_PF_RETRY); + } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FOZEN); if (r < 0) return r; @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_spurious++; if (r != RET_PF_EMULATE) - return 1; + return r; emulate: return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 8d3fb3c8c213..690f03d7daae 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * and of course kvm_mmu_do_page_fault(). * * RET_PF_CONTINUE: So far, so good, keep handling the page fault. + * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_RETRY: let CPU fault again on the address. + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen. + * Let the CPU fault again on the address, or retry the + * fault "locally", i.e. without re-entering the guest. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the * gfn and retry, or emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. - * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * * Any names added to this enum should be exported to userspace for use in @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which * will allow for efficient machine code when checking for CONTINUE, e.g. * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. + * + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code + * scheme, where 1==success, translates '1' to RET_PF_FIXED. 
*/ enum { RET_PF_CONTINUE = 0, + RET_PF_FIXED = 1, RET_PF_RETRY, + RET_PF_RETRY_FROZEN, RET_PF_EMULATE, RET_PF_WRITE_PROTECTED, RET_PF_INVALID, - RET_PF_FIXED, RET_PF_SPURIOUS, }; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5a475a6456d4..cbf9e46203f3 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) retry: rcu_read_unlock(); + if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)) + return RET_PF_RETRY_FOZEN; return ret; } --- Idea #2: --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/mmu/mmu.c | 12 ++++++------ arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++--- arch/x86/kvm/mmu/tdp_mmu.c | 1 + arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/vmx/vmx.c | 4 ++-- 6 files changed, 23 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 46e0a466d7fb..200fecd1de88 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr, int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, - void *insn, int insn_len); + void *insn, int insn_len, bool *frozen_spte); void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg); void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index b45258285c9c..207840a316d3 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) return; r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, - true, NULL, NULL); + true, NULL, NULL, NULL); /* * Account fixed page faults, otherwise they'll never be counted, but @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, trace_kvm_page_fault(vcpu, fault_address, error_code); r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn, - insn_len); + insn_len, NULL); } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) { vcpu->arch.apf.host_apf_flags = 0; local_irq_disable(); @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, if (signal_pending(current)) return -EINTR; cond_resched(); - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL); } while (r == RET_PF_RETRY); if (r < 0) @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, } int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, - void *insn, int insn_len) + void *insn, int insn_len, bool *frozen_spte) { int r, emulation_type = EMULTYPE_PF; bool direct = vcpu->arch.mmu->root_role.direct; @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_taken++; r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false, - &emulation_type, NULL); + &emulation_type, NULL, frozen_spte); if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm)) return -EIO; } @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err vcpu->stat.pf_spurious++; if (r != 
RET_PF_EMULATE) - return 1; + return r; emulate: return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index 8d3fb3c8c213..5b1fc77695c1 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -247,6 +247,9 @@ struct kvm_page_fault { * is changing its own translation in the guest page tables. */ bool write_fault_to_shadow_pgtable; + + /* Indicates the page fault needs to be retried due to a frozen SPTE. */ + bool frozen_spte; }; int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * and of course kvm_mmu_do_page_fault(). * * RET_PF_CONTINUE: So far, so good, keep handling the page fault. + * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_RETRY: let CPU fault again on the address. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the * gfn and retry, or emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. - * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * * Any names added to this enum should be exported to userspace for use in @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which * will allow for efficient machine code when checking for CONTINUE, e.g. * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. + * + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code + * scheme, where 1==success, translates '1' to RET_PF_FIXED. */ enum { RET_PF_CONTINUE = 0, + RET_PF_FIXED = 1, RET_PF_RETRY, RET_PF_EMULATE, RET_PF_WRITE_PROTECTED, RET_PF_INVALID, - RET_PF_FIXED, RET_PF_SPURIOUS, }; @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err, bool prefetch, - int *emulation_type, u8 *level) + int *emulation_type, u8 *level, + bool *frozen_spte) { struct kvm_page_fault fault = { .addr = cr2_or_gpa, @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, *emulation_type |= EMULTYPE_WRITE_PF_TO_SP; if (level) *level = fault.goal_level; + if (frozen_spte) + *frozen_spte = fault.frozen_spte; return r; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 5a475a6456d4..e7fc5ea4b437 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) retry: rcu_read_unlock(); + fault->frozen_spte = is_frozen_spte(iter.old_spte); return ret; } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 38723b0c435d..269de6a9eb13 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu) rc = kvm_mmu_page_fault(vcpu, fault_address, error_code, static_cpu_has(X86_FEATURE_DECODEASSISTS) ? 
svm->vmcb->control.insn_bytes : NULL, - svm->vmcb->control.insn_len); + svm->vmcb->control.insn_len, NULL); if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK) sev_handle_rmp_fault(vcpu, fault_address, error_code); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 368acfebd476..fc2ff5d91a71 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa))) return kvm_emulate_instruction(vcpu, 0); - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL); } static int handle_ept_misconfig(struct kvm_vcpu *vcpu) @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) return kvm_skip_emulated_instruction(vcpu); } - return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0); + return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL); } static int handle_nmi_window(struct kvm_vcpu *vcpu) base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe --
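To complete the picture, this is roughly how the TDX EPT-violation path could consume Idea #2's out-param and retry the fault locally instead of re-entering the guest; the handler shape and names are assumptions layered on top of the diff above, not code from the series:

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 error_code)
{
	bool frozen_spte;
	int r;

	do {
		frozen_spte = false;
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
		if (r != RET_PF_RETRY || !frozen_spte)
			break;

		/*
		 * Another task holds the SPTE; spin here rather than redoing
		 * TDH.VP.ENTER, so the vCPU never re-faults on the same RIP
		 * and never trips the zero-step mitigation.
		 */
		if (signal_pending(current))
			return -EINTR;
		cond_resched();
	} while (true);

	return r;	/* >0 resume guest, 0 exit to userspace, <0 error */
}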
On Fri, 2024-09-13 at 16:36 +0800, Yan Zhao wrote: > Thanks Yan, this is great! > This is a lock status report of TDX module for current SEAMCALL retry issue > based on code in TDX module public repo > https://github.com/intel/tdx-module.git > branch TDX_1.5.05. > > TL;DR: > - tdh_mem_track() can contend with tdh_vp_enter(). > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > Proposal: > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when > tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. Regarding the sept entry contention with the guest, I think KVM might not be guaranteed to retry the same path and clear the sept entry host priority bit. What if the first failure exited to userspace because of a pending signal or something? Then the vcpu could reenter the guest, handle an NMI and go off in another direction, never to trigger the EPT violation again. This would leave the SEPT entry locked to the guest. That is a convoluted scenario that could probably be considered a buggy guest, but what I am sort of pondering is that the retry solution that loop outside the fault handler guts will have more complex failure modes around the host priority bit. The N local retries solution really is a brown paper bag design, but the more proper looking solution actually has two downsides compared to it: 1. It is based on locking behavior that is not in the spec (yes we can work with TDX module folks to keep it workable) 2. Failure modes get complex I think I'm still onboard. Just trying to stress the design a bit. (BTW it looks like Linux guest doesn't actually retry accept on host priority busy, so they won't spin on it anyway. Probably any contention here would be a buggy guest for Linux TDs at least.) > - Kick off vCPUs at the beginning of page removal path, i.e. before the > tdh_mem_range_block(). > Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. > (one possible optimization: > since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, > do not kick off vCPUs in normal conditions. > When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow > TD enter until page removal completes.) > > Below are detailed analysis: > > === Background === > In TDX module, there are 4 kinds of locks: > 1. sharex_lock: > Normal read/write lock. (no host priority stuff) > > 2. sharex_hp_lock: > Just like normal read/write lock, except that host can set host priority > bit > on failure. > when guest tries to acquire the lock and sees host priority bit set, it > will > return "busy host priority" directly, letting host win. > After host acquires the lock successfully, host priority bit is cleared. > > 3. sept entry lock: > Lock utilizing software bits in SEPT entry. > HP bit (Host priority): bit 52 > EL bit (Entry lock): bit 11, used as a bit lock. > > - host sets HP bit when host fails to acquire EL bit lock; > - host resets HP bit when host wins. > - guest returns "busy host priority" if HP bit is found set when guest > tries > to acquire EL bit lock. > > 4. mutex lock: > Lock with only 2 states: free, lock. > (not the same as linux mutex, not re-scheduled, could pause() for > debugging). 
> > ===Resources & users list=== > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------------------ > (1) TDR tdh_mng_rdwr tdh_mng_create > tdh_vp_create tdh_mng_add_cx > tdh_vp_addcx tdh_mng_init > tdh_vp_init tdh_mng_vpflushdone > tdh_vp_enter tdh_mng_key_config > tdh_vp_flush tdh_mng_key_freeid > tdh_vp_rd_wr tdh_mr_extend > tdh_mem_sept_add tdh_mr_finalize > tdh_mem_sept_remove tdh_vp_init_apicid > tdh_mem_page_aug tdh_mem_page_add > tdh_mem_page_remove > tdh_mem_range_block > tdh_mem_track > tdh_mem_range_unblock > tdh_phymem_page_reclaim In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is an exclusive lock. But we can ignore it because we only do reclaim at TD tear down time? Separately, I wonder if we should try to add this info as comments around the SEAMCALL implementations. The locking is not part of the spec, but never-the- less the kernel is being coded against these assumptions. So it can sort of be like "the kernel assumes this" and we can at least record what the reason was. Or maybe just comment the parts that KVM assumes.
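As a concrete example of the kind of comment being proposed, something like the below could sit above the TDH.MEM.PAGE.AUG wrapper in tdx_ops.h; the lock list comes from the analysis earlier in this thread, not from the spec, which is exactly why recording it (and the module version it was checked against) matters:

/*
 * Locks taken by TDH.MEM.PAGE.AUG (observed in TDX module 1.5.05 sources,
 * not architectural -- re-check when the module is updated):
 *
 *   TDR               shared
 *   TDCS              shared
 *   secure_ept_lock   shared
 *   SEPT entry        exclusive, host priority
 *   page being AUGed  exclusive
 *
 * KVM assumes mmu_lock plus FROZEN_SPTE prevent host-vs-host contention on
 * the SEPT entry, so the remaining TDX_OPERAND_BUSY sources are
 * TDG.MEM.PAGE.ACCEPT from the guest and TDH.VP.ENTER while the zero-step
 * mitigation is active.
 */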
On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote: > > TL;DR: > > - tdh_mem_track() can contend with tdh_vp_enter(). > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying > to > install a page on behalf of two vCPUs, and KVM resumes the guest if it > encounters > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > whatever reason. Can you explain more about what the concern is here? That the zero-step mitigation activation will be a drag on the TD because of extra contention with the TDH.MEM calls? > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM > retries > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU > still > hits the fault? It seems like an optimization. To me, I would normally want to know how much it helped before adding it. But if you think it's an obvious win I'll defer. > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > desirable because in many cases, the winning task will install a valid mapping > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > instruction is re-executed. In the happy case, that provides optimal > performance > as KVM doesn't introduce any extra delay/latency. > > But for TDX, the math is different as the cost of a re-hitting a fault is > much, > much higher, especially in light of the zero-step issues. > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > then the TDX EPT violation path can safely retry locally, similar to the do- > while > loop in kvm_tdp_map_page(). > > The only part I don't like about this idea is having two "retry" return > values, > which creates the potential for bugs due to checking one but not the other. > > Hmm, that could be avoided by passing a bool pointer as an out-param to > communicate > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > option better even though the out-param is a bit gross, because it makes it > more > obvious that the "frozen_spte" is a special case that doesn't need attention > for > most paths.
On Fri, Sep 13, 2024, Rick P Edgecombe wrote: > On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote: > > > TL;DR: > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying > > to I am getting a feeling of deja vu. Please fix your mail client to not generate newlines in the middle of quoted text. > > install a page on behalf of two vCPUs, and KVM resumes the guest if it > > encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the > > vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and > > gets delayed for whatever reason. > > Can you explain more about what the concern is here? That the zero-step > mitigation activation will be a drag on the TD because of extra contention with > the TDH.MEM calls? > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow > > slow-path, what if instead of resuming the guest if a page fault hits > > FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing > > tdh_vp_enter() to see if the vCPU still hits the fault? > > It seems like an optimization. To me, I would normally want to know how much it > helped before adding it. But if you think it's an obvious win I'll defer. I'm not worried about any performance hit with zero-step, I'm worried about KVM not being able to differentiate between a KVM bug and guest interference. The goal with a local retry is to make it so that KVM _never_ triggers zero-step, unless there is a bug somewhere. At that point, if zero-step fires, KVM can report the error to userspace instead of trying to suppress guest activity, and potentially from other KVM tasks too. It might even be simpler overall too. E.g. report status up the call chain and let the top-level TDX S-EPT handler to do its thing, versus adding various flags and control knobs to ensure a vCPU can make forward progress.
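If KVM goes that route, the TD-exit side could be as blunt as the sketch below; tdx_vp_enter_sept_busy() is a placeholder for however the series identifies "TDH.VP.ENTER failed due to S-EPT contention", which per this thread implies the zero-step mitigation has kicked in:

static int tdx_check_vp_enter_status(struct kvm_vcpu *vcpu, u64 vp_enter_ret)
{
	/*
	 * If local retries guarantee forward progress on S-EPT faults, this
	 * should be unreachable; surface it as a KVM bug rather than
	 * silently re-entering the guest.
	 */
	if (KVM_BUG_ON(tdx_vp_enter_sept_busy(vp_enter_ret), vcpu->kvm))
		return -EIO;

	return 1;	/* resume the guest */
}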
On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > On Fri, Sep 13, 2024, Yan Zhao wrote: > > This is a lock status report of TDX module for current SEAMCALL retry issue > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > branch TDX_1.5.05. > > > > TL;DR: > > - tdh_mem_track() can contend with tdh_vp_enter(). > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > whatever reason. > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > hits the fault? > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > desirable because in many cases, the winning task will install a valid mapping > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > instruction is re-executed. In the happy case, that provides optimal performance > as KVM doesn't introduce any extra delay/latency. > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > much higher, especially in light of the zero-step issues. > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > then the TDX EPT violation path can safely retry locally, similar to the do-while > loop in kvm_tdp_map_page(). > > The only part I don't like about this idea is having two "retry" return values, > which creates the potential for bugs due to checking one but not the other. > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > option better even though the out-param is a bit gross, because it makes it more > obvious that the "frozen_spte" is a special case that doesn't need attention for > most paths. Good idea. But could we extend it a bit more to allow TDX's EPT violation handler to also retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? > > > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > > > Proposal: > > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when > > tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY. > What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()? Sorry, I meant -EBUSY originally. With the current code in kvm_tdp_map_page(), vCPU should just retry without tdh_vp_enter() except when there're signals pending. With a real EPT violation, tdh_vp_enter() should be called again. I realized that this is not good enough. So, is it better to return -EAGAIN in ops link_external_spt/set_external_spte and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN for -EAGAIN? (or maybe some other name for RET_PF_RETRY_FROZEN). > Also tdh_mem_sept_add() is strictly pre-finalize, correct? I.e. should never > contend with tdg_mem_page_accept() because vCPUs can't yet be run. tdh_mem_page_add() is pre-finalize, tdh_mem_sept_add() is not. tdh_mem_sept_add() can be called runtime by tdp_mmu_link_sp(). 
> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()? > The page isn't yet mapped, so why would the guest be allowed to take a lock on > the S-EPT entry? Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second tdh_mem_page_aug() is called on the same gpa, the second one may contend with tdg_mem_page_accept(). But given KVM does not allow the second tdh_mem_page_aug(), looks the contention between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen. > > > - Kick off vCPUs at the beginning of page removal path, i.e. before the > > tdh_mem_range_block(). > > Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done. > > This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS. Great! > > > (one possible optimization: > > since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare, > > do not kick off vCPUs in normal conditions. > > When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow > > Which SEAMCALL is this specifically? tdh_mem_range_block()? Yes, they are - tdh_mem_range_block() contends with tdh_vp_enter() for secure_ept_lock. - tdh_mem_track() contends with tdh_vp_enter() for TD epoch. (current code in MMU part 2 just retry tdh_mem_track() endlessly), - tdh_mem_page_remove()/tdh_mem_range_block() contends with tdg_mem_page_accept() for SEPT entry lock. (this one should not happen on a sane guest). Resources SHARED users EXCLUSIVE users ------------------------------------------------------------------------ (5) TDCS epoch tdh_vp_enter tdh_mem_track ------------------------------------------------------------------------ (6) secure_ept_lock tdh_mem_sept_add tdh_vp_enter tdh_mem_page_aug tdh_mem_sept_remove tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock ------------------------------------------------------------------------ (7) SEPT entry tdh_mem_sept_add tdh_mem_sept_remove tdh_mem_page_aug tdh_mem_page_remove tdh_mem_range_block tdh_mem_range_unblock tdg_mem_page_accept > > > TD enter until page removal completes.) > > > Idea #1: > --- > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index b45258285c9c..8113c17bd2f6 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, > return -EINTR; > cond_resched(); > r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); > - } while (r == RET_PF_RETRY); > + } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FOZEN); > > if (r < 0) > return r; > @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err > vcpu->stat.pf_spurious++; > > if (r != RET_PF_EMULATE) > - return 1; > + return r; > > emulate: > return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 8d3fb3c8c213..690f03d7daae 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > * and of course kvm_mmu_do_page_fault(). > * > * RET_PF_CONTINUE: So far, so good, keep handling the page fault. > + * RET_PF_FIXED: The faulting entry has been fixed. > * RET_PF_RETRY: let CPU fault again on the address. > + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen. 
> + * Let the CPU fault again on the address, or retry the > + * fault "locally", i.e. without re-entering the guest. > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. > * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the > * gfn and retry, or emulate the instruction directly. > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. > - * RET_PF_FIXED: The faulting entry has been fixed. > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. > * > * Any names added to this enum should be exported to userspace for use in > @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which > * will allow for efficient machine code when checking for CONTINUE, e.g. > * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. > + * > + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code > + * scheme, where 1==success, translates '1' to RET_PF_FIXED. > */ Looks "r > 0" represents success in vcpu_run()? So, moving RET_PF_FIXED to 1 is not necessary? > enum { > RET_PF_CONTINUE = 0, > + RET_PF_FIXED = 1, > RET_PF_RETRY, > + RET_PF_RETRY_FROZEN, > RET_PF_EMULATE, > RET_PF_WRITE_PROTECTED, > RET_PF_INVALID, > - RET_PF_FIXED, > RET_PF_SPURIOUS, > }; > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 5a475a6456d4..cbf9e46203f3 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > retry: > rcu_read_unlock(); > + if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)) > + return RET_PF_RETRY_FOZEN; > return ret; > } > > --- > > > Idea #2: > --- > arch/x86/include/asm/kvm_host.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 12 ++++++------ > arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++--- > arch/x86/kvm/mmu/tdp_mmu.c | 1 + > arch/x86/kvm/svm/svm.c | 2 +- > arch/x86/kvm/vmx/vmx.c | 4 ++-- > 6 files changed, 23 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 46e0a466d7fb..200fecd1de88 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr, > int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); > > int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, > - void *insn, int insn_len); > + void *insn, int insn_len, bool *frozen_spte); > void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg); > void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva); > void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index b45258285c9c..207840a316d3 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > return; > > r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code, > - true, NULL, NULL); > + true, NULL, NULL, NULL); > > /* > * Account fixed page faults, otherwise they'll never be counted, but > @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > trace_kvm_page_fault(vcpu, fault_address, error_code); > > r = kvm_mmu_page_fault(vcpu, fault_address, 
error_code, insn, > - insn_len); > + insn_len, NULL); > } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) { > vcpu->arch.apf.host_apf_flags = 0; > local_irq_disable(); > @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, > if (signal_pending(current)) > return -EINTR; > cond_resched(); > - r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level); > + r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL); > } while (r == RET_PF_RETRY); > > if (r < 0) > @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > } > > int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code, > - void *insn, int insn_len) > + void *insn, int insn_len, bool *frozen_spte) > { > int r, emulation_type = EMULTYPE_PF; > bool direct = vcpu->arch.mmu->root_role.direct; > @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err > vcpu->stat.pf_taken++; > > r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false, > - &emulation_type, NULL); > + &emulation_type, NULL, frozen_spte); > if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm)) > return -EIO; > } > @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err > vcpu->stat.pf_spurious++; > > if (r != RET_PF_EMULATE) > - return 1; > + return r; > > emulate: > return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn, > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > index 8d3fb3c8c213..5b1fc77695c1 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -247,6 +247,9 @@ struct kvm_page_fault { > * is changing its own translation in the guest page tables. > */ > bool write_fault_to_shadow_pgtable; > + > + /* Indicates the page fault needs to be retried due to a frozen SPTE. */ > + bool frozen_spte; > }; > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > * and of course kvm_mmu_do_page_fault(). > * > * RET_PF_CONTINUE: So far, so good, keep handling the page fault. > + * RET_PF_FIXED: The faulting entry has been fixed. > * RET_PF_RETRY: let CPU fault again on the address. > * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. > * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the > * gfn and retry, or emulate the instruction directly. > * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. > - * RET_PF_FIXED: The faulting entry has been fixed. > * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. > * > * Any names added to this enum should be exported to userspace for use in > @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which > * will allow for efficient machine code when checking for CONTINUE, e.g. > * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero. > + * > + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code > + * scheme, where 1==success, translates '1' to RET_PF_FIXED. 
> */ > enum { > RET_PF_CONTINUE = 0, > + RET_PF_FIXED = 1, > RET_PF_RETRY, > RET_PF_EMULATE, > RET_PF_WRITE_PROTECTED, > RET_PF_INVALID, > - RET_PF_FIXED, > RET_PF_SPURIOUS, > }; > > @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, > > static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > u64 err, bool prefetch, > - int *emulation_type, u8 *level) > + int *emulation_type, u8 *level, > + bool *frozen_spte) > { > struct kvm_page_fault fault = { > .addr = cr2_or_gpa, > @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > *emulation_type |= EMULTYPE_WRITE_PF_TO_SP; > if (level) > *level = fault.goal_level; > + if (frozen_spte) > + *frozen_spte = fault.frozen_spte; > > return r; > } > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 5a475a6456d4..e7fc5ea4b437 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > retry: > rcu_read_unlock(); > + fault->frozen_spte = is_frozen_spte(iter.old_spte); > return ret; > } > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index 38723b0c435d..269de6a9eb13 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu) > rc = kvm_mmu_page_fault(vcpu, fault_address, error_code, > static_cpu_has(X86_FEATURE_DECODEASSISTS) ? > svm->vmcb->control.insn_bytes : NULL, > - svm->vmcb->control.insn_len); > + svm->vmcb->control.insn_len, NULL); > > if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK) > sev_handle_rmp_fault(vcpu, fault_address, error_code); > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 368acfebd476..fc2ff5d91a71 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) > if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa))) > return kvm_emulate_instruction(vcpu, 0); > > - return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0); > + return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL); > } > > static int handle_ept_misconfig(struct kvm_vcpu *vcpu) > @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu) > return kvm_skip_emulated_instruction(vcpu); > } > > - return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0); > + return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL); > } > > static int handle_nmi_window(struct kvm_vcpu *vcpu) > > base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe > -- >
> > ===Resources & users list===
> >
> > Resources        SHARED users              EXCLUSIVE users
> > ------------------------------------------------------------------------
> > (1) TDR          tdh_mng_rdwr              tdh_mng_create
> >                  tdh_vp_create             tdh_mng_add_cx
> >                  tdh_vp_addcx              tdh_mng_init
> >                  tdh_vp_init               tdh_mng_vpflushdone
> >                  tdh_vp_enter              tdh_mng_key_config
> >                  tdh_vp_flush              tdh_mng_key_freeid
> >                  tdh_vp_rd_wr              tdh_mr_extend
> >                  tdh_mem_sept_add          tdh_mr_finalize
> >                  tdh_mem_sept_remove       tdh_vp_init_apicid
> >                  tdh_mem_page_aug          tdh_mem_page_add
> >                  tdh_mem_page_remove
> >                  tdh_mem_range_block
> >                  tdh_mem_track
> >                  tdh_mem_range_unblock
> >                  tdh_phymem_page_reclaim
> 
> In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
> pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is
> an exclusive lock. But we can ignore it because we only do reclaim at TD tear
> down time?
Hmm, if the page to reclaim is not a TDR page, lock_and_map_implicit_tdr() is
called to lock the page's corresponding TDR page with a SHARED lock. If the page
to reclaim is a TDR page, it is indeed locked EXCLUSIVE.

But in pamt_walk() it calls promote_sharex_lock_hp() for the passed-in
TDX_LOCK_EXCLUSIVE only when

    if ((pamt_1gb->pt == PT_REG) || (target_size == PT_1GB)) or
    if ((pamt_2mb->pt == PT_REG) || (target_size == PT_2MB))

"pamt_1gb->pt == PT_REG" (or "pamt_2mb->pt == PT_REG") is true when the entry is
assigned (not PT_NDA) and is a normal page (i.e. not TDR, TDVPR...). This is true
only after tdh_mem_page_add()/tdh_mem_page_aug() assigns the page to a TD at a
huge page size. This will not happen for a TDR page.

For normal pages, once huge pages are supported in the future, it looks like we
will need to update tdh_phymem_page_reclaim() to include size info too.

> Separately, I wonder if we should try to add this info as comments around the
> SEAMCALL implementations. The locking is not part of the spec, but nevertheless
> the kernel is being coded against these assumptions. So it can sort of be like
> "the kernel assumes this" and we can at least record what the reason was. Or
> maybe just comment the parts that KVM assumes.
Agreed.
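As an illustration of the kind of annotation being agreed to above, a sketch of a comment that could sit on the tdh_phymem_page_reclaim() wrapper; the wording and placement are assumptions, and the content paraphrases this discussion rather than the TDX spec:

/*
 * TDH.PHYMEM.PAGE.RECLAIM: if the reclaimed page is not a TDR page, its owning
 * TDR is locked SHARED (lock_and_map_implicit_tdr()); a TDR page itself is
 * locked EXCLUSIVE.  The PAMT walk only promotes to an exclusive lock for
 * assigned normal pages (PT_REG), which can never be the case for a TDR page.
 * KVM only reclaims pages at TD teardown, so this does not contend with
 * TDH.MEM.* on a live TD.  If/when huge pages are supported, the reclaim path
 * may also need page-size information.
 */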
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> > The page isn't yet mapped, so why would the guest be allowed to take a lock on
> > the S-EPT entry?
> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
> tdg_mem_page_accept().
> 
> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
I withdraw the reply above.

tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
SEPT entry, leading to contention.
- tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
  entry. It then acquires the guest-side lock of the found SEPT entry before
  checking the entry state.
- tdh_mem_page_aug() first walks the SEPT tree with a shared lock to locate the
  SEPT entry to modify. It then acquires the host-side lock of the SEPT entry
  before checking the entry state.
On 15/09/2024 9:53 pm, Zhao, Yan Y wrote:
> On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
>>> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
>>> The page isn't yet mapped, so why would the guest be allowed to take a lock on
>>> the S-EPT entry?
>> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
>> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
>> tdg_mem_page_accept().
>>
>> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
>> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
> I withdraw the reply above.
> 
> tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
> SEPT entry, leading to contention.
> - tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
>   entry. It then acquires the guest-side lock of the found SEPT entry before
>   checking the entry state.
> - tdh_mem_page_aug() first walks the SEPT tree with a shared lock to locate the
>   SEPT entry to modify. It then acquires the host-side lock of the SEPT entry
>   before checking the entry state.

It seems this can only happen when multiple threads in the guest try to do
tdg_mem_page_accept() on the same page. That should be extremely rare, and if it
does happen, it will eventually result in another fault in KVM.

So now we set the SPTE to FROZEN_SPTE before doing the AUG to prevent other
threads from proceeding. I think when tdh_mem_page_aug() fails with the secure
EPT "entry" busy, we can reset FROZEN_SPTE back to old_spte and return PF_RETRY
so that this thread and the other faulting thread can both try to complete the
AUG again? The thread that fails the AUG can also go back to the guest, though;
since the host priority bit is already set, the subsequent PAGE.ACCEPT will
fail, but that is fine because another AUG in KVM will eventually resolve it and
make progress for the guest.
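To make the shape of that suggestion concrete, a rough sketch of a set_private_spte hook that surfaces the contention as -EBUSY follows; the hook name, arguments and error handling are assumptions, and the actual rollback of the frozen SPTE to old_spte would live in the TDP MMU caller, not here:

/*
 * Hypothetical sketch, not the series' code: fail with -EBUSY when the AUG
 * contends with TDG.MEM.PAGE.ACCEPT so the caller can undo FROZEN_SPTE and
 * return to the fault path with a retry.  Only 4KB AUG is assumed; @level is
 * ignored for brevity.
 */
static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                     enum pg_level level, kvm_pfn_t pfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
        gpa_t gpa = gfn_to_gpa(gfn);
        hpa_t hpa = pfn_to_hpa(pfn);
        u64 rcx, rdx, err;

        err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &rcx, &rdx);

        /* Contended with TDG.MEM.PAGE.ACCEPT (or another AUG): let the caller retry. */
        if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
                return -EBUSY;

        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        return 0;
}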
On 14/09/2024 5:23 am, Sean Christopherson wrote: > On Fri, Sep 13, 2024, Yan Zhao wrote: >> This is a lock status report of TDX module for current SEAMCALL retry issue >> based on code in TDX module public repo https://github.com/intel/tdx-module.git >> branch TDX_1.5.05. >> >> TL;DR: >> - tdh_mem_track() can contend with tdh_vp_enter(). >> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > whatever reason. > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > hits the fault? > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > desirable because in many cases, the winning task will install a valid mapping > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > instruction is re-executed. In the happy case, that provides optimal performance > as KVM doesn't introduce any extra delay/latency. > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > much higher, especially in light of the zero-step issues. > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > then the TDX EPT violation path can safely retry locally, similar to the do-while > loop in kvm_tdp_map_page(). > > The only part I don't like about this idea is having two "retry" return values, > which creates the potential for bugs due to checking one but not the other. > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > option better even though the out-param is a bit gross, because it makes it more > obvious that the "frozen_spte" is a special case that doesn't need attention for > most paths. > [...] > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 5a475a6456d4..cbf9e46203f3 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > retry: > rcu_read_unlock(); > + if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)) > + return RET_PF_RETRY_FOZEN; Ack the whole "retry on frozen" approach, either with RET_PF_RETRY_FOZEN or fault->frozen_spte. One minor side effect: For normal VMs, the fault handler can also see a frozen spte, e.g, when kvm_tdp_mmu_map() checks the middle level SPTE: /* * If SPTE has been frozen by another thread, just give up and * retry, avoiding unnecessary page table allocation and free. */ if (is_frozen_spte(iter.old_spte)) goto retry; So for normal VM this RET_PF_RETRY_FOZEN will change "go back to guest to retry" to "retry in KVM internally". As you mentioned above for normal VMs we probably always want to "go back to guest to retry" even for FROZEN SPTE, but I guess this is a minor issue that we can even notice. 
Or we can additionally add:

        if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte) &&
            is_mirrored_sptep(iter.sptep))
                return RET_PF_RETRY_FOZEN;

So it only applies to TDX.
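Combined with the fault->frozen_spte out-param from the diff earlier in the thread, that restriction could look roughly like the sketch below; the helper name is made up, and whether limiting this to the mirror page tables is wanted at all is still being discussed:

/*
 * Hypothetical helper: only flag a frozen SPTE for "retry locally" when the
 * fault is on the mirror (TDX S-EPT) page tables, so normal VMs keep the
 * existing "go back to the guest and refault" behavior.
 */
static bool tdp_mmu_frozen_mirror_spte(struct tdp_iter *iter)
{
        return is_frozen_spte(iter->old_spte) && is_mirrored_sptep(iter->sptep);
}

The retry path in kvm_tdp_mmu_map() would then set fault->frozen_spte with this helper instead of checking is_frozen_spte() alone.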
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote: > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > > On Fri, Sep 13, 2024, Yan Zhao wrote: > > > This is a lock status report of TDX module for current SEAMCALL retry issue > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > > branch TDX_1.5.05. > > > > > > TL;DR: > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > > whatever reason. > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > > hits the fault? > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > > desirable because in many cases, the winning task will install a valid mapping > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > > instruction is re-executed. In the happy case, that provides optimal performance > > as KVM doesn't introduce any extra delay/latency. > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > > much higher, especially in light of the zero-step issues. > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > > then the TDX EPT violation path can safely retry locally, similar to the do-while > > loop in kvm_tdp_map_page(). > > > > The only part I don't like about this idea is having two "retry" return values, > > which creates the potential for bugs due to checking one but not the other. > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > > option better even though the out-param is a bit gross, because it makes it more > > obvious that the "frozen_spte" is a special case that doesn't need attention for > > most paths. > Good idea. > But could we extend it a bit more to allow TDX's EPT violation handler to also > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing FROZEN_SPTE might not be enough to prevent zero step mitigation. E.g. in below selftest with a TD configured with pending_ve_disable=N, zero step mitigation can be triggered on a vCPU that is stuck in EPT violation vm exit for more than 6 times (due to that user space does not do memslot conversion correctly). So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck in EPT violation vm exits. #include <stdint.h> #include "kvm_util.h" #include "processor.h" #include "tdx/tdcall.h" #include "tdx/tdx.h" #include "tdx/tdx_util.h" #include "tdx/test_util.h" #include "test_util.h" /* * 0x80000000 is arbitrarily selected, but it should not overlap with selftest * code or boot page. 
*/ #define ZERO_STEP_TEST_AREA_GPA (0x80000000) /* Test area GPA is arbitrarily selected */ #define ZERO_STEP_AREA_GVA_PRIVATE (0x90000000) /* The test area is 2MB in size */ #define ZERO_STEP_AREA_SIZE (2 << 20) #define ZERO_STEP_ASSERT(x) \ do { \ if (!(x)) \ tdx_test_fatal(__LINE__); \ } while (0) #define ZERO_STEP_ACCEPT_PRINT_PORT 0x87 #define ZERO_STEP_THRESHOLD 6 #define TRIGGER_ZERO_STEP_MITIGATION 1 static int convert_request_cnt; static void guest_test_zero_step(void) { void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE; memset(test_area_gva_private, 1, 8); tdx_test_success(); } static void guest_ve_handler(struct ex_regs *regs) { uint64_t ret; struct ve_info ve; ret = tdg_vp_veinfo_get(&ve); ZERO_STEP_ASSERT(!ret); /* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */ ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION); tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa); #define MEM_PAGE_ACCEPT_LEVEL_4K 0 #define MEM_PAGE_ACCEPT_LEVEL_2M 1 ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K); ZERO_STEP_ASSERT(!ret); } static void zero_step_test(void) { struct kvm_vm *vm; struct kvm_vcpu *vcpu; void *guest_code; uint64_t test_area_npages; vm_vaddr_t test_area_gva_private; vm = td_create(); td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0); guest_code = guest_test_zero_step; vcpu = td_vcpu_add(vm, 0, guest_code); vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler); test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size; vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, ZERO_STEP_TEST_AREA_GPA, 3, test_area_npages, KVM_MEM_GUEST_MEMFD); vm->memslots[MEM_REGION_TEST_DATA] = 3; test_area_gva_private = ____vm_vaddr_alloc( vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE, ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true); TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE); td_finalize(vm); handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA, ZERO_STEP_AREA_SIZE, false); for (;;) { vcpu_run(vcpu); if (vcpu->run->exit_reason == KVM_EXIT_IO && vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) { uint64_t gpa = tdx_test_read_64bit( vcpu, ZERO_STEP_ACCEPT_PRINT_PORT); printf("\t ... guest accepting 1 page at GPA: 0x%lx\n", gpa); continue; } else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) { bool skip = TRIGGER_ZERO_STEP_MITIGATION && (convert_request_cnt < ZERO_STEP_THRESHOLD -1); convert_request_cnt++; printf("guest request conversion of gpa 0x%llx - 0x%llx to %s, skip=%d\n", vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size, (vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE) ? "private" : "shared", skip); if (skip) continue; handle_memory_conversion( vm, vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size, vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE); continue; } printf("exit reason %d\n", vcpu->run->exit_reason); break; } kvm_vm_free(vm); } int main(int argc, char **argv) { /* Disable stdout buffering */ setbuf(stdout, NULL); if (!is_tdx_enabled()) { printf("TDX is not supported by the KVM\n" "Skipping the TDX tests.\n"); return 0; } run_in_new_process(&zero_step_test); }
On Wed, Sep 25, 2024, Yan Zhao wrote: > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote: > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > > > On Fri, Sep 13, 2024, Yan Zhao wrote: > > > > This is a lock status report of TDX module for current SEAMCALL retry issue > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > > > branch TDX_1.5.05. > > > > > > > > TL;DR: > > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > > > whatever reason. > > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > > > hits the fault? > > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > > > desirable because in many cases, the winning task will install a valid mapping > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > > > instruction is re-executed. In the happy case, that provides optimal performance > > > as KVM doesn't introduce any extra delay/latency. > > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > > > much higher, especially in light of the zero-step issues. > > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > > > then the TDX EPT violation path can safely retry locally, similar to the do-while > > > loop in kvm_tdp_map_page(). > > > > > > The only part I don't like about this idea is having two "retry" return values, > > > which creates the potential for bugs due to checking one but not the other. > > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > > > option better even though the out-param is a bit gross, because it makes it more > > > obvious that the "frozen_spte" is a special case that doesn't need attention for > > > most paths. > > Good idea. > > But could we extend it a bit more to allow TDX's EPT violation handler to also > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing > FROZEN_SPTE might not be enough to prevent zero step mitigation. The goal isn't to make it completely impossible for zero-step to fire, it's to make it so that _if_ zero-step fires, KVM can report the error to userspace without having to retry, because KVM _knows_ that advancing past the zero-step isn't something KVM can solve. : I'm not worried about any performance hit with zero-step, I'm worried about KVM : not being able to differentiate between a KVM bug and guest interference. The : goal with a local retry is to make it so that KVM _never_ triggers zero-step, : unless there is a bug somewhere. 
At that point, if zero-step fires, KVM can ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ : report the error to userspace instead of trying to suppress guest activity, and : potentially from other KVM tasks too. In other words, for the selftest you crafted, KVM reporting an error to userspace due to zero-step would be working as intended. > E.g. in below selftest with a TD configured with pending_ve_disable=N, > zero step mitigation can be triggered on a vCPU that is stuck in EPT violation > vm exit for more than 6 times (due to that user space does not do memslot > conversion correctly). > > So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may > contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck > in EPT violation vm exits.
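For reference, a rough sketch of what "report the error to userspace" could look like on KVM's side; the function name and the choice to bug the VM are assumptions, not an agreed design:

/*
 * Sketch only: if a TDH.MEM.* SEAMCALL hits S-EPT contention even though KVM
 * retried locally on every frozen SPTE, treat it as a KVM bug (or malicious
 * guest interference), mark the VM as bugged and let -EIO propagate to
 * userspace instead of silently retrying.
 */
static int tdx_sept_busy_is_fatal(struct kvm *kvm, u64 op, u64 err)
{
        if (err != (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
                return 0;

        pr_err_once("SEAMCALL 0x%llx hit unexpected S-EPT contention, err 0x%llx\n",
                    op, err);
        KVM_BUG_ON(1, kvm);     /* all vCPU ioctls will now fail with -EIO */
        return -EIO;
}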
On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote: > On Wed, Sep 25, 2024, Yan Zhao wrote: > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote: > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > > > > On Fri, Sep 13, 2024, Yan Zhao wrote: > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > > > > branch TDX_1.5.05. > > > > > > > > > > TL;DR: > > > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > > > > whatever reason. > > > > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > > > > hits the fault? > > > > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > > > > desirable because in many cases, the winning task will install a valid mapping > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > > > > instruction is re-executed. In the happy case, that provides optimal performance > > > > as KVM doesn't introduce any extra delay/latency. > > > > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > > > > much higher, especially in light of the zero-step issues. > > > > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while > > > > loop in kvm_tdp_map_page(). > > > > > > > > The only part I don't like about this idea is having two "retry" return values, > > > > which creates the potential for bugs due to checking one but not the other. > > > > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > > > > option better even though the out-param is a bit gross, because it makes it more > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for > > > > most paths. > > > Good idea. > > > But could we extend it a bit more to allow TDX's EPT violation handler to also > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing > > FROZEN_SPTE might not be enough to prevent zero step mitigation. > > The goal isn't to make it completely impossible for zero-step to fire, it's to > make it so that _if_ zero-step fires, KVM can report the error to userspace without > having to retry, because KVM _knows_ that advancing past the zero-step isn't > something KVM can solve. 
> > : I'm not worried about any performance hit with zero-step, I'm worried about KVM > : not being able to differentiate between a KVM bug and guest interference. The > : goal with a local retry is to make it so that KVM _never_ triggers zero-step, > : unless there is a bug somewhere. At that point, if zero-step fires, KVM can > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > : report the error to userspace instead of trying to suppress guest activity, and > : potentially from other KVM tasks too. > > In other words, for the selftest you crafted, KVM reporting an error to userspace > due to zero-step would be working as intended. Hmm, but the selftest is an example to show that 6 continuous EPT violations on the same GPA could trigger zero-step. For an extremely unlucky vCPU, is it still possible to fire zero step when nothing is wrong both in KVM and QEMU? e.g. 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) > > E.g. in below selftest with a TD configured with pending_ve_disable=N, > > zero step mitigation can be triggered on a vCPU that is stuck in EPT violation > > vm exit for more than 6 times (due to that user space does not do memslot > > conversion correctly). > > > > So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may > > contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck > > in EPT violation vm exits.
On Thu, Oct 10, 2024, Yan Zhao wrote: > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote: > > On Wed, Sep 25, 2024, Yan Zhao wrote: > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote: > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote: > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > > > > > branch TDX_1.5.05. > > > > > > > > > > > > TL;DR: > > > > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > > > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > > > > > whatever reason. > > > > > > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > > > > > hits the fault? > > > > > > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > > > > > desirable because in many cases, the winning task will install a valid mapping > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > > > > > instruction is re-executed. In the happy case, that provides optimal performance > > > > > as KVM doesn't introduce any extra delay/latency. > > > > > > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > > > > > much higher, especially in light of the zero-step issues. > > > > > > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while > > > > > loop in kvm_tdp_map_page(). > > > > > > > > > > The only part I don't like about this idea is having two "retry" return values, > > > > > which creates the potential for bugs due to checking one but not the other. > > > > > > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > > > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > > > > > option better even though the out-param is a bit gross, because it makes it more > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for > > > > > most paths. > > > > Good idea. > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing > > > FROZEN_SPTE might not be enough to prevent zero step mitigation. > > > > The goal isn't to make it completely impossible for zero-step to fire, it's to > > make it so that _if_ zero-step fires, KVM can report the error to userspace without > > having to retry, because KVM _knows_ that advancing past the zero-step isn't > > something KVM can solve. 
> > > > : I'm not worried about any performance hit with zero-step, I'm worried about KVM > > : not being able to differentiate between a KVM bug and guest interference. The > > : goal with a local retry is to make it so that KVM _never_ triggers zero-step, > > : unless there is a bug somewhere. At that point, if zero-step fires, KVM can > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > : report the error to userspace instead of trying to suppress guest activity, and > > : potentially from other KVM tasks too. > > > > In other words, for the selftest you crafted, KVM reporting an error to userspace > > due to zero-step would be working as intended. > Hmm, but the selftest is an example to show that 6 continuous EPT violations on > the same GPA could trigger zero-step. > > For an extremely unlucky vCPU, is it still possible to fire zero step when > nothing is wrong both in KVM and QEMU? > e.g. > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) Very technically, this shouldn't be possible. The only way for there to be contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the 6th attempt should succeed, even if the faulting vCPU wasn't the one to create the SPTE. That said, a few thoughts: 1. Where did we end up on the idea of requiring userspace to pre-fault memory? 2. The zero-step logic really should have a slightly more conservative threshold. I have a hard time believing that e.g. 10 attempts would create a side channel, but 6 attempts is "fine". 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map(). Yes, I'm being somewhat hypocritical since I'm so against retrying for the S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy for KVM to guarantee success. E.g. for #3, the below (compile tested only) patch should make it impossible for the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs should never trigger A/D assists, i.e. retry should always succeed. --- arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 3b996c1fdaab..e47573a652a9 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter, static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter, struct kvm_mmu_page *sp, bool shared); +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp) +{ + if (!sp) + return tdp_mmu_alloc_sp(vcpu); + + memset(sp, 0, sizeof(*sp)); + memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value, + PAGE_SIZE / sizeof(u64)); + return sp; +} + /* * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing * page tables and SPTEs to translate the faulting guest physical address. 
@@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter, int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_mmu *mmu = vcpu->arch.mmu; + struct kvm_mmu_page *sp = NULL; struct kvm *kvm = vcpu->kvm; struct tdp_iter iter; - struct kvm_mmu_page *sp; int ret = RET_PF_RETRY; kvm_mmu_hugepage_adjust(vcpu, fault); @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) rcu_read_lock(); tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) { - int r; - + /* + * Somewhat arbitrarily allow two local retries, e.g. to play + * nice with the extremely unlikely case that KVM encounters a + * huge SPTE an Access-assist _and_ a subsequent Dirty-assist. + * Retrying is inexpensive, but if KVM fails to install a SPTE + * three times, then a fourth attempt is likely futile and it's + * time to back off. + */ + int r, retry_locally = 2; +again: if (fault->nx_huge_page_workaround_enabled) disallowed_hugepage_adjust(fault, iter.old_spte, iter.level); @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) * The SPTE is either non-present or points to a huge page that * needs to be split. */ - sp = tdp_mmu_alloc_sp(vcpu); + sp = tdp_mmu_realloc_sp(vcpu, sp); tdp_mmu_init_child_sp(sp, &iter); sp->nx_huge_page_disallowed = fault->huge_page_disallowed; @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) r = tdp_mmu_link_sp(kvm, &iter, sp, true); /* - * Force the guest to retry if installing an upper level SPTE - * failed, e.g. because a different task modified the SPTE. + * If installing an upper level SPTE failed, retry the walk + * locally before forcing the guest to retry. If the SPTE was + * modified by a different task, odds are very good the new + * SPTE is usable as-is. And if the SPTE was modified by the + * CPU, e.g. to set A/D bits, then unless KVM gets *extremely* + * unlucky, the CMPXCHG should succeed the second time around. */ if (r) { - tdp_mmu_free_sp(sp); + if (retry_locally--) + goto again; goto retry; } @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) track_possible_nx_huge_page(kvm, sp); spin_unlock(&kvm->arch.tdp_mmu_pages_lock); } + sp = NULL; } /* @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) retry: rcu_read_unlock(); + + /* + * Free the previously allocated MMU page if KVM retried locally and + * ended up not using said page. + */ + if (sp) + tdp_mmu_free_sp(sp); return ret; } base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b --
On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

Isn't there a more general scenario:

vcpu0                                 vcpu1
1. Freezes PTE
2. External op to do the SEAMCALL
                                      3. Faults same PTE, hits frozen PTE
                                      4. Retries N times, triggers zero-step
5. Finally finishes external op

Am I missing something?

> 
> Very technically, this shouldn't be possible. The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?

For others' reference, I think you are referring to the idea to pre-fault the
entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT
pre-faulting/PAGE.ADD dance we are already doing.

The last discussion with Paolo was to resume the retry solution discussion on
the v2 posting because it would be easier "with everything else already
addressed". There was also some discussion that it was not immediately obvious
how prefaulting everything would work for memory hot plug (i.e. memslots added
during runtime).

> 
> 2. The zero-step logic really should have a slightly more conservative threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side channel,
>    but 6 attempts is "fine".

No idea where the threshold came from. I'm not sure if it affects the KVM
design? We can look into it for curiosity's sake in either case.

> 
> 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
>    for KVM to guarantee success.
> 
> E.g. for #3, the below (compile tested only) patch should make it impossible for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.

I don't see how it addresses the scenario above. More retries could just make it
rarer, but never fix it. Very possible I'm missing something though.
On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote: > On Thu, Oct 10, 2024, Yan Zhao wrote: > > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote: > > > On Wed, Sep 25, 2024, Yan Zhao wrote: > > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote: > > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote: > > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote: > > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue > > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git > > > > > > > branch TDX_1.5.05. > > > > > > > > > > > > > > TL;DR: > > > > > > > - tdh_mem_track() can contend with tdh_vp_enter(). > > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > > > > > > > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to > > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters > > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could > > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for > > > > > > whatever reason. > > > > > > > > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path, > > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries > > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still > > > > > > hits the fault? > > > > > > > > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is > > > > > > desirable because in many cases, the winning task will install a valid mapping > > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the > > > > > > instruction is re-executed. In the happy case, that provides optimal performance > > > > > > as KVM doesn't introduce any extra delay/latency. > > > > > > > > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much, > > > > > > much higher, especially in light of the zero-step issues. > > > > > > > > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and > > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1', > > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while > > > > > > loop in kvm_tdp_map_page(). > > > > > > > > > > > > The only part I don't like about this idea is having two "retry" return values, > > > > > > which creates the potential for bugs due to checking one but not the other. > > > > > > > > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate > > > > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that > > > > > > option better even though the out-param is a bit gross, because it makes it more > > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for > > > > > > most paths. > > > > > Good idea. > > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also > > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY? > > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing > > > > FROZEN_SPTE might not be enough to prevent zero step mitigation. 
> > > > > > The goal isn't to make it completely impossible for zero-step to fire, it's to > > > make it so that _if_ zero-step fires, KVM can report the error to userspace without > > > having to retry, because KVM _knows_ that advancing past the zero-step isn't > > > something KVM can solve. > > > > > > : I'm not worried about any performance hit with zero-step, I'm worried about KVM > > > : not being able to differentiate between a KVM bug and guest interference. The > > > : goal with a local retry is to make it so that KVM _never_ triggers zero-step, > > > : unless there is a bug somewhere. At that point, if zero-step fires, KVM can > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > : report the error to userspace instead of trying to suppress guest activity, and > > > : potentially from other KVM tasks too. > > > > > > In other words, for the selftest you crafted, KVM reporting an error to userspace > > > due to zero-step would be working as intended. > > Hmm, but the selftest is an example to show that 6 continuous EPT violations on > > the same GPA could trigger zero-step. > > > > For an extremely unlucky vCPU, is it still possible to fire zero step when > > nothing is wrong both in KVM and QEMU? > > e.g. > > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) > > Very technically, this shouldn't be possible. The only way for there to be > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create > the SPTE. Hmm, the 7th EPT violation could still occur if the vCPU that sees failure of "try_cmpxchg64()" returns to guest faster than the one that successfully installs the SPTE. > > That said, a few thoughts: > > 1. Where did we end up on the idea of requiring userspace to pre-fault memory? I didn't follow this question. Do you want to disallow userspace to pre-fault memory after TD finalization or do you want to suggest userspace to do it? > > 2. The zero-step logic really should have a slightly more conservative threshold. > I have a hard time believing that e.g. 10 attempts would create a side channel, > but 6 attempts is "fine". Don't know where the value 6 comes. :) We may need to ask. > 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map(). > Yes, I'm being somewhat hypocritical since I'm so against retrying for the > S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy > for KVM to guarantee success. It's reasonable. But TDX code still needs to retry for the RET_PF_RETRY_FROZEN without re-entering guest. Would it be good for TDX code to retry whenever it sees RET_PF_RETRY or RET_PF_RETRY_FOZEN? We can have tdx_sept_link_private_spt()/tdx_sept_set_private_spte() to return -EBUSY on contention. > > E.g. for #3, the below (compile tested only) patch should make it impossible for > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs > should never trigger A/D assists, i.e. retry should always succeed. 
> > --- > arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------ > 1 file changed, 40 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 3b996c1fdaab..e47573a652a9 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter, > static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter, > struct kvm_mmu_page *sp, bool shared); > > +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu, > + struct kvm_mmu_page *sp) > +{ > + if (!sp) > + return tdp_mmu_alloc_sp(vcpu); > + > + memset(sp, 0, sizeof(*sp)); > + memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value, > + PAGE_SIZE / sizeof(u64)); > + return sp; > +} > + > /* > * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing > * page tables and SPTEs to translate the faulting guest physical address. > @@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter, > int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > { > struct kvm_mmu *mmu = vcpu->arch.mmu; > + struct kvm_mmu_page *sp = NULL; > struct kvm *kvm = vcpu->kvm; > struct tdp_iter iter; > - struct kvm_mmu_page *sp; > int ret = RET_PF_RETRY; > > kvm_mmu_hugepage_adjust(vcpu, fault); > @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > rcu_read_lock(); > > tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) { > - int r; > - > + /* > + * Somewhat arbitrarily allow two local retries, e.g. to play > + * nice with the extremely unlikely case that KVM encounters a > + * huge SPTE an Access-assist _and_ a subsequent Dirty-assist. > + * Retrying is inexpensive, but if KVM fails to install a SPTE > + * three times, then a fourth attempt is likely futile and it's > + * time to back off. > + */ > + int r, retry_locally = 2; > +again: > if (fault->nx_huge_page_workaround_enabled) > disallowed_hugepage_adjust(fault, iter.old_spte, iter.level); > > @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > * The SPTE is either non-present or points to a huge page that > * needs to be split. > */ > - sp = tdp_mmu_alloc_sp(vcpu); > + sp = tdp_mmu_realloc_sp(vcpu, sp); > tdp_mmu_init_child_sp(sp, &iter); > > sp->nx_huge_page_disallowed = fault->huge_page_disallowed; > @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > r = tdp_mmu_link_sp(kvm, &iter, sp, true); > > /* > - * Force the guest to retry if installing an upper level SPTE > - * failed, e.g. because a different task modified the SPTE. > + * If installing an upper level SPTE failed, retry the walk > + * locally before forcing the guest to retry. If the SPTE was > + * modified by a different task, odds are very good the new > + * SPTE is usable as-is. And if the SPTE was modified by the > + * CPU, e.g. to set A/D bits, then unless KVM gets *extremely* > + * unlucky, the CMPXCHG should succeed the second time around. 
> */ > if (r) { > - tdp_mmu_free_sp(sp); > + if (retry_locally--) > + goto again; > goto retry; > } > > @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > track_possible_nx_huge_page(kvm, sp); > spin_unlock(&kvm->arch.tdp_mmu_pages_lock); > } > + sp = NULL; > } > > /* > @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > > retry: > rcu_read_unlock(); > + > + /* > + * Free the previously allocated MMU page if KVM retried locally and > + * ended up not using said page. > + */ > + if (sp) > + tdp_mmu_free_sp(sp); > return ret; > } > > > base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b > -- >
On Fri, Oct 11, 2024 at 05:53:29AM +0800, Edgecombe, Rick P wrote: > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote: > > > > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) > > Isn't there a more general scenario: > > vcpu0 vcpu1 > 1. Freezes PTE > 2. External op to do the SEAMCALL > 3. Faults same PTE, hits frozen PTE > 4. Retries N times, triggers zero-step > 5. Finally finishes external op > > Am I missing something? Yes, it's a follow-up discussion of Sean's proposal [1] of having TDX code to retry on RET_PF_RETRY_FROZEN to avoid zero-step. My worry is that merely avoiding entering guest for vCPUs seeing FROZEN_SPTE is not enough to prevent zero-step. The two examples shows zero-step is possible without re-entering guest for FROZEN_SPTE: - The selftest [2]: a single vCPU can fire zero-step when userspace does something wrong (though KVM is correct). - The above case: Nothing wrong in KVM/QEMU, except an extremely unlucky vCPU. [1] https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/ [2] https://lore.kernel.org/all/ZvPrqMj1BWrkkwqN@yzhao56-desk.sh.intel.com/ > > > > > Very technically, this shouldn't be possible. The only way for there to be > > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. > > the > > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create > > the SPTE. > > > > That said, a few thoughts: > > > > 1. Where did we end up on the idea of requiring userspace to pre-fault memory? > > For others reference, I think you are referring to the idea to pre-fault the > entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT pre- > faulting/PAGE.ADD dance we are already doing. > > The last discussion with Paolo was to resume the retry solution discussion on > the v2 posting because it would be easier "with everything else already > addressed". Also, there was also some discussion that it was not immediately > obvious how prefaulting everything would work for memory hot plug (i.e. memslots > added during runtime). > > > > > 2. The zero-step logic really should have a slightly more conservative > > threshold. > > I have a hard time believing that e.g. 10 attempts would create a side > > channel, > > but 6 attempts is "fine". > > No idea where the threshold came from. I'm not sure if it affects the KVM > design? We can look into it for curiosity sake in either case. > > > > > 3. This would be a good reason to implement a local retry in > > kvm_tdp_mmu_map(). > > Yes, I'm being somewhat hypocritical since I'm so against retrying for the > > S-EPT case, but my objection to retrying for S-EPT is that it _should_ be > > easy > > for KVM to guarantee success. > > > > E.g. for #3, the below (compile tested only) patch should make it impossible > > for > > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror > > SPTEs > > should never trigger A/D assists, i.e. retry should always succeed. > > I don't see how it addresses the scenario above. More retires could just make it > rarer, but never fix it. Very possible I'm missing something though. I'm also not 100% sure if zero-step must not happen after this change even when KVM/QEMU do nothing wrong.
On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote: > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote: > > > > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) > > Isn't there a more general scenario: > > vcpu0 vcpu1 > 1. Freezes PTE > 2. External op to do the SEAMCALL > 3. Faults same PTE, hits frozen PTE > 4. Retries N times, triggers zero-step > 5. Finally finishes external op > > Am I missing something? I must be missing something. I thought KVM is going to retry internally for step 4 (retries N times) because it sees the frozen PTE, but will never go back to guest after the fault is resolved? How can step 4 triggers zero-step?
On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote:
> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > > 
> > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> > 
> > Isn't there a more general scenario:
> > 
> > vcpu0                                 vcpu1
> > 1. Freezes PTE
> > 2. External op to do the SEAMCALL
> >                                       3. Faults same PTE, hits frozen PTE
> >                                       4. Retries N times, triggers zero-step
> > 5. Finally finishes external op
> > 
> > Am I missing something?
> 
> I must be missing something. I thought KVM is going to
> 

"Is going to", as in "will be changed to"? Or "does today"?

> retry internally for
> step 4 (retries N times) because it sees the frozen PTE, but will never go back
> to guest after the fault is resolved? How can step 4 triggers zero-step?

Steps 3-4 are saying it will go back to the guest and fault again.

As far as what KVM will do in the future, I think it is still open. I've not had
the chance to think about this for more than 30 min at a time, but the plan to
handle OPERAND_BUSY by taking an expensive path to break any contention (i.e.
kick+lock + whatever TDX module changes we come up with) seems to be the leading
idea.

Retry N times is too hacky. Retry internally forever might be awkward to
implement. Because of the signal_pending() check, you would have to handle
exiting to userspace and going back to an EPT violation next time the vcpu
tries to enter.
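For illustration only, a very rough sketch of the "kick + redo" shape of that expensive path; how the kicked vCPUs are actually held outside the guest until the SEAMCALL completes (a lock, a request, or a TDX-module change) is deliberately left out because that part is still open:

/*
 * Very rough sketch of breaking S-EPT contention by forcing vCPUs out of the
 * guest and redoing the SEAMCALL.  A real implementation must also keep the
 * kicked vCPUs from immediately re-entering; that is elided here.
 */
static u64 tdx_seamcall_sept_kick(struct kvm *kvm, u64 op,
                                  struct tdx_module_args *in)
{
        struct tdx_module_args saved = *in;
        u64 err;

        err = seamcall_ret(op, in);
        if (err != (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
                return err;

        /*
         * Force all vCPUs out of the guest so that TDH.VP.ENTER (and its
         * zero-step mitigation) cannot hold the S-EPT lock, then redo the
         * SEAMCALL once with the original arguments.
         */
        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

        *in = saved;
        return seamcall_ret(op, in);
}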
On 15/10/2024 6:36 am, Edgecombe, Rick P wrote: > On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote: >> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote: >>> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote: >>>>> >>>>> 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found. >>>>> 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total) >>> >>> Isn't there a more general scenario: >>> >>> vcpu0 vcpu1 >>> 1. Freezes PTE >>> 2. External op to do the SEAMCALL >>> 3. Faults same PTE, hits frozen PTE >>> 4. Retries N times, triggers zero-step >>> 5. Finally finishes external op >>> >>> Am I missing something? >> >> I must be missing something. I thought KVM is going to >> > > "Is going to", as in "will be changed to"? Or "does today"? Will be changed to (today's behaviour is to go back to guest to let the fault happen again to retry). AFAICT this is what Sean suggested: https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/ The whole point is to let KVM loop internally but not go back to guest when the fault handler sees a frozen PTE. And in this proposal this applies to both leaf and non-leaf PTEs IIUC, so it should handle the case where try_cmpxchg64() fails as mentioned by Yan. > >> retry internally for >> step 4 (retries N times) because it sees the frozen PTE, but will never go back >> to guest after the fault is resolved? How can step 4 triggers zero-step? > > Step 3-4 is saying it will go back to the guest and fault again. As said above, the whole point is to make KVM loop internally when it sees a frozen PTE, but not go back to guest.
On Tue, 2024-10-15 at 12:03 +1300, Huang, Kai wrote: > > "Is going to", as in "will be changed to"? Or "does today"? > > Will be changed to (today's behaviour is to go back to guest to let the > fault happen again to retry). > > AFAICT this is what Sean suggested: > > https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/ > > The whole point is to let KVM loop internally but not go back to guest > when the fault handler sees a frozen PTE. And in this proposal this > applies to both leaf and non-leaf PTEs IIUC, so it should handle the > case where try_cmpxchg64() fails as mentioned by Yan. > > > > > > retry internally for > > > step 4 (retries N times) because it sees the frozen PTE, but will never go > > > back > > > to guest after the fault is resolved? How can step 4 triggers zero-step? > > > > Step 3-4 is saying it will go back to the guest and fault again. > > As said above, the whole point is to make KVM loop internally when it > sees a frozen PTE, but not go back to guest. Yea, I was saying on that idea that I thought looping forever without checking for a signal would be problematic. Then userspace could re-enter the TD. I don't know if it's a show stopper. In any case the discussion between these threads and LPC/KVM forum hallway chatter has gotten a bit fragmented. I don't think there is any concrete consensus solution at this point.
On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > >
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > >
> > > > > > The zero-step logic seems to be the most problematic. E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > > whatever reason.
> > > > > >
> > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > >
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed. In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > >
> > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > >
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > >
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > >
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen. I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > >
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > >
> > > : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > > : not being able to differentiate between a KVM bug and guest interference. The
> > > : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > > : unless there is a bug somewhere. At that point, if zero-step fires, KVM can
> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > : report the error to userspace instead of trying to suppress guest activity, and
> > > : potentially from other KVM tasks too.
> > >
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> >
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> >
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
>
> Very technically, this shouldn't be possible. The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
You are right!
I just realized that if TDX code retries internally for FROZEN_SPTEs, the 6th
attempt should succeed.

But I found below might be another case to return RET_PF_RETRY and trigger
zero-step:

Suppose GFNs are shared from 0x80000 - 0x80200, with HVA starting from hva1 of
size 0x200

vCPU 0                                  vCPU 1
1. Access GFN 0x80002
2. convert GFN 0x80002 to private
                                        3. munmap hva1 of size 0x200
                                           kvm_mmu_invalidate_begin
                                           mmu_invalidate_range_start=0x80000
                                           mmu_invalidate_range_end=0x80200
4. kvm_faultin_pfn
   mmu_invalidate_retry_gfn_unsafe
   of GFN 0x80002 and return
   RET_PF_RETRY!
                                        5. kvm_mmu_invalidate_end

Before step 5, step 4 will produce RET_PF_RETRY and re-enter guest.
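
To illustrate why step 4 above bails out with RET_PF_RETRY, here is a
simplified sketch of the invalidation check on the fault path. It loosely
follows kvm_faultin_pfn(); the helper and field names are simplified
assumptions rather than the exact upstream code.

/*
 * Simplified sketch of the check that makes step 4 return RET_PF_RETRY:
 * while the munmap on vCPU 1 has an mmu_invalidate range covering the GFN,
 * the fault is abandoned and the guest re-faults, which is what can feed
 * the zero-step counter.
 */
static int faultin_pfn_sketch(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
        fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
        smp_rmb();

        if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
                return RET_PF_RETRY;    /* step 4: back to the guest */

        return RET_PF_CONTINUE;         /* otherwise go on and map the pfn */
}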
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 0363d8544f42..8ca3e252a6ed 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -31,6 +31,40 @@
 #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)        \
         pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
 
+/*
+ * TDX module acquires its internal lock for resources. It doesn't spin to get
+ * locks because of its restrictions of allowed execution time. Instead, it
+ * returns TDX_OPERAND_BUSY with an operand id.
+ *
+ * Multiple VCPUs can operate on SEPT. Also with zero-step attack mitigation,
+ * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step
+ * attack is suspected. It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT
+ * with TDH.MEM.* operation. Note: TDH.MEM.TRACK is an exception.
+ *
+ * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL
+ * spoils TDP MMU effort. Retry several times with the assumption that SEPT
+ * lock contention is rare. But don't loop forever to avoid lockup. Let TDP
+ * MMU retry.
+ */
+#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
+{
+#define SEAMCALL_RETRY_MAX     16
+        struct tdx_module_args args_in;
+        int retry = SEAMCALL_RETRY_MAX;
+        u64 ret;
+
+        do {
+                args_in = *in;
+                ret = seamcall_ret(op, in);
+        } while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+
+        *in = args_in;
+
+        return ret;
+}
+
 static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr)
 {
         struct tdx_module_args in = {
@@ -55,7 +89,7 @@ static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
         u64 ret;
 
         clflush_cache_range(__va(hpa), PAGE_SIZE);
-        ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -76,7 +110,7 @@ static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 
         clflush_cache_range(__va(page), PAGE_SIZE);
 
-        ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -93,7 +127,7 @@ static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
         };
         u64 ret;
 
-        ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -123,7 +157,7 @@ static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa
         u64 ret;
 
         clflush_cache_range(__va(hpa), PAGE_SIZE);
-        ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -140,7 +174,7 @@ static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa,
         };
         u64 ret;
 
-        ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -335,7 +369,7 @@ static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
         };
         u64 ret;
 
-        ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
@@ -361,7 +395,7 @@ static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa,
         };
         u64 ret;
 
-        ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in);
+        ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in);
 
         *rcx = in.rcx;
         *rdx = in.rdx;
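
For context, a hypothetical caller-side sketch of how this wrapper could be
consumed: tdx_seamcall_sept() gives up after SEAMCALL_RETRY_MAX attempts, so
TDX_ERROR_SEPT_BUSY can still reach the caller and has to be translated into a
retryable error for the TDP MMU. The function below is illustrative only and
not part of this patch; only tdx_seamcall_sept(), TDH_MEM_PAGE_REMOVE, and
pr_tdx_error_3() come from the diff above.

/*
 * Hypothetical caller-side handling (not part of this patch): map a residual
 * TDX_ERROR_SEPT_BUSY to -EBUSY so the TDP MMU can retry the fault, and treat
 * any other non-zero status as a KVM bug.
 */
static int tdx_sept_page_remove_sketch(struct kvm *kvm, struct tdx_module_args *in)
{
        u64 err = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, in);

        if (unlikely(err == TDX_ERROR_SEPT_BUSY))
                return -EBUSY;          /* let the TDP MMU retry the fault */

        if (KVM_BUG_ON(err, kvm)) {
                pr_tdx_error_3(TDH_MEM_PAGE_REMOVE, err, in->rcx, in->rdx, in->r8);
                return -EIO;
        }

        return 0;
}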