Message ID | 20241121115139.26338-1-yan.y.zhao@intel.com (mailing list archive) |
---|---|
Headers | show |
Series | SEPT SEAMCALL retry proposal | expand |
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote: > ==proposal details== > > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs > together which are required for page installation and uninstallation. > > These SEAMCALLs can be divided into three groups: > Group 1: tdh_mem_page_add(). > The SEAMCALL is invoked only during TD build time and therefore > KVM has ensured that no contention will occur. > > Proposal: (as in patch 1) > Just return error when TDX_OPERAND_BUSY is found. > > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug(). > These two SEAMCALLs are invoked for page installation. > They return TDX_OPERAND_BUSY when contending with tdh_vp_enter() > (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(), > tdg_mem_page_attr_rd/wr(). We did verify with TDX module folks that the TDX module could be changed to not take the sept host priority lock for zero entries (that happen during the guest operations). In that case, I think we shouldn't expect contention for tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for tdh_vp_enter() though. > > Proposal: (as in patch 1) > - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY > to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault(). > > - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as > long as there are no pending signals/interrupts. > > The retry inside TDX aims to reduce the count of tdh_vp_enter() > before resolving EPT violations in the local vCPU, thereby > minimizing contentions with other vCPUs. However, it can't > completely eliminate 0-step mitigation as it exits when there're > pending signals/interrupts and does not (and cannot) remove the > tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) > tdh_mem_page_aug > ------------------------------------------------------------ > SEPT entry tdh_mem_sept_add (Host lock) > tdh_mem_page_aug (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(). > These three SEAMCALLs are invoked for page uninstallation, with > KVM mmu_lock held for writing. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > TDCS epoch tdh_vp_enter tdh_mem_track > ------------------------------------------------------------ > SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) > tdh_mem_range_block > ------------------------------------------------------------ > SEPT entry tdh_mem_range_block (Host lock) > tdh_mem_page_remove (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Proposal: (as in patch 2) > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only > once. > - During the retry, kick off all vCPUs and prevent any vCPU from > entering to avoid potential contentions. > > This is because tdh_vp_enter() and TDCALLs are not protected by > KVM mmu_lock, and remove_external_spte() in KVM must not fail. The solution for group 3 actually doesn't look too bad at all to me. At least from code and complexity wise. It's pretty compact, doesn't add any locks, and limited to the tdx.c code. Although, I didn't evaluate the implementation correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet. Were you able to test the fallback path at all? Can we think of any real situations where it could be burdensome? One other thing to note I think, is that group 3 are part of no-fail operations. The core KVM calling code doesn't have the understanding of a failure there. So in this scheme of not avoiding contention we have to succeed before returning, where group 1 and 2 can fail, so don't need the special fallback scheme.
On Tue, Nov 26, 2024 at 08:46:51AM +0800, Edgecombe, Rick P wrote: > On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote: > > ==proposal details== > > > > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs > > together which are required for page installation and uninstallation. > > > > These SEAMCALLs can be divided into three groups: > > Group 1: tdh_mem_page_add(). > > The SEAMCALL is invoked only during TD build time and therefore > > KVM has ensured that no contention will occur. > > > > Proposal: (as in patch 1) > > Just return error when TDX_OPERAND_BUSY is found. > > > > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug(). > > These two SEAMCALLs are invoked for page installation. > > They return TDX_OPERAND_BUSY when contending with tdh_vp_enter() > > (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(), > > tdg_mem_page_attr_rd/wr(). > > We did verify with TDX module folks that the TDX module could be changed to not > take the sept host priority lock for zero entries (that happen during the guest > operations). In that case, I think we shouldn't expect contention for > tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for > tdh_vp_enter() though. Ah, you are right. I previously incorrectly thought TDX module will avoid locking free entries for tdg_mem_page_accept() only. > > > > > > Proposal: (as in patch 1) > > - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY > > to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault(). > > > > - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as > > long as there are no pending signals/interrupts. > > > > The retry inside TDX aims to reduce the count of tdh_vp_enter() > > before resolving EPT violations in the local vCPU, thereby > > minimizing contentions with other vCPUs. However, it can't > > completely eliminate 0-step mitigation as it exits when there're > > pending signals/interrupts and does not (and cannot) remove the > > tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT. > > > > Resources SHARED users EXCLUSIVE users > > ------------------------------------------------------------ > > SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) > > tdh_mem_page_aug > > ------------------------------------------------------------ > > SEPT entry tdh_mem_sept_add (Host lock) > > tdh_mem_page_aug (Host lock) > > tdg_mem_page_accept (Guest lock) > > tdg_mem_page_attr_rd (Guest lock) > > tdg_mem_page_attr_wr (Guest lock) > > > > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(). > > These three SEAMCALLs are invoked for page uninstallation, with > > KVM mmu_lock held for writing. > > > > Resources SHARED users EXCLUSIVE users > > ------------------------------------------------------------ > > TDCS epoch tdh_vp_enter tdh_mem_track > > ------------------------------------------------------------ > > SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) > > tdh_mem_range_block > > ------------------------------------------------------------ > > SEPT entry tdh_mem_range_block (Host lock) > > tdh_mem_page_remove (Host lock) > > tdg_mem_page_accept (Guest lock) > > tdg_mem_page_attr_rd (Guest lock) > > tdg_mem_page_attr_wr (Guest lock) > > > > Proposal: (as in patch 2) > > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only > > once. > > - During the retry, kick off all vCPUs and prevent any vCPU from > > entering to avoid potential contentions. > > > > This is because tdh_vp_enter() and TDCALLs are not protected by > > KVM mmu_lock, and remove_external_spte() in KVM must not fail. > > The solution for group 3 actually doesn't look too bad at all to me. At least > from code and complexity wise. It's pretty compact, doesn't add any locks, and > limited to the tdx.c code. Although, I didn't evaluate the implementation > correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet. > > Were you able to test the fallback path at all? Can we think of any real > situations where it could be burdensome? Yes, I did some negative tests to fail block/track/remove to check if the tdx_no_vcpus_enter*() work. Even without those negative tests, it's not rare for tdh_mem_track() to return busy due to its contention with tdh_vp_enter(). Given that normally it's not frequent to find tdh_mem_range_block() or tdh_mem_page_remove() to return busy (if we reduce the frequency of zero-step mitigation) and that tdh_mem_track() will kick off all vCPUs later any way, I think it's acceptable to do the tdx_no_vcpus_enter*() stuffs in the page removal path. > One other thing to note I think, is that group 3 are part of no-fail operations. > The core KVM calling code doesn't have the understanding of a failure there. So > in this scheme of not avoiding contention we have to succeed before returning, > where group 1 and 2 can fail, so don't need the special fallback scheme. Yes, exactly.
On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote: > This SEPT SEAMCALL retry proposal aims to remove patch > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" > [1] at the tail of v2 series "TDX MMU Part 2". > > ==brief history== > > In the v1 series 'TDX MMU Part 2', there were several discussions regarding > the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the > SEAMCALL wrapper tdx_seamcall_sept(). > > The lock status of each SEAMCALL relevant to KVM was analyzed in [2]. > > The conclusion was that 16 retries was necessary because > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected. > > When the TDX module detects that EPT violations are caused by the same > RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter() > will take SEPT tree lock and contend with tdh_mem*(). > > - tdg_mem_page_accept() can contend with other tdh_mem*(). > > > Sean provided several good suggestions[3], including: > - Implement retries within TDX code when the TDP MMU returns > RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering > 0-step mitigation. > - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*() > inside TDX module. > - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and > prevent tdh_vp_enter() during page uninstallation. > > Yan later found out that only retry RET_PF_RETRY_FOZEN within TDX code is > insufficient to prevent 0-step mitigation [4]. > > Rick and Yan then consulted TDX module team with findings that: > - The threshold of zero-step mitigation is counted per vCPU. > It's of value 6 because > > "There can be at most 2 mapping faults on instruction fetch > (x86 macro-instructions length is at most 15 bytes) when the > instruction crosses page boundary; then there can be at most 2 > mapping faults for each memory operand, when the operand crosses > page boundary. For most of x86 macro-instructions, there are up to 2 > memory operands and each one of them is small, which brings us to > maximum 2+2*2 = 6 legal mapping faults." > > - Besides tdg_mem_page_accept(), tdg_mem_page_attr_rd/wr() can also > contend with SEAMCALLs tdh_mem*(). > > So, we decided to make a proposal to tolerate 0-step mitigation. > > ==proposal details== > > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs > together which are required for page installation and uninstallation. > > These SEAMCALLs can be divided into three groups: > Group 1: tdh_mem_page_add(). > The SEAMCALL is invoked only during TD build time and therefore > KVM has ensured that no contention will occur. > > Proposal: (as in patch 1) > Just return error when TDX_OPERAND_BUSY is found. > > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug(). > These two SEAMCALLs are invoked for page installation. > They return TDX_OPERAND_BUSY when contending with tdh_vp_enter() > (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(), > tdg_mem_page_attr_rd/wr(). > > Proposal: (as in patch 1) > - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY > to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault(). > > - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as > long as there are no pending signals/interrupts. Alternatively, we can have the tdx_handle_ept_violation() do not retry internally TDX. Instead, keep the 16 times retries in tdx_seamcall_sept() for tdh_mem_sept_add() and tdh_mem_page_aug() only, i.e. only for SEAMCALLs in Group 2. > > The retry inside TDX aims to reduce the count of tdh_vp_enter() > before resolving EPT violations in the local vCPU, thereby > minimizing contentions with other vCPUs. However, it can't > completely eliminate 0-step mitigation as it exits when there're > pending signals/interrupts and does not (and cannot) remove the > tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) > tdh_mem_page_aug > ------------------------------------------------------------ > SEPT entry tdh_mem_sept_add (Host lock) > tdh_mem_page_aug (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove(). > These three SEAMCALLs are invoked for page uninstallation, with > KVM mmu_lock held for writing. > > Resources SHARED users EXCLUSIVE users > ------------------------------------------------------------ > TDCS epoch tdh_vp_enter tdh_mem_track > ------------------------------------------------------------ > SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) > tdh_mem_range_block > ------------------------------------------------------------ > SEPT entry tdh_mem_range_block (Host lock) > tdh_mem_page_remove (Host lock) > tdg_mem_page_accept (Guest lock) > tdg_mem_page_attr_rd (Guest lock) > tdg_mem_page_attr_wr (Guest lock) > > Proposal: (as in patch 2) > - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only > once. > - During the retry, kick off all vCPUs and prevent any vCPU from > entering to avoid potential contentions. > > This is because tdh_vp_enter() and TDCALLs are not protected by > KVM mmu_lock, and remove_external_spte() in KVM must not fail. > > > > SEAMCALL Lock Type Resource > -----------------------------Group 1-------------------------------- > tdh_mem_page_add EXCLUSIVE TDR > NO_LOCK TDCS > NO_LOCK SEPT tree > EXCLUSIVE page to add > > ----------------------------Group 2-------------------------------- > tdh_mem_sept_add SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to add > > > tdh_mem_page_aug SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to aug > > ----------------------------Group 3-------------------------------- > tdh_mem_range_block SHARED TDR > SHARED TDCS > EXCLUSIVE SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > > tdh_mem_track SHARED TDR > SHARED TDCS > EXCLUSIVE TDCS epoch > > tdh_mem_page_remove SHARED TDR > SHARED TDCS > SHARED SEPT tree > HOST,EXCLUSIVE SEPT entry to modify > EXCLUSIVE page to remove > > > [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com > [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com > [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com > [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com > > Yan Zhao (2): > KVM: TDX: Retry in TDX when installing TD private/sept pages > KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal > > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/vmx/tdx.c | 102 +++++++++++++++++++++++++------- > 3 files changed, 85 insertions(+), 21 deletions(-) > > -- > 2.43.2 >
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote: > This SEPT SEAMCALL retry proposal aims to remove patch > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" > [1] at the tail of v2 series "TDX MMU Part 2". We discussed this on the PUCK call. A couple alternatives were considered: - Avoiding 0-step. To handle signals kicking to userspace we could try to add code to generate synthetic EPT violations if KVM thinks the 0-step mitigation might be active (i.e. the fault was not resolved). The consensus was that this would be continuing battle and possibly impossible due normal guest behavior triggering the mitigation. - Pre-faulting all S-EPT, such that contention with AUG won't happen. The discussion was that this would only be a temporary solution as the MMU operations get more complicated (huge pages, etc). Also there is also private/shared conversions and memory hotplug already. So we will proceed with this kick+lock+retry solution. The reasoning is to optimize for the normal non-contention path, without having an overly complicated solution for KVM. In all the branch commotion recently, these patches fell out of our dev branch. So we just recently integrated then into a 6.13 kvm-coco-queue based branch. We need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no issues, we can post the 6.13 rebase to included in kvm-coco-queue with instructions on which patches to remove from kvm-coco-queue (i.e. the 16 retries). We also briefly touched on the TDX module behavior where guest operations can lock NP PTEs. The kick solution doesn't require changing this functionally, but it should still be done to help with debugging issues related to KVM's contention solution. Thanks all for the discussion!
On Tue, Dec 17, 2024, Rick P Edgecombe wrote: > On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote: > > This SEPT SEAMCALL retry proposal aims to remove patch > > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT" > > [1] at the tail of v2 series "TDX MMU Part 2". > > We discussed this on the PUCK call. A couple alternatives were considered: > - Avoiding 0-step. To handle signals kicking to userspace we could try to add > code to generate synthetic EPT violations if KVM thinks the 0-step mitigation > might be active (i.e. the fault was not resolved). The consensus was that this > would be continuing battle and possibly impossible due normal guest behavior > triggering the mitigation. Specifically, the TDX Module takes its write-lock if the guest takes EPT violations exits on the same RIP 6 times, i.e. detects forward progress based purely on the RIP at entry vs. exit. So a guest that is touching memory in a loop could trigger zero-step checking even if KVM promptly fixes every EPT violation. > - Pre-faulting all S-EPT, such that contention with AUG won't happen. The > discussion was that this would only be a temporary solution as the MMU > operations get more complicated (huge pages, etc). Also there is also > private/shared conversions and memory hotplug already. > > So we will proceed with this kick+lock+retry solution. The reasoning is to > optimize for the normal non-contention path, without having an overly > complicated solution for KVM. > > In all the branch commotion recently, these patches fell out of our dev branch. > So we just recently integrated then into a 6.13 kvm-coco-queue based branch. We > need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no > issues, we can post the 6.13 rebase to included in kvm-coco-queue with > instructions on which patches to remove from kvm-coco-queue (i.e. the 16 > retries). > > > We also briefly touched on the TDX module behavior where guest operations can > lock NP PTEs. The kick solution doesn't require changing this functionally, but > it should still be done to help with debugging issues related to KVM's > contention solution. And so that KVM developers don't have to deal with customer escalations due to performance issues caused by known flaws in the TDX module. > > Thanks all for the discussion!