mbox series

[RFC,0/2] SEPT SEAMCALL retry proposal

Message ID 20241121115139.26338-1-yan.y.zhao@intel.com (mailing list archive)
Headers show
Series SEPT SEAMCALL retry proposal | expand

Message

Yan Zhao Nov. 21, 2024, 11:51 a.m. UTC
This SEPT SEAMCALL retry proposal aims to remove patch
"[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
[1] at the tail of v2 series "TDX MMU Part 2".

==brief history==

In the v1 series 'TDX MMU Part 2', there were several discussions regarding
the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
SEAMCALL wrapper tdx_seamcall_sept().

The lock status of each SEAMCALL relevant to KVM was analyzed in [2].

The conclusion was that 16 retries was necessary because
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.

  When the TDX module detects that EPT violations are caused by the same
  RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
  will take SEPT tree lock and contend with tdh_mem*().

- tdg_mem_page_accept() can contend with other tdh_mem*().


Sean provided several good suggestions[3], including:
- Implement retries within TDX code when the TDP MMU returns
  RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
  0-step mitigation.
- It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
  inside TDX module.
- Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
  prevent tdh_vp_enter() during page uninstallation.

Yan later found out that only retry RET_PF_RETRY_FOZEN within TDX code is
insufficient to prevent 0-step mitigation [4].

Rick and Yan then consulted TDX module team with findings that:
- The threshold of zero-step mitigation is counted per vCPU.
  It's of value 6 because

    "There can be at most 2 mapping faults on instruction fetch
     (x86 macro-instructions length is at most 15 bytes) when the
     instruction crosses page boundary; then there can be at most 2
     mapping faults for each memory operand, when the operand crosses
     page boundary. For most of x86 macro-instructions, there are up to 2
     memory operands and each one of them is small, which brings us to
     maximum 2+2*2 = 6 legal mapping faults."
  
- Besides tdg_mem_page_accept(),  tdg_mem_page_attr_rd/wr() can also 
  contend with SEAMCALLs tdh_mem*().

So, we decided to make a proposal to tolerate 0-step mitigation.

==proposal details==

The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
together which are required for page installation and uninstallation.

These SEAMCALLs can be divided into three groups:
Group 1: tdh_mem_page_add().
         The SEAMCALL is invoked only during TD build time and therefore
         KVM has ensured that no contention will occur.

         Proposal: (as in patch 1)
         Just return error when TDX_OPERAND_BUSY is found.

Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
         These two SEAMCALLs are invoked for page installation. 
         They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
	 tdg_mem_page_attr_rd/wr().

         Proposal: (as in patch 1)
         - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
           to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
         
         - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
           long as there are no pending signals/interrupts.

         The retry inside TDX aims to reduce the count of tdh_vp_enter()
         before resolving EPT violations in the local vCPU, thereby
         minimizing contentions with other vCPUs. However, it can't
         completely eliminate 0-step mitigation as it exits when there're
         pending signals/interrupts and does not (and cannot) remove the
         tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.

         Resources    SHARED  users      EXCLUSIVE users
         ------------------------------------------------------------
         SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
                    tdh_mem_page_aug
         ------------------------------------------------------------
         SEPT entry                      tdh_mem_sept_add (Host lock)
                                         tdh_mem_page_aug (Host lock)
                                         tdg_mem_page_accept (Guest lock)
                                         tdg_mem_page_attr_rd (Guest lock)
                                         tdg_mem_page_attr_wr (Guest lock)

Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
         These three SEAMCALLs are invoked for page uninstallation, with
         KVM mmu_lock held for writing.

         Resources     SHARED users      EXCLUSIVE users
         ------------------------------------------------------------
         TDCS epoch    tdh_vp_enter      tdh_mem_track
         ------------------------------------------------------------
         SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
                                         tdh_mem_range_block   
         ------------------------------------------------------------
         SEPT entry                      tdh_mem_range_block (Host lock)
                                         tdh_mem_page_remove (Host lock)
                                         tdg_mem_page_accept (Guest lock)
                                         tdg_mem_page_attr_rd (Guest lock)
                                         tdg_mem_page_attr_wr (Guest lock)

         Proposal: (as in patch 2)
         - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
           once.
         - During the retry, kick off all vCPUs and prevent any vCPU from
           entering to avoid potential contentions.

         This is because tdh_vp_enter() and TDCALLs are not protected by
         KVM mmu_lock, and remove_external_spte() in KVM must not fail.



SEAMCALL                Lock Type        Resource 
-----------------------------Group 1--------------------------------
tdh_mem_page_add        EXCLUSIVE        TDR
                        NO_LOCK          TDCS
                        NO_LOCK          SEPT tree
                        EXCLUSIVE        page to add

----------------------------Group 2--------------------------------
tdh_mem_sept_add        SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to add


tdh_mem_page_aug        SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to aug

----------------------------Group 3--------------------------------
tdh_mem_range_block     SHARED           TDR
                        SHARED           TDCS
                        EXCLUSIVE        SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify

tdh_mem_track           SHARED           TDR
                        SHARED           TDCS
                        EXCLUSIVE        TDCS epoch

tdh_mem_page_remove     SHARED           TDR
                        SHARED           TDCS
                        SHARED           SEPT tree
                        HOST,EXCLUSIVE   SEPT entry to modify
                        EXCLUSIVE        page to remove


[1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
[2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
[3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
[4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com

Yan Zhao (2):
  KVM: TDX: Retry in TDX when installing TD private/sept pages
  KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal

 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/mmu/mmu.c          |   2 +-
 arch/x86/kvm/vmx/tdx.c          | 102 +++++++++++++++++++++++++-------
 3 files changed, 85 insertions(+), 21 deletions(-)

Comments

Edgecombe, Rick P Nov. 26, 2024, 12:46 a.m. UTC | #1
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> ==proposal details==
> 
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
> 
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
>          The SEAMCALL is invoked only during TD build time and therefore
>          KVM has ensured that no contention will occur.
> 
>          Proposal: (as in patch 1)
>          Just return error when TDX_OPERAND_BUSY is found.
> 
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation. 
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> 	 tdg_mem_page_attr_rd/wr().

We did verify with TDX module folks that the TDX module could be changed to not
take the sept host priority lock for zero entries (that happen during the guest
operations). In that case, I think we shouldn't expect contention for
tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for
tdh_vp_enter() though.


> 
>          Proposal: (as in patch 1)
>          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
>            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>          
>          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
>            long as there are no pending signals/interrupts.
> 
>          The retry inside TDX aims to reduce the count of tdh_vp_enter()
>          before resolving EPT violations in the local vCPU, thereby
>          minimizing contentions with other vCPUs. However, it can't
>          completely eliminate 0-step mitigation as it exits when there're
>          pending signals/interrupts and does not (and cannot) remove the
>          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> 
>          Resources    SHARED  users      EXCLUSIVE users
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
>                     tdh_mem_page_aug
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_sept_add (Host lock)
>                                          tdh_mem_page_aug (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
>          These three SEAMCALLs are invoked for page uninstallation, with
>          KVM mmu_lock held for writing.
> 
>          Resources     SHARED users      EXCLUSIVE users
>          ------------------------------------------------------------
>          TDCS epoch    tdh_vp_enter      tdh_mem_track
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
>                                          tdh_mem_range_block   
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_range_block (Host lock)
>                                          tdh_mem_page_remove (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
>          Proposal: (as in patch 2)
>          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
>            once.
>          - During the retry, kick off all vCPUs and prevent any vCPU from
>            entering to avoid potential contentions.
> 
>          This is because tdh_vp_enter() and TDCALLs are not protected by
>          KVM mmu_lock, and remove_external_spte() in KVM must not fail.

The solution for group 3 actually doesn't look too bad at all to me. At least
from code and complexity wise. It's pretty compact, doesn't add any locks, and
limited to the tdx.c code. Although, I didn't evaluate the implementation 
correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.

Were you able to test the fallback path at all? Can we think of any real
situations where it could be burdensome?

One other thing to note I think, is that group 3 are part of no-fail operations.
The core KVM calling code doesn't have the understanding of a failure there. So
in this scheme of not avoiding contention we have to succeed before returning,
where group 1 and 2 can fail, so don't need the special fallback scheme.
Yan Zhao Nov. 26, 2024, 6:24 a.m. UTC | #2
On Tue, Nov 26, 2024 at 08:46:51AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > ==proposal details==
> > 
> > The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> > together which are required for page installation and uninstallation.
> > 
> > These SEAMCALLs can be divided into three groups:
> > Group 1: tdh_mem_page_add().
> >          The SEAMCALL is invoked only during TD build time and therefore
> >          KVM has ensured that no contention will occur.
> > 
> >          Proposal: (as in patch 1)
> >          Just return error when TDX_OPERAND_BUSY is found.
> > 
> > Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
> >          These two SEAMCALLs are invoked for page installation. 
> >          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> > 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> > 	 tdg_mem_page_attr_rd/wr().
> 
> We did verify with TDX module folks that the TDX module could be changed to not
> take the sept host priority lock for zero entries (that happen during the guest
> operations). In that case, I think we shouldn't expect contention for
> tdh_mem_sept_add() and tdh_mem_page_aug() from them? We still need it for
> tdh_vp_enter() though.
Ah, you are right.

I previously incorrectly thought TDX module will avoid locking free entries for
tdg_mem_page_accept() only.


> 
> 
> > 
> >          Proposal: (as in patch 1)
> >          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
> >            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
> >          
> >          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
> >            long as there are no pending signals/interrupts.
> > 
> >          The retry inside TDX aims to reduce the count of tdh_vp_enter()
> >          before resolving EPT violations in the local vCPU, thereby
> >          minimizing contentions with other vCPUs. However, it can't
> >          completely eliminate 0-step mitigation as it exits when there're
> >          pending signals/interrupts and does not (and cannot) remove the
> >          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> > 
> >          Resources    SHARED  users      EXCLUSIVE users
> >          ------------------------------------------------------------
> >          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
> >                     tdh_mem_page_aug
> >          ------------------------------------------------------------
> >          SEPT entry                      tdh_mem_sept_add (Host lock)
> >                                          tdh_mem_page_aug (Host lock)
> >                                          tdg_mem_page_accept (Guest lock)
> >                                          tdg_mem_page_attr_rd (Guest lock)
> >                                          tdg_mem_page_attr_wr (Guest lock)
> > 
> > Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
> >          These three SEAMCALLs are invoked for page uninstallation, with
> >          KVM mmu_lock held for writing.
> > 
> >          Resources     SHARED users      EXCLUSIVE users
> >          ------------------------------------------------------------
> >          TDCS epoch    tdh_vp_enter      tdh_mem_track
> >          ------------------------------------------------------------
> >          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
> >                                          tdh_mem_range_block   
> >          ------------------------------------------------------------
> >          SEPT entry                      tdh_mem_range_block (Host lock)
> >                                          tdh_mem_page_remove (Host lock)
> >                                          tdg_mem_page_accept (Guest lock)
> >                                          tdg_mem_page_attr_rd (Guest lock)
> >                                          tdg_mem_page_attr_wr (Guest lock)
> > 
> >          Proposal: (as in patch 2)
> >          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
> >            once.
> >          - During the retry, kick off all vCPUs and prevent any vCPU from
> >            entering to avoid potential contentions.
> > 
> >          This is because tdh_vp_enter() and TDCALLs are not protected by
> >          KVM mmu_lock, and remove_external_spte() in KVM must not fail.
> 
> The solution for group 3 actually doesn't look too bad at all to me. At least
> from code and complexity wise. It's pretty compact, doesn't add any locks, and
> limited to the tdx.c code. Although, I didn't evaluate the implementation 
> correctness of tdx_no_vcpus_enter_start() and tdx_no_vcpus_enter_stop() yet.
> 
> Were you able to test the fallback path at all? Can we think of any real
> situations where it could be burdensome?
Yes, I did some negative tests to fail block/track/remove to check if the
tdx_no_vcpus_enter*() work.

Even without those negative tests, it's not rare for tdh_mem_track() to return
busy due to its contention with tdh_vp_enter().

Given that normally it's not frequent to find tdh_mem_range_block() or
tdh_mem_page_remove() to return busy (if we reduce the frequency of zero-step
mitigation) and that tdh_mem_track() will kick off all vCPUs later any way, I
think it's acceptable to do the tdx_no_vcpus_enter*() stuffs in the page removal
path.


> One other thing to note I think, is that group 3 are part of no-fail operations.
> The core KVM calling code doesn't have the understanding of a failure there. So
> in this scheme of not avoiding contention we have to succeed before returning,
> where group 1 and 2 can fail, so don't need the special fallback scheme.
Yes, exactly.
Yan Zhao Dec. 13, 2024, 1:01 a.m. UTC | #3
On Thu, Nov 21, 2024 at 07:51:39PM +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".
> 
> ==brief history==
> 
> In the v1 series 'TDX MMU Part 2', there were several discussions regarding
> the necessity of retrying SEPT-related SEAMCALLs up to 16 times within the
> SEAMCALL wrapper tdx_seamcall_sept().
> 
> The lock status of each SEAMCALL relevant to KVM was analyzed in [2].
> 
> The conclusion was that 16 retries was necessary because
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
>   When the TDX module detects that EPT violations are caused by the same
>   RIP as in the last tdh_vp_enter() for 6 consecutive times, tdh_vp_enter()
>   will take SEPT tree lock and contend with tdh_mem*().
> 
> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> 
> Sean provided several good suggestions[3], including:
> - Implement retries within TDX code when the TDP MMU returns
>   RET_PF_RETRY_FOZEN (for RET_PF_RETRY and frozen SPTE) to avoid triggering
>   0-step mitigation.
> - It's not necessary for tdg_mem_page_accept() to contend with tdh_mem*()
>   inside TDX module.
> - Use a method similar to KVM_REQ_MCLOCK_INPROGRESS to kick off vCPUs and
>   prevent tdh_vp_enter() during page uninstallation.
> 
> Yan later found out that only retry RET_PF_RETRY_FOZEN within TDX code is
> insufficient to prevent 0-step mitigation [4].
> 
> Rick and Yan then consulted TDX module team with findings that:
> - The threshold of zero-step mitigation is counted per vCPU.
>   It's of value 6 because
> 
>     "There can be at most 2 mapping faults on instruction fetch
>      (x86 macro-instructions length is at most 15 bytes) when the
>      instruction crosses page boundary; then there can be at most 2
>      mapping faults for each memory operand, when the operand crosses
>      page boundary. For most of x86 macro-instructions, there are up to 2
>      memory operands and each one of them is small, which brings us to
>      maximum 2+2*2 = 6 legal mapping faults."
>   
> - Besides tdg_mem_page_accept(),  tdg_mem_page_attr_rd/wr() can also 
>   contend with SEAMCALLs tdh_mem*().
> 
> So, we decided to make a proposal to tolerate 0-step mitigation.
> 
> ==proposal details==
> 
> The proposal discusses SEPT-related and TLB-flush-related SEAMCALLs
> together which are required for page installation and uninstallation.
> 
> These SEAMCALLs can be divided into three groups:
> Group 1: tdh_mem_page_add().
>          The SEAMCALL is invoked only during TD build time and therefore
>          KVM has ensured that no contention will occur.
> 
>          Proposal: (as in patch 1)
>          Just return error when TDX_OPERAND_BUSY is found.
> 
> Group 2: tdh_mem_sept_add(), tdh_mem_page_aug().
>          These two SEAMCALLs are invoked for page installation. 
>          They return TDX_OPERAND_BUSY when contending with tdh_vp_enter()
> 	 (due to 0-step mitigation) or TDCALLs tdg_mem_page_accept(),
> 	 tdg_mem_page_attr_rd/wr().
> 
>          Proposal: (as in patch 1)
>          - Return -EBUSY in KVM for TDX_OPERAND_BUSY to cause RET_PF_RETRY
>            to be returned in kvm_mmu_do_page_fault()/kvm_mmu_page_fault().
>          
>          - Inside TDX's EPT violation handler, retry on RET_PF_RETRY as
>            long as there are no pending signals/interrupts.
Alternatively, we can have the tdx_handle_ept_violation() do not retry
internally TDX.
Instead, keep the 16 times retries in tdx_seamcall_sept() for                    
tdh_mem_sept_add() and tdh_mem_page_aug() only, i.e. only for SEAMCALLs in       
Group 2.

> 
>          The retry inside TDX aims to reduce the count of tdh_vp_enter()
>          before resolving EPT violations in the local vCPU, thereby
>          minimizing contentions with other vCPUs. However, it can't
>          completely eliminate 0-step mitigation as it exits when there're
>          pending signals/interrupts and does not (and cannot) remove the
>          tdh_vp_enter() caused by KVM_EXIT_MEMORY_FAULT.
> 
>          Resources    SHARED  users      EXCLUSIVE users
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_sept_add     tdh_vp_enter(0-step mitigation)
>                     tdh_mem_page_aug
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_sept_add (Host lock)
>                                          tdh_mem_page_aug (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
> Group 3: tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove().
>          These three SEAMCALLs are invoked for page uninstallation, with
>          KVM mmu_lock held for writing.
> 
>          Resources     SHARED users      EXCLUSIVE users
>          ------------------------------------------------------------
>          TDCS epoch    tdh_vp_enter      tdh_mem_track
>          ------------------------------------------------------------
>          SEPT tree  tdh_mem_page_remove  tdh_vp_enter (0-step mitigation)
>                                          tdh_mem_range_block   
>          ------------------------------------------------------------
>          SEPT entry                      tdh_mem_range_block (Host lock)
>                                          tdh_mem_page_remove (Host lock)
>                                          tdg_mem_page_accept (Guest lock)
>                                          tdg_mem_page_attr_rd (Guest lock)
>                                          tdg_mem_page_attr_wr (Guest lock)
> 
>          Proposal: (as in patch 2)
>          - Upon detection of TDX_OPERAND_BUSY, retry each SEAMCALL only
>            once.
>          - During the retry, kick off all vCPUs and prevent any vCPU from
>            entering to avoid potential contentions.
> 
>          This is because tdh_vp_enter() and TDCALLs are not protected by
>          KVM mmu_lock, and remove_external_spte() in KVM must not fail.
> 
> 
> 
> SEAMCALL                Lock Type        Resource 
> -----------------------------Group 1--------------------------------
> tdh_mem_page_add        EXCLUSIVE        TDR
>                         NO_LOCK          TDCS
>                         NO_LOCK          SEPT tree
>                         EXCLUSIVE        page to add
> 
> ----------------------------Group 2--------------------------------
> tdh_mem_sept_add        SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to add
> 
> 
> tdh_mem_page_aug        SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to aug
> 
> ----------------------------Group 3--------------------------------
> tdh_mem_range_block     SHARED           TDR
>                         SHARED           TDCS
>                         EXCLUSIVE        SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
> 
> tdh_mem_track           SHARED           TDR
>                         SHARED           TDCS
>                         EXCLUSIVE        TDCS epoch
> 
> tdh_mem_page_remove     SHARED           TDR
>                         SHARED           TDCS
>                         SHARED           SEPT tree
>                         HOST,EXCLUSIVE   SEPT entry to modify
>                         EXCLUSIVE        page to remove
> 
> 
> [1] https://lore.kernel.org/all/20241112073909.22326-1-yan.y.zhao@intel.com
> [2] https://lore.kernel.org/kvm/ZuP5eNXFCljzRgWo@yzhao56-desk.sh.intel.com
> [3] https://lore.kernel.org/kvm/ZuR09EqzU1WbQYGd@google.com
> [4] https://lore.kernel.org/kvm/Zw%2FKElXSOf1xqLE7@yzhao56-desk.sh.intel.com
> 
> Yan Zhao (2):
>   KVM: TDX: Retry in TDX when installing TD private/sept pages
>   KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal
> 
>  arch/x86/include/asm/kvm_host.h |   2 +
>  arch/x86/kvm/mmu/mmu.c          |   2 +-
>  arch/x86/kvm/vmx/tdx.c          | 102 +++++++++++++++++++++++++-------
>  3 files changed, 85 insertions(+), 21 deletions(-)
> 
> -- 
> 2.43.2
>
Edgecombe, Rick P Dec. 17, 2024, 5 p.m. UTC | #4
On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> This SEPT SEAMCALL retry proposal aims to remove patch
> "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> [1] at the tail of v2 series "TDX MMU Part 2".

We discussed this on the PUCK call. A couple alternatives were considered:
 - Avoiding 0-step. To handle signals kicking to userspace we could try to add
code to generate synthetic EPT violations if KVM thinks the 0-step mitigation
might be active (i.e. the fault was not resolved). The consensus was that this
would be continuing battle and possibly impossible due normal guest behavior
triggering the mitigation. 
 - Pre-faulting all S-EPT, such that contention with AUG won't happen. The
discussion was that this would only be a temporary solution as the MMU
operations get more complicated (huge pages, etc). Also there is also
private/shared conversions and memory hotplug already.

So we will proceed with this kick+lock+retry solution. The reasoning is to
optimize for the normal non-contention path, without having an overly
complicated solution for KVM.

In all the branch commotion recently, these patches fell out of our dev branch.
So we just recently integrated then into a 6.13 kvm-coco-queue based branch. We
need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no
issues, we can post the 6.13 rebase to included in kvm-coco-queue with
instructions on which patches to remove from kvm-coco-queue (i.e. the 16
retries).


We also briefly touched on the TDX module behavior where guest operations can
lock NP PTEs. The kick solution doesn't require changing this functionally, but
it should still be done to help with debugging issues related to KVM's
contention solution.

Thanks all for the discussion!
Sean Christopherson Dec. 17, 2024, 11:18 p.m. UTC | #5
On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-11-21 at 19:51 +0800, Yan Zhao wrote:
> > This SEPT SEAMCALL retry proposal aims to remove patch
> > "[HACK] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT"
> > [1] at the tail of v2 series "TDX MMU Part 2".
> 
> We discussed this on the PUCK call. A couple alternatives were considered:
>  - Avoiding 0-step. To handle signals kicking to userspace we could try to add
> code to generate synthetic EPT violations if KVM thinks the 0-step mitigation
> might be active (i.e. the fault was not resolved). The consensus was that this
> would be continuing battle and possibly impossible due normal guest behavior
> triggering the mitigation. 

Specifically, the TDX Module takes its write-lock if the guest takes EPT
violations exits on the same RIP 6 times, i.e. detects forward progress based
purely on the RIP at entry vs. exit.  So a guest that is touching memory in a
loop could trigger zero-step checking even if KVM promptly fixes every EPT
violation.

>  - Pre-faulting all S-EPT, such that contention with AUG won't happen. The
> discussion was that this would only be a temporary solution as the MMU
> operations get more complicated (huge pages, etc). Also there is also
> private/shared conversions and memory hotplug already.
> 
> So we will proceed with this kick+lock+retry solution. The reasoning is to
> optimize for the normal non-contention path, without having an overly
> complicated solution for KVM.
> 
> In all the branch commotion recently, these patches fell out of our dev branch.
> So we just recently integrated then into a 6.13 kvm-coco-queue based branch. We
> need to perform some regression tests based on 6.13 TDP MMU changes. Assuming no
> issues, we can post the 6.13 rebase to included in kvm-coco-queue with
> instructions on which patches to remove from kvm-coco-queue (i.e. the 16
> retries).
> 
> 
> We also briefly touched on the TDX module behavior where guest operations can
> lock NP PTEs. The kick solution doesn't require changing this functionally, but
> it should still be done to help with debugging issues related to KVM's
> contention solution.

And so that KVM developers don't have to deal with customer escalations due to
performance issues caused by known flaws in the TDX module.

> 
> Thanks all for the discussion!