Message ID | 20220802074750.2581308-16-xiaoyao.li@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | TDX QEMU support | expand |
On Tue, Aug 02, 2022 at 03:47:25PM +0800, Xiaoyao Li wrote: > Bit 28, named SEPT_VE_DISABLE, disables EPT violation conversion to #VE > on guest TD access of PENDING pages when set to 1. Some guest OS (e.g., > Linux TD guest) may require this bit set as 1. Otherwise refuse to boot. --verbose please. That somehow doesn't make sense to me. A guest is either TDX-aware (which should be the case for linux 5.19+), or it is not. My expectation would be that guests which are not TDX-aware will be disturbed by any #VE exception, not only the ones triggered by EPT violations. So I'm wondering what this config bit actually is useful for ... take care, Gerd
On 8/25/2022 7:36 PM, Gerd Hoffmann wrote: > On Tue, Aug 02, 2022 at 03:47:25PM +0800, Xiaoyao Li wrote: >> Bit 28, named SEPT_VE_DISABLE, disables EPT violation conversion to #VE >> on guest TD access of PENDING pages when set to 1. Some guest OS (e.g., >> Linux TD guest) may require this bit set as 1. Otherwise refuse to boot. > > --verbose please. That somehow doesn't make sense to me. > > A guest is either TDX-aware (which should be the case for linux 5.19+), > or it is not. My expectation would be that guests which are not > TDX-aware will be disturbed by any #VE exception, not only the ones > triggered by EPT violations. So I'm wondering what this config bit > actually is useful for ... This bit, including other properties of tdx-guest object, are supposed to be configured for TD only. On VM creation phase, user needs to decide if it's a TD (TDX VM) or non-TD (previous normal VM) by attaching tdx-guest object or not. If it's a TD when VM creation, but the guest kernel is not TDX-capable/-aware, it's doomed to fail booting. For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap [1]. Frankly speaking, this bit is better to be configured by TD guest kernel, however current TDX architecture makes the design to let VMM configure. [1]: TD pages that are not accepted cause a #VE exception. It is possible for a hypervisor to take away a guest page and thus trigger a #VE the next time it is accessed. Normally the guest would just panic in such a case, but for that it first needs to execute the #VE handler reliably. This can cause problems with the "system call gap": a malicious hypervisor might trigger a #VE for example on the system call entry code, and when a user process does a system call it would trigger a and SYSCALL relies on the kernel code to switch to the kernel stack, this would lead to kernel code running on the ring 3 stack. This could be exploited by a combination of malicious host and malicious ring 3 program to attack the kernel. > take care, > Gerd >
Hi, > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap > [1]. Why is that a problem for a TD guest kernel? Installing exception handlers is done quite early in the boot process, certainly before any userspace code runs. So I think we should never see a syscall without a #VE handler being installed. /me is confused. Or do you want tell me linux has no #VE handler? > Frankly speaking, this bit is better to be configured by TD guest > kernel, however current TDX architecture makes the design to let VMM > configure. Indeed. Requiring users to know guest kernel capabilities and manually configuring the vmm accordingly looks fragile to me. Even better would be to not have that bit in the first place and require TD guests properly handle #VE exceptions. > This can cause problems with the "system call gap": a malicious > hypervisor might trigger a #VE for example on the system call entry > code, and when a user process does a system call it would trigger a > and SYSCALL relies on the kernel code to switch to the kernel stack, > this would lead to kernel code running on the ring 3 stack. Hmm? Exceptions switch to kernel context too ... take care, Gerd
On 8/26/2022 1:57 PM, Gerd Hoffmann wrote: > Hi, > >> For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., >> linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap >> [1]. > > Why is that a problem for a TD guest kernel? Installing exception > handlers is done quite early in the boot process, certainly before any > userspace code runs. So I think we should never see a syscall without > a #VE handler being installed. /me is confused. > > Or do you want tell me linux has no #VE handler? The problem is not "no #VE handler" and Linux does have #VE handler. The problem is Linux doesn't want any (or certain) exception occurrence in syscall gap, it's not specific to #VE. Frankly, I don't understand the reason clearly, it's something related to IST used in x86 Linux kernel. >> Frankly speaking, this bit is better to be configured by TD guest >> kernel, however current TDX architecture makes the design to let VMM >> configure. > > Indeed. Requiring users to know guest kernel capabilities and manually > configuring the vmm accordingly looks fragile to me. > > Even better would be to not have that bit in the first place and require > TD guests properly handle #VE exceptions. > >> This can cause problems with the "system call gap": a malicious >> hypervisor might trigger a #VE for example on the system call entry >> code, and when a user process does a system call it would trigger a >> and SYSCALL relies on the kernel code to switch to the kernel stack, >> this would lead to kernel code running on the ring 3 stack. > > Hmm? Exceptions switch to kernel context too ... > > take care, > Gerd >
On Fri, Sep 02, 2022, Xiaoyao Li wrote: > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote: > > Hi, > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap > > > [1]. > > > > Why is that a problem for a TD guest kernel? Installing exception > > handlers is done quite early in the boot process, certainly before any > > userspace code runs. So I think we should never see a syscall without > > a #VE handler being installed. /me is confused. > > > > Or do you want tell me linux has no #VE handler? > > The problem is not "no #VE handler" and Linux does have #VE handler. The > problem is Linux doesn't want any (or certain) exception occurrence in > syscall gap, it's not specific to #VE. Frankly, I don't understand the > reason clearly, it's something related to IST used in x86 Linux kernel. The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction at the SYSCALL entry point runs with a userspaced-controlled RSP. With TDX, a malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel to run the #VE handler with a userspace stack. The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE, but ISTs are terrible because they don't play nice with re-entrancy (among other reasons). The RSP used for IST-based handlers is hardcoded, and so if a #VE handler triggers another #VE at any point before IRET, the second #VE will clobber the stack and hose the kernel. It's possible to workaround this, e.g. change the IST entry at the very beginning of the handler, but it's a maintenance burden. Since the only reason to use an IST is to guard against a malicious hypervisor, Linux decided it would be just as easy and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote: > On Fri, Sep 02, 2022, Xiaoyao Li wrote: > > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote: > > > Hi, > > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., > > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap > > > > [1]. > > > > > > Why is that a problem for a TD guest kernel? Installing exception > > > handlers is done quite early in the boot process, certainly before any > > > userspace code runs. So I think we should never see a syscall without > > > a #VE handler being installed. /me is confused. > > > > > > Or do you want tell me linux has no #VE handler? > > > > The problem is not "no #VE handler" and Linux does have #VE handler. The > > problem is Linux doesn't want any (or certain) exception occurrence in > > syscall gap, it's not specific to #VE. Frankly, I don't understand the > > reason clearly, it's something related to IST used in x86 Linux kernel. > > The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction > at the SYSCALL entry point runs with a userspaced-controlled RSP. With TDX, a > malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel > to run the #VE handler with a userspace stack. > > The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE, > but ISTs are terrible because they don't play nice with re-entrancy (among other > reasons). The RSP used for IST-based handlers is hardcoded, and so if a #VE > handler triggers another #VE at any point before IRET, the second #VE will clobber > the stack and hose the kernel. > v > It's possible to workaround this, e.g. change the IST entry at the very beginning > of the handler, but it's a maintenance burden. Since the only reason to use an IST > is to guard against a malicious hypervisor, Linux decided it would be just as easy > and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely. Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then? Having a hypervisor-controlled config bit to protect against a malicious hypervisor looks pointless to me ... take care, Gerd
On Fri, Sep 02, 2022, Gerd Hoffmann wrote: > On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote: > > On Fri, Sep 02, 2022, Xiaoyao Li wrote: > > > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote: > > > > Hi, > > > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g., > > > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap > > > > > [1]. > > > > > > > > Why is that a problem for a TD guest kernel? Installing exception > > > > handlers is done quite early in the boot process, certainly before any > > > > userspace code runs. So I think we should never see a syscall without > > > > a #VE handler being installed. /me is confused. > > > > > > > > Or do you want tell me linux has no #VE handler? > > > > > > The problem is not "no #VE handler" and Linux does have #VE handler. The > > > problem is Linux doesn't want any (or certain) exception occurrence in > > > syscall gap, it's not specific to #VE. Frankly, I don't understand the > > > reason clearly, it's something related to IST used in x86 Linux kernel. > > > > The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction > > at the SYSCALL entry point runs with a userspaced-controlled RSP. With TDX, a > > malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel > > to run the #VE handler with a userspace stack. > > > > The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE, > > but ISTs are terrible because they don't play nice with re-entrancy (among other > > reasons). The RSP used for IST-based handlers is hardcoded, and so if a #VE > > handler triggers another #VE at any point before IRET, the second #VE will clobber > > the stack and hose the kernel. > > v > > It's possible to workaround this, e.g. change the IST entry at the very beginning > > of the handler, but it's a maintenance burden. Since the only reason to use an IST > > is to guard against a malicious hypervisor, Linux decided it would be just as easy > > and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely. > > Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then? > > Having a hypervisor-controlled config bit to protect against a malicious > hypervisor looks pointless to me ... IIRC, all (most?) of the attributes are included in the attestation report, so a guest/customer can refuse to provision secrets to the guest if the hypervisor is misbehaving. I'm guessing Intel made it an attribute and not a dynamic control knob to simplify the TDX module implementation.
On Fri, Sep 02, 2022 at 03:26:35PM +0000, Sean Christopherson wrote: > On Fri, Sep 02, 2022, Gerd Hoffmann wrote: > > > > Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then? > > > > Having a hypervisor-controlled config bit to protect against a malicious > > hypervisor looks pointless to me ... > > IIRC, all (most?) of the attributes are included in the attestation report, so a > guest/customer can refuse to provision secrets to the guest if the hypervisor is > misbehaving. Good. I think we sorted all issues then. Acked-by: Gerd Hoffmann <kraxel@redhat.com> take care, Gerd
diff --git a/qapi/qom.json b/qapi/qom.json index 38177848abc1..2a5486bfed3e 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -835,10 +835,12 @@ # # Properties for tdx-guest objects. # +# @sept-ve-disable: bit 28 of TD attributes (default: 0) +# # Since: 7.2 ## { 'struct': 'TdxGuestProperties', - 'data': { }} + 'data': { '*sept-ve-disable': 'bool' } } ## # @ObjectType: diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c index ecb0205651bd..bf57f270ac9d 100644 --- a/target/i386/kvm/tdx.c +++ b/target/i386/kvm/tdx.c @@ -30,6 +30,8 @@ (1U << KVM_FEATURE_PV_SCHED_YIELD) | \ (1U << KVM_FEATURE_MSI_EXT_DEST_ID)) +#define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE BIT_ULL(28) + #define TDX_ATTRIBUTES_MAX_BITS 64 static FeatureMask tdx_attrs_ctrl_fields[TDX_ATTRIBUTES_MAX_BITS] = { @@ -490,6 +492,24 @@ out: return r; } +static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp) +{ + TdxGuest *tdx = TDX_GUEST(obj); + + return !!(tdx->attributes & TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE); +} + +static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp) +{ + TdxGuest *tdx = TDX_GUEST(obj); + + if (value) { + tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE; + } else { + tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE; + } +} + /* tdx guest */ OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest, tdx_guest, @@ -505,6 +525,10 @@ static void tdx_guest_init(Object *obj) qemu_mutex_init(&tdx->lock); tdx->attributes = 0; + + object_property_add_bool(obj, "sept-ve-disable", + tdx_guest_get_sept_ve_disable, + tdx_guest_set_sept_ve_disable); } static void tdx_guest_finalize(Object *obj)
Bit 28, named SEPT_VE_DISABLE, disables EPT violation conversion to #VE on guest TD access of PENDING pages when set to 1. Some guest OS (e.g., Linux TD guest) may require this bit set as 1. Otherwise refuse to boot. Add sept-ve-disable property for tdx-guest object, for user to configure this bit. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> --- qapi/qom.json | 4 +++- target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++ 2 files changed, 27 insertions(+), 1 deletion(-)