diff mbox series

[v1,15/40] i386/tdx: Add property sept-ve-disable for tdx-guest object

Message ID 20220802074750.2581308-16-xiaoyao.li@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX QEMU support | expand

Commit Message

Xiaoyao Li Aug. 2, 2022, 7:47 a.m. UTC
Bit 28, named SEPT_VE_DISABLE, disables	EPT violation conversion to #VE
on guest TD access of PENDING pages when set to 1. Some guest OS (e.g.,
Linux TD guest) may require this bit set as 1. Otherwise refuse to boot.

Add sept-ve-disable property for tdx-guest object, for user to configure
this bit.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 qapi/qom.json         |  4 +++-
 target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

Comments

Gerd Hoffmann Aug. 25, 2022, 11:36 a.m. UTC | #1
On Tue, Aug 02, 2022 at 03:47:25PM +0800, Xiaoyao Li wrote:
> Bit 28, named SEPT_VE_DISABLE, disables	EPT violation conversion to #VE
> on guest TD access of PENDING pages when set to 1. Some guest OS (e.g.,
> Linux TD guest) may require this bit set as 1. Otherwise refuse to boot.

--verbose please.  That somehow doesn't make sense to me.

A guest is either TDX-aware (which should be the case for linux 5.19+),
or it is not.  My expectation would be that guests which are not
TDX-aware will be disturbed by any #VE exception, not only the ones
triggered by EPT violations.  So I'm wondering what this config bit
actually is useful for ...

take care,
  Gerd
Xiaoyao Li Aug. 25, 2022, 2:42 p.m. UTC | #2
On 8/25/2022 7:36 PM, Gerd Hoffmann wrote:
> On Tue, Aug 02, 2022 at 03:47:25PM +0800, Xiaoyao Li wrote:
>> Bit 28, named SEPT_VE_DISABLE, disables	EPT violation conversion to #VE
>> on guest TD access of PENDING pages when set to 1. Some guest OS (e.g.,
>> Linux TD guest) may require this bit set as 1. Otherwise refuse to boot.
> 
> --verbose please.  That somehow doesn't make sense to me.
> 
> A guest is either TDX-aware (which should be the case for linux 5.19+),
> or it is not.  My expectation would be that guests which are not
> TDX-aware will be disturbed by any #VE exception, not only the ones
> triggered by EPT violations.  So I'm wondering what this config bit
> actually is useful for ...

This bit, including other properties of tdx-guest object, are supposed 
to be configured for TD only. On VM creation phase, user needs to decide 
if it's a TD (TDX VM) or non-TD (previous normal VM) by attaching 
tdx-guest object or not.

If it's a TD when VM creation, but the guest kernel is not 
TDX-capable/-aware, it's doomed to fail booting.

For TD guest kernel, it has its own reason to turn SEPT_VE on or off. 
E.g., linux TD guest requires SEPT_VE to be disabled to avoid #VE on 
syscall gap [1]. Frankly speaking, this bit is better to be configured 
by TD guest kernel, however current TDX architecture makes the design to 
let VMM configure.


[1]: TD pages that are not accepted cause a #VE exception.
It is possible for a hypervisor to take away a guest page
and thus trigger a #VE the next time it is accessed.
Normally the guest would just panic in such a case, but
for that it first needs to execute the #VE handler
reliably.

This can cause problems with the "system call gap": a malicious
hypervisor might trigger a #VE for example on the system call entry
code, and when a user process does a system call it would trigger a
and SYSCALL relies on the kernel code to switch to the kernel stack,
this would lead to kernel code running on the ring 3 stack.  This could
be exploited by a combination of malicious host and malicious ring 3
program to attack the kernel.


> take care,
>    Gerd
>
Gerd Hoffmann Aug. 26, 2022, 5:57 a.m. UTC | #3
Hi,
 
> For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> [1].

Why is that a problem for a TD guest kernel?  Installing exception
handlers is done quite early in the boot process, certainly before any
userspace code runs.  So I think we should never see a syscall without
a #VE handler being installed.  /me is confused.

Or do you want tell me linux has no #VE handler?

> Frankly speaking, this bit is better to be configured by TD guest
> kernel, however current TDX architecture makes the design to let VMM
> configure.

Indeed.  Requiring users to know guest kernel capabilities and manually
configuring the vmm accordingly looks fragile to me.

Even better would be to not have that bit in the first place and require
TD guests properly handle #VE exceptions.

> This can cause problems with the "system call gap": a malicious
> hypervisor might trigger a #VE for example on the system call entry
> code, and when a user process does a system call it would trigger a
> and SYSCALL relies on the kernel code to switch to the kernel stack,
> this would lead to kernel code running on the ring 3 stack.

Hmm?  Exceptions switch to kernel context too ...

take care,
  Gerd
Xiaoyao Li Sept. 2, 2022, 2:33 a.m. UTC | #4
On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
>    Hi,
>   
>> For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
>> linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
>> [1].
> 
> Why is that a problem for a TD guest kernel?  Installing exception
> handlers is done quite early in the boot process, certainly before any
> userspace code runs.  So I think we should never see a syscall without
> a #VE handler being installed.  /me is confused.
> 
> Or do you want tell me linux has no #VE handler?

The problem is not "no #VE handler" and Linux does have #VE handler. The 
problem is Linux doesn't want any (or certain) exception occurrence in 
syscall gap, it's not specific to #VE. Frankly, I don't understand the 
reason clearly, it's something related to IST used in x86 Linux kernel.

>> Frankly speaking, this bit is better to be configured by TD guest
>> kernel, however current TDX architecture makes the design to let VMM
>> configure.
> 
> Indeed.  Requiring users to know guest kernel capabilities and manually
> configuring the vmm accordingly looks fragile to me.
> 
> Even better would be to not have that bit in the first place and require
> TD guests properly handle #VE exceptions.
> 
>> This can cause problems with the "system call gap": a malicious
>> hypervisor might trigger a #VE for example on the system call entry
>> code, and when a user process does a system call it would trigger a
>> and SYSCALL relies on the kernel code to switch to the kernel stack,
>> this would lead to kernel code running on the ring 3 stack.
> 
> Hmm?  Exceptions switch to kernel context too ...
> 
> take care,
>    Gerd
>
Sean Christopherson Sept. 2, 2022, 2:52 a.m. UTC | #5
On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> >    Hi,
> > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > [1].
> > 
> > Why is that a problem for a TD guest kernel?  Installing exception
> > handlers is done quite early in the boot process, certainly before any
> > userspace code runs.  So I think we should never see a syscall without
> > a #VE handler being installed.  /me is confused.
> > 
> > Or do you want tell me linux has no #VE handler?
> 
> The problem is not "no #VE handler" and Linux does have #VE handler. The
> problem is Linux doesn't want any (or certain) exception occurrence in
> syscall gap, it's not specific to #VE. Frankly, I don't understand the
> reason clearly, it's something related to IST used in x86 Linux kernel.

The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
at the SYSCALL entry point runs with a userspaced-controlled RSP.  With TDX, a
malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
to run the #VE handler with a userspace stack.

The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
but ISTs are terrible because they don't play nice with re-entrancy (among other
reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
handler triggers another #VE at any point before IRET, the second #VE will clobber
the stack and hose the kernel.

It's possible to workaround this, e.g. change the IST entry at the very beginning
of the handler, but it's a maintenance burden.  Since the only reason to use an IST
is to guard against a malicious hypervisor, Linux decided it would be just as easy
and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
Gerd Hoffmann Sept. 2, 2022, 5:46 a.m. UTC | #6
On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > >    Hi,
> > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > > [1].
> > > 
> > > Why is that a problem for a TD guest kernel?  Installing exception
> > > handlers is done quite early in the boot process, certainly before any
> > > userspace code runs.  So I think we should never see a syscall without
> > > a #VE handler being installed.  /me is confused.
> > > 
> > > Or do you want tell me linux has no #VE handler?
> > 
> > The problem is not "no #VE handler" and Linux does have #VE handler. The
> > problem is Linux doesn't want any (or certain) exception occurrence in
> > syscall gap, it's not specific to #VE. Frankly, I don't understand the
> > reason clearly, it's something related to IST used in x86 Linux kernel.
> 
> The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> at the SYSCALL entry point runs with a userspaced-controlled RSP.  With TDX, a
> malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> to run the #VE handler with a userspace stack.
> 
> The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> but ISTs are terrible because they don't play nice with re-entrancy (among other
> reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> handler triggers another #VE at any point before IRET, the second #VE will clobber
> the stack and hose the kernel.
> v
> It's possible to workaround this, e.g. change the IST entry at the very beginning
> of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> is to guard against a malicious hypervisor, Linux decided it would be just as easy
> and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.

Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then?

Having a hypervisor-controlled config bit to protect against a malicious
hypervisor looks pointless to me ...

take care,
  Gerd
Sean Christopherson Sept. 2, 2022, 3:26 p.m. UTC | #7
On Fri, Sep 02, 2022, Gerd Hoffmann wrote:
> On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> > On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > > >    Hi,
> > > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > > > [1].
> > > > 
> > > > Why is that a problem for a TD guest kernel?  Installing exception
> > > > handlers is done quite early in the boot process, certainly before any
> > > > userspace code runs.  So I think we should never see a syscall without
> > > > a #VE handler being installed.  /me is confused.
> > > > 
> > > > Or do you want tell me linux has no #VE handler?
> > > 
> > > The problem is not "no #VE handler" and Linux does have #VE handler. The
> > > problem is Linux doesn't want any (or certain) exception occurrence in
> > > syscall gap, it's not specific to #VE. Frankly, I don't understand the
> > > reason clearly, it's something related to IST used in x86 Linux kernel.
> > 
> > The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> > at the SYSCALL entry point runs with a userspaced-controlled RSP.  With TDX, a
> > malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> > to run the #VE handler with a userspace stack.
> > 
> > The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> > but ISTs are terrible because they don't play nice with re-entrancy (among other
> > reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> > handler triggers another #VE at any point before IRET, the second #VE will clobber
> > the stack and hose the kernel.
> > v
> > It's possible to workaround this, e.g. change the IST entry at the very beginning
> > of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> > is to guard against a malicious hypervisor, Linux decided it would be just as easy
> > and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
> 
> Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then?
> 
> Having a hypervisor-controlled config bit to protect against a malicious
> hypervisor looks pointless to me ...

IIRC, all (most?) of the attributes are included in the attestation report, so a
guest/customer can refuse to provision secrets to the guest if the hypervisor is
misbehaving.

I'm guessing Intel made it an attribute and not a dynamic control knob to simplify
the TDX module implementation.
Gerd Hoffmann Sept. 2, 2022, 4:52 p.m. UTC | #8
On Fri, Sep 02, 2022 at 03:26:35PM +0000, Sean Christopherson wrote:
> On Fri, Sep 02, 2022, Gerd Hoffmann wrote:
> > 
> > Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then?
> > 
> > Having a hypervisor-controlled config bit to protect against a malicious
> > hypervisor looks pointless to me ...
> 
> IIRC, all (most?) of the attributes are included in the attestation report, so a
> guest/customer can refuse to provision secrets to the guest if the hypervisor is
> misbehaving.

Good.  I think we sorted all issues then.

Acked-by: Gerd Hoffmann <kraxel@redhat.com>

take care,
  Gerd
diff mbox series

Patch

diff --git a/qapi/qom.json b/qapi/qom.json
index 38177848abc1..2a5486bfed3e 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -835,10 +835,12 @@ 
 #
 # Properties for tdx-guest objects.
 #
+# @sept-ve-disable: bit 28 of TD attributes (default: 0)
+#
 # Since: 7.2
 ##
 { 'struct': 'TdxGuestProperties',
-  'data': { }}
+  'data': { '*sept-ve-disable': 'bool' } }
 
 ##
 # @ObjectType:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index ecb0205651bd..bf57f270ac9d 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -30,6 +30,8 @@ 
                                      (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
+
 #define TDX_ATTRIBUTES_MAX_BITS      64
 
 static FeatureMask tdx_attrs_ctrl_fields[TDX_ATTRIBUTES_MAX_BITS] = {
@@ -490,6 +492,24 @@  out:
     return r;
 }
 
+static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    return !!(tdx->attributes & TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE);
+}
+
+static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    if (value) {
+        tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+    } else {
+        tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+    }
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
@@ -505,6 +525,10 @@  static void tdx_guest_init(Object *obj)
     qemu_mutex_init(&tdx->lock);
 
     tdx->attributes = 0;
+
+    object_property_add_bool(obj, "sept-ve-disable",
+                             tdx_guest_get_sept_ve_disable,
+                             tdx_guest_set_sept_ve_disable);
 }
 
 static void tdx_guest_finalize(Object *obj)