diff mbox series

[1/9] x86emul: support LKGS

Message ID 1eb21ece-9d33-d8e1-1c2b-c682dbb1cda1@suse.com (mailing list archive)
State Superseded
Series x86emul: misc additions

Commit Message

Jan Beulich April 4, 2023, 2:49 p.m. UTC
Provide support for this insn, which is a prereq to FRED. CPUID-wise
introduce both its and FRED's bit at this occasion, thus allowing to
also express the dependency right away.

While adding a testcase, also add a SWAPGS one. In order to not affect
the behavior of pre-existing tests, install write_{segment,msr} hooks
only transiently.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Instead of ->read_segment() we could of course also use ->read_msr() to
fetch the original GS base. I don't think I can see a clear advantage of
either approach; the way it's done it matches how we handle SWAPGS.

For PV save_segments() would need adjustment, but the insn being
restricted to ring 0 means PV guests can't use it anyway (unless we
wanted to emulate it as another privileged insn).
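The architectural effect exercised by the two testcases can be sketched with a minimal Python model of the two GS bases (class and attribute names here are illustrative, not the harness's):

```python
class GsState:
    """Toy model of the two GS base registers on x86-64."""

    def __init__(self, base, shadow):
        self.base = base      # in-context GS base (the segment's base)
        self.shadow = shadow  # out-of-context base (MSR_SHADOW_GS_BASE)

    def swapgs(self):
        """SWAPGS: exchange the two bases."""
        self.base, self.shadow = self.shadow, self.base

    def lkgs(self, descriptor_base):
        """LKGS: load the (at most 32-bit) base taken from the selected
        descriptor into the *shadow* base, leaving the in-context base
        untouched - unlike MOV to GS, which replaces the in-context one."""
        self.shadow = descriptor_base & 0xffffffff
```

This mirrors the expectations in the testcases below: SWAPGS exchanges the two values, while LKGS (here fed the base already looked up from the selector's descriptor - 0 for a null selector) replaces only the shadow one.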

Comments

Andrew Cooper April 4, 2023, 9:54 p.m. UTC | #1
On 04/04/2023 3:49 pm, Jan Beulich wrote:
> Provide support for this insn, which is a prereq to FRED. CPUID-wise
> introduce both its and FRED's bit at this occasion, thus allowing to
> also express the dependency right away.
>
> While adding a testcase, also add a SWAPGS one. In order to not affect
> the behavior of pre-existing tests, install write_{segment,msr} hooks
> only transiently.

IMO, the emulator is already complicated enough without us having
fallback logic to cope with callers that don't set up all the hooks.

Nor do I think making these hooks transient in the test harness is a
clever idea.

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Instead of ->read_segment() we could of course also use ->read_msr() to
> fetch the original GS base. I don't think I can see a clear advantage of
> either approach; the way it's done it matches how we handle SWAPGS.

read_segment() is a much shorter logic chain under the hood, so will be
marginally faster, but it will be a couple of unnecessary VMREADs (on
Intel at least).

We could expose the get/set reg paths for cases where we know we're not
going to need sanity checks, but I'm not sure it's worth it in this case.

> For PV save_segments() would need adjustment, but the insn being
> restricted to ring 0 means PV guests can't use it anyway (unless we
> wanted to emulate it as another privileged insn).

I know, it's on the list.

What is rather irritating is that, depending on whether FRED is active,
it's the opposite way round whether the guest user or guest kernel GS
base is in context.
Sadly Intel refused my request for a control knob to turn off FRED's
auto-SWAPGS, but I didn't really push them on it because for practically
all other circumstances, it would just be a way for OSes to shoot
themselves in the foot.

For PV guests, our regular ABI is half-way to FRED anyway.  I suspect we
can get most of the interesting rest of the functionality by adding an
ERET bit to the HYPERCALL_iret flags.  I'm not sure yet if we ought to
bother exposing CSL in the pvFRED ABI or not, but doing so would reduce
the divergence from native even further.

> --- a/xen/arch/x86/x86_emulate/private.h
> +++ b/xen/arch/x86/x86_emulate/private.h
> @@ -594,6 +594,7 @@ amd_like(const struct x86_emulate_ctxt *
>  #define vcpu_has_tsxldtrk()    (ctxt->cpuid->feat.tsxldtrk)
>  #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
>  #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
> +#define vcpu_has_lkgs()        (ctxt->cpuid->feat.lkgs)
>  
>  #define vcpu_must_have(feat) \
>      generate_exception_if(!vcpu_has_##feat(), X86_EXC_UD)
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -2886,8 +2886,31 @@ x86_emulate(
>                  break;
>              }
>              break;
> -        default:
> -            generate_exception_if(true, EXC_UD);
> +        case 6: /* lkgs */
> +            generate_exception_if((modrm_reg & 1) || vex.pfx != vex_f2, EXC_UD);
> +            generate_exception_if(!mode_64bit() || !mode_ring0(), EXC_UD);

Can we switch to X86_* please.  Alternatively, I've got such a patch
which I've just rebased over all your emulator changes anyway, if we're
happy to fix this in one fell swoop.

(Sadly, you did move some TRAP_* names into util-xen.c which I fixed up
in my other tree-wide exception constant patch.)

> +            vcpu_must_have(lkgs);
> +            fail_if(!ops->read_segment || !ops->read_msr ||
> +                    !ops->write_segment || !ops->write_msr);
> +            if ( (rc = ops->read_msr(MSR_SHADOW_GS_BASE, &msr_val,
> +                                     ctxt)) != X86EMUL_OKAY ||
> +                 (rc = ops->read_segment(x86_seg_gs, &sreg,
> +                                         ctxt)) != X86EMUL_OKAY )
> +                goto done;
> +            dst.orig_val = sreg.base;
> +            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
> +                                         ctxt, ops)) != X86EMUL_OKAY ||
> +                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
> +                                      ctxt)) != X86EMUL_OKAY )
> +                goto done;
> +            sreg.base = dst.orig_val;

Honestly, I think a comment is needed here, because I'm struggling to
work out if this is correct or not.

There is a 64->32 bit truncation of base with LKGS, just as there is
with MOV GS.

Which I think does happen as a side effect of protmode_load_seg() only
filling in the lower half of sreg.base, but I think it would be nicer to
have:

+            dst.orig_val = sreg.base; /* Preserve full GS Base */
+            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
+                                         ctxt, ops)) != X86EMUL_OKAY ||
+                 /* Write truncated base into GS_SHADOW */
+                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
+                                      ctxt)) != X86EMUL_OKAY )
+                goto done;
+            sreg.base = dst.orig_val; /* Reinstate full GS Base */

Or so, because it's weird not to see a (uint32_t) somewhere in this logic.
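The truncation is implicit because a GDT/LDT descriptor only encodes 32 base bits in the first place, so whatever protmode_load_seg() reconstructs can never exceed 32 bits. A standalone sketch of that field extraction (helper name hypothetical):

```python
def descriptor_base(desc):
    """Extract the base field from a raw 8-byte (64-bit) legacy GDT/LDT
    descriptor: base[23:0] lives in bytes 2..4, base[31:24] in byte 7."""
    lo = (desc >> 16) & 0xffffff  # base[23:0]
    hi = (desc >> 56) & 0xff      # base[31:24]
    return (hi << 24) | lo        # never exceeds 32 bits by construction
```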

> +            if ( (rc = ops->write_segment(x86_seg_gs, &sreg,
> +                                          ctxt)) != X86EMUL_OKAY )
> +            {
> +                /* Best effort unwind (i.e. no error checking). */
> +                ops->write_msr(MSR_SHADOW_GS_BASE, msr_val, ctxt);

write_segment() can't fail.  (The sanity checks are actually deferred
until after emulation is complete, and I'm not sure if that's behaviour
we want...)

However, more importantly, if we actually take this error path (for some
future reason) then we've created a security vulnerability in the guest.

It will be strictly better to crash the domain in this case, than to try
and let it continue in this state.

> +                goto done;
> +            }
>              break;
>          }
>          break;
> --- a/xen/include/public/arch-x86/cpufeatureset.h
> +++ b/xen/include/public/arch-x86/cpufeatureset.h
> @@ -281,6 +281,8 @@ XEN_CPUFEATURE(AVX512_BF16,  10*32+ 5) /
>  XEN_CPUFEATURE(FZRM,         10*32+10) /*A  Fast Zero-length REP MOVSB */
>  XEN_CPUFEATURE(FSRS,         10*32+11) /*A  Fast Short REP STOSB */
>  XEN_CPUFEATURE(FSRCS,        10*32+12) /*A  Fast Short REP CMPSB/SCASB */
> +XEN_CPUFEATURE(FRED,         10*32+17) /*   Flexible Return and Event Delivery */
> +XEN_CPUFEATURE(LKGS,         10*32+18) /*S  Load Kernel GS Base */
>  XEN_CPUFEATURE(WRMSRNS,      10*32+19) /*   WRMSR Non-Serialising */
>  
>  /* AMD-defined CPU features, CPUID level 0x80000021.eax, word 11 */
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -295,6 +295,9 @@ def crunch_numbers(state):
>  
>          # In principle the TSXLDTRK insns could also be considered independent.
>          RTM: [TSXLDTRK],
> +
> +        # FRED builds on the LKGS instruction.
> +        LKGS: [FRED],

Hmm...  This is the first case (I think) we've got of a dependency
that goes backwards numerically in terms of feature number.

Obviously we need to support it, but I'm not sure if the deep_deps loop
will cope in its current form.

~Andrew
Jan Beulich April 5, 2023, 7:51 a.m. UTC | #2
On 04.04.2023 23:54, Andrew Cooper wrote:
> On 04/04/2023 3:49 pm, Jan Beulich wrote:
>> Provide support for this insn, which is a prereq to FRED. CPUID-wise
>> introduce both its and FRED's bit at this occasion, thus allowing to
>> also express the dependency right away.
>>
>> While adding a testcase, also add a SWAPGS one. In order to not affect
>> the behavior of pre-existing tests, install write_{segment,msr} hooks
>> only transiently.
> 
> IMO, the emulator is already complicated enough without us having
> fallback logic to cope with callers that don't set up all the hooks.
> 
> Nor do I think making these hooks transient in the test harness is a
> clever idea.

Are you suggesting we start to _require_ all hooks to be set? That'll
mean many useless stubs in particular in PV emulation. Furthermore
absent hooks sometimes cause default behavior to apply rather than
outright failure, so altering what hooks are present can affect
overall behavior. Hence the transient establishing of the two hooks
here.

>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Instead of ->read_segment() we could of course also use ->read_msr() to
>> fetch the original GS base. I don't think I can see a clear advantage of
>> either approach; the way it's done it matches how we handle SWAPGS.
> 
> read_segment() is a much shorter logic chain under the hood, so will be
> marginally faster, but it will be a couple of unnecessary VMREADs (on
> Intel at least).

And this is precisely why I think it's not entirely clear. Anyway,
the remark is here just in case you (or Roger) thought doing it the
other way might be better.

> We could expose the get/set reg paths for cases where we know we're not
> going to need sanity checks, but I'm not sure it's worth it in this case.

Probably not.

>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -2886,8 +2886,31 @@ x86_emulate(
>>                  break;
>>              }
>>              break;
>> -        default:
>> -            generate_exception_if(true, EXC_UD);
>> +        case 6: /* lkgs */
>> +            generate_exception_if((modrm_reg & 1) || vex.pfx != vex_f2, EXC_UD);
>> +            generate_exception_if(!mode_64bit() || !mode_ring0(), EXC_UD);
> 
> Can we switch to X86_* please.  Alternatively, I've got such a patch
> which I've just rebased over all your emulator changes anyway, if we're
> happy to fix this in one fell swoop.

Of course, and I've applied this transformation to all the emulator
patches I have pending (i.e. no need to re-request this elsewhere).

> (Sadly, you did move some TRAP_* names into util-xen.c which I fixed up
> in my other tree-wide exception constant patch.)

Hmm, yes, I changed EXC_* -> X86_EXC_* but failed to pay attention to
TRAP_* (because no build issue arose there).

>> +            vcpu_must_have(lkgs);
>> +            fail_if(!ops->read_segment || !ops->read_msr ||
>> +                    !ops->write_segment || !ops->write_msr);
>> +            if ( (rc = ops->read_msr(MSR_SHADOW_GS_BASE, &msr_val,
>> +                                     ctxt)) != X86EMUL_OKAY ||
>> +                 (rc = ops->read_segment(x86_seg_gs, &sreg,
>> +                                         ctxt)) != X86EMUL_OKAY )
>> +                goto done;
>> +            dst.orig_val = sreg.base;
>> +            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
>> +                                         ctxt, ops)) != X86EMUL_OKAY ||
>> +                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
>> +                                      ctxt)) != X86EMUL_OKAY )
>> +                goto done;
>> +            sreg.base = dst.orig_val;
> 
> Honestly, I think a comment is needed here, because I'm struggling to
> work out if this is correct or not.
> 
> There is a 64->32 bit truncation of base with LKGS, just as there is
> with MOV GS.
> 
> Which I think does happen as a side effect of protmode_load_seg() only
> filling in the lower half of sreg.base,

I thought that's obvious, as protmode_load_seg() can't possibly have
any other behavior. But ...

> but I think it would be nicer to
> have:
> 
> +            dst.orig_val = sreg.base; /* Preserve full GS Base */
> +            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
> +                                         ctxt, ops)) != X86EMUL_OKAY ||
> +                 /* Write truncated base into GS_SHADOW */
> +                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
> +                                      ctxt)) != X86EMUL_OKAY )
> +                goto done;
> +            sreg.base = dst.orig_val; /* Reinstate full GS Base */
> 
> Or so, because it's weird not to see a (uint32_t) somewhere in this logic.

... sure, I've added comments. I don't think "truncated" is correct,
as there's no truncation - there's no more than 32 bits we can read
out of the GDT/LDT. I've used "32-bit" instead.

>> +            if ( (rc = ops->write_segment(x86_seg_gs, &sreg,
>> +                                          ctxt)) != X86EMUL_OKAY )
>> +            {
>> +                /* Best effort unwind (i.e. no error checking). */
>> +                ops->write_msr(MSR_SHADOW_GS_BASE, msr_val, ctxt);
> 
> write_segment() can't fail.  (The sanity checks are actually deferred
> until after emulation is complete, and I'm not sure if that's behaviour
> we want...)

Irrespective of that, we shouldn't start omitting error checks. If we
truly mean write_segment() to always succeed, we should make it
return "void". Yet I wouldn't view that as a good move.

> However, more importantly, if we actually take this error path (for some
> future reason) then we've created a security vulnerability in the guest.

You mean in case additionally the write_msr() also fails? In any
event, ...

> It will be strictly better to crash the domain in this case, than to try
> and let it continue in this state.

... we don't return OKAY in this case, so in most cases the guest
will already be crashed, won't it? Otherwise it's not really clear
to me what action you propose to take to "crash the domain": Right
here we'd better not call domain_crash() (or else we'd need yet
another #ifdef __XEN__ around it).

Furthermore - what does what you say here mean for the (existing)
similar code path in SWAPGS handling?

>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -295,6 +295,9 @@ def crunch_numbers(state):
>>  
>>          # In principle the TSXLDTRK insns could also be considered independent.
>>          RTM: [TSXLDTRK],
>> +
>> +        # FRED builds on the LKGS instruction.
>> +        LKGS: [FRED],
> 
> Hmm...  This is the first case (I think) we've got of a dependency
> that goes backwards numerically in terms of feature number.
> 
> Obviously we need to support it, but I'm not sure if the deep_deps loop
> will cope in its current form.

As you know my Python isn't overly good, but looking at the loop
makes me think it deals with this fine; I can't see an ordering
dependency. Looking at the generated INIT_DEEP_DEPS appears to
confirm this - there is an entry for LKGS, and while I haven't
counted elements, the set bit there looks to be in a plausible
position. The only thing I'm not entirely certain about is whether
a further (hypothetical) transitive dependency would also be dealt
with correctly.

If there was an issue here, I'd really like to leave addressing it
to you, as your Python is surely much better than mine.

Jan
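The ordering question can be checked with a standalone sketch of a fixed-point closure over the dependency map (a simplification of what crunch_numbers() computes; the "HYP" feature below is a hypothetical transitive dependency, not a real one):

```python
def deep_deps(deps):
    """For each deep feature, compute the transitive closure of the
    features depending on it, iterating until nothing changes."""
    closure = {f: set(children) for f, children in deps.items()}
    changed = True
    while changed:
        changed = False
        for children in closure.values():
            for child in list(children):  # snapshot: children mutates below
                extra = closure.get(child, set()) - children
                if extra:
                    children |= extra
                    changed = True
    return closure
```

Because this loops to a fixed point, it makes no difference that LKGS (bit 18) numerically follows FRED (bit 17), and a further transitive dependency would land in LKGS's set as well.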

Patch

--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -235,6 +235,8 @@  int libxl_cpuid_parse_config(libxl_cpuid
         {"fzrm",         0x00000007,  1, CPUID_REG_EAX, 10,  1},
         {"fsrs",         0x00000007,  1, CPUID_REG_EAX, 11,  1},
         {"fsrcs",        0x00000007,  1, CPUID_REG_EAX, 12,  1},
+        {"fred",         0x00000007,  1, CPUID_REG_EAX, 17,  1},
+        {"lkgs",         0x00000007,  1, CPUID_REG_EAX, 18,  1},
         {"wrmsrns",      0x00000007,  1, CPUID_REG_EAX, 19,  1},
 
         {"cet-sss",      0x00000007,  1, CPUID_REG_EDX, 18,  1},
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -190,7 +190,8 @@  static const char *const str_7a1[32] =
     [10] = "fzrm",          [11] = "fsrs",
     [12] = "fsrcs",
 
-    /* 18 */                [19] = "wrmsrns",
+    /* 16 */                [17] = "fred",
+    [18] = "lkgs",          [19] = "wrmsrns",
 };
 
 static const char *const str_e21a[32] =
--- a/tools/tests/x86_emulator/predicates.c
+++ b/tools/tests/x86_emulator/predicates.c
@@ -326,6 +326,7 @@  static const struct {
     { { 0x00, 0x18 }, { 2, 2 }, T, R }, /* ltr */
     { { 0x00, 0x20 }, { 2, 2 }, T, R }, /* verr */
     { { 0x00, 0x28 }, { 2, 2 }, T, R }, /* verw */
+    { { 0x00, 0x30 }, { 0, 2 }, T, R, pfx_f2 }, /* lkgs */
     { { 0x01, 0x00 }, { 2, 2 }, F, W }, /* sgdt */
     { { 0x01, 0x08 }, { 2, 2 }, F, W }, /* sidt */
     { { 0x01, 0x10 }, { 2, 2 }, F, R }, /* lgdt */
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -666,6 +666,10 @@  static int blk(
     return x86_emul_blk((void *)offset, p_data, bytes, eflags, state, ctxt);
 }
 
+#ifdef __x86_64__
+static unsigned long gs_base, gs_base_shadow;
+#endif
+
 static int read_segment(
     enum x86_segment seg,
     struct segment_register *reg,
@@ -675,8 +679,30 @@  static int read_segment(
         return X86EMUL_UNHANDLEABLE;
     memset(reg, 0, sizeof(*reg));
     reg->p = 1;
+
+#ifdef __x86_64__
+    if ( seg == x86_seg_gs )
+        reg->base = gs_base;
+#endif
+
+    return X86EMUL_OKAY;
+}
+
+#ifdef __x86_64__
+static int write_segment(
+    enum x86_segment seg,
+    const struct segment_register *reg,
+    struct x86_emulate_ctxt *ctxt)
+{
+    if ( !is_x86_user_segment(seg) )
+        return X86EMUL_UNHANDLEABLE;
+
+    if ( seg == x86_seg_gs )
+        gs_base = reg->base;
+
     return X86EMUL_OKAY;
 }
+#endif
 
 static int read_msr(
     unsigned int reg,
@@ -689,6 +715,20 @@  static int read_msr(
         *val = ctxt->addr_size > 32 ? 0x500 /* LME|LMA */ : 0;
         return X86EMUL_OKAY;
 
+#ifdef __x86_64__
+    case 0xc0000101: /* GS_BASE */
+        if ( ctxt->addr_size < 64 )
+            break;
+        *val = gs_base;
+        return X86EMUL_OKAY;
+
+    case 0xc0000102: /* SHADOW_GS_BASE */
+        if ( ctxt->addr_size < 64 )
+            break;
+        *val = gs_base_shadow;
+        return X86EMUL_OKAY;
+#endif
+
     case 0xc0000103: /* TSC_AUX */
 #define TSC_AUX_VALUE 0xCACACACA
         *val = TSC_AUX_VALUE;
@@ -698,6 +738,31 @@  static int read_msr(
     return X86EMUL_UNHANDLEABLE;
 }
 
+#ifdef __x86_64__
+static int write_msr(
+    unsigned int reg,
+    uint64_t val,
+    struct x86_emulate_ctxt *ctxt)
+{
+    switch ( reg )
+    {
+    case 0xc0000101: /* GS_BASE */
+        if ( ctxt->addr_size < 64 || !is_canonical_address(val) )
+            break;
+        gs_base = val;
+        return X86EMUL_OKAY;
+
+    case 0xc0000102: /* SHADOW_GS_BASE */
+        if ( ctxt->addr_size < 64 || !is_canonical_address(val) )
+            break;
+        gs_base_shadow = val;
+        return X86EMUL_OKAY;
+    }
+
+    return X86EMUL_UNHANDLEABLE;
+}
+#endif
+
 #define INVPCID_ADDR 0x12345678
 #define INVPCID_PCID 0x123
 
@@ -1331,6 +1396,41 @@  int main(int argc, char **argv)
         printf("%u bytes read - ", bytes_read);
         goto fail;
     }
+    printf("okay\n");
+
+    emulops.write_segment = write_segment;
+    emulops.write_msr     = write_msr;
+
+    printf("%-40s", "Testing swapgs...");
+    instr[0] = 0x0f; instr[1] = 0x01; instr[2] = 0xf8;
+    regs.eip = (unsigned long)&instr[0];
+    gs_base = 0xffffeeeecccc8888UL;
+    gs_base_shadow = 0x0000111122224444UL;
+    rc = x86_emulate(&ctxt, &emulops);
+    if ( (rc != X86EMUL_OKAY) ||
+         (regs.eip != (unsigned long)&instr[3]) ||
+         (gs_base != 0x0000111122224444UL) ||
+         (gs_base_shadow != 0xffffeeeecccc8888UL) )
+        goto fail;
+    printf("okay\n");
+
+    printf("%-40s", "Testing lkgs 2(%rdx)...");
+    instr[0] = 0xf2; instr[1] = 0x0f; instr[2] = 0x00; instr[3] = 0x72; instr[4] = 0x02;
+    regs.eip = (unsigned long)&instr[0];
+    regs.edx = (unsigned long)res;
+    res[0]   = 0x00004444;
+    res[1]   = 0x8888cccc;
+    i = cp.extd.nscb; cp.extd.nscb = true; /* for AMD */
+    rc = x86_emulate(&ctxt, &emulops);
+    if ( (rc != X86EMUL_OKAY) ||
+         (regs.eip != (unsigned long)&instr[5]) ||
+         (gs_base != 0x0000111122224444UL) ||
+         gs_base_shadow )
+        goto fail;
+
+    cp.extd.nscb = i;
+    emulops.write_segment = NULL;
+    emulops.write_msr     = NULL;
 #endif
     printf("okay\n");
 
--- a/tools/tests/x86_emulator/x86-emulate.c
+++ b/tools/tests/x86_emulator/x86-emulate.c
@@ -86,6 +86,7 @@  bool emul_test_init(void)
     cp.feat.adx = true;
     cp.feat.avx512pf = cp.feat.avx512f;
     cp.feat.rdpid = true;
+    cp.feat.lkgs = true;
     cp.extd.clzero = true;
 
     if ( cpu_has_xsave )
--- a/xen/arch/x86/x86_emulate/decode.c
+++ b/xen/arch/x86/x86_emulate/decode.c
@@ -744,8 +744,12 @@  decode_twobyte(struct x86_emulate_state
         case 0:
             s->desc |= DstMem | SrcImplicit | Mov;
             break;
+        case 6:
+            if ( !(s->modrm_reg & 1) && mode_64bit() )
+            {
         case 2: case 4:
-            s->desc |= SrcMem16;
+                s->desc |= SrcMem16;
+            }
             break;
         }
         break;
--- a/xen/arch/x86/x86_emulate/private.h
+++ b/xen/arch/x86/x86_emulate/private.h
@@ -594,6 +594,7 @@  amd_like(const struct x86_emulate_ctxt *
 #define vcpu_has_tsxldtrk()    (ctxt->cpuid->feat.tsxldtrk)
 #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
 #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
+#define vcpu_has_lkgs()        (ctxt->cpuid->feat.lkgs)
 
 #define vcpu_must_have(feat) \
     generate_exception_if(!vcpu_has_##feat(), X86_EXC_UD)
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -2886,8 +2886,31 @@  x86_emulate(
                 break;
             }
             break;
-        default:
-            generate_exception_if(true, EXC_UD);
+        case 6: /* lkgs */
+            generate_exception_if((modrm_reg & 1) || vex.pfx != vex_f2, EXC_UD);
+            generate_exception_if(!mode_64bit() || !mode_ring0(), EXC_UD);
+            vcpu_must_have(lkgs);
+            fail_if(!ops->read_segment || !ops->read_msr ||
+                    !ops->write_segment || !ops->write_msr);
+            if ( (rc = ops->read_msr(MSR_SHADOW_GS_BASE, &msr_val,
+                                     ctxt)) != X86EMUL_OKAY ||
+                 (rc = ops->read_segment(x86_seg_gs, &sreg,
+                                         ctxt)) != X86EMUL_OKAY )
+                goto done;
+            dst.orig_val = sreg.base;
+            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
+                                         ctxt, ops)) != X86EMUL_OKAY ||
+                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
+                                      ctxt)) != X86EMUL_OKAY )
+                goto done;
+            sreg.base = dst.orig_val;
+            if ( (rc = ops->write_segment(x86_seg_gs, &sreg,
+                                          ctxt)) != X86EMUL_OKAY )
+            {
+                /* Best effort unwind (i.e. no error checking). */
+                ops->write_msr(MSR_SHADOW_GS_BASE, msr_val, ctxt);
+                goto done;
+            }
             break;
         }
         break;
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -281,6 +281,8 @@  XEN_CPUFEATURE(AVX512_BF16,  10*32+ 5) /
 XEN_CPUFEATURE(FZRM,         10*32+10) /*A  Fast Zero-length REP MOVSB */
 XEN_CPUFEATURE(FSRS,         10*32+11) /*A  Fast Short REP STOSB */
 XEN_CPUFEATURE(FSRCS,        10*32+12) /*A  Fast Short REP CMPSB/SCASB */
+XEN_CPUFEATURE(FRED,         10*32+17) /*   Flexible Return and Event Delivery */
+XEN_CPUFEATURE(LKGS,         10*32+18) /*S  Load Kernel GS Base */
 XEN_CPUFEATURE(WRMSRNS,      10*32+19) /*   WRMSR Non-Serialising */
 
 /* AMD-defined CPU features, CPUID level 0x80000021.eax, word 11 */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -295,6 +295,9 @@  def crunch_numbers(state):
 
         # In principle the TSXLDTRK insns could also be considered independent.
         RTM: [TSXLDTRK],
+
+        # FRED builds on the LKGS instruction.
+        LKGS: [FRED],
     }
 
     deep_features = tuple(sorted(deps.keys()))