[v8,27/50] x86emul: support AVX512{F,ER} reciprocal insns

Message ID: 5C8B84A8020000780021F23F@prv1-mh.provo.novell.com
State: New, archived
Series: x86emul: remaining AVX512 support

Commit Message

Jan Beulich March 15, 2019, 10:55 a.m. UTC
Also include the only other AVX512ER insn pair, VEXP2P{D,S}.

Note that despite the replacement of the SHA insns' table slots there's
no need to special case their decoding: Their insn-specific code already
sets op_bytes (as was required due to simd_other), and TwoOp is of no
relevance for legacy encoded SIMD insns.

The raising of #UD when EVEX.L'L is 3 for AVX512ER scalar insns is done
to be on the safe side. The SDM does not clarify behavior there, and
it's even more ambiguous here (without AVX512VL in the picture).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v7: Fix vector length check for AVX512ER insns. ea.type == OP_* ->
    ea.type != OP_*. Re-base.
v6: Re-base. AVX512ER tests now also successfully run.
v5: New.
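For reference, the EVEX prefix is the four bytes 0x62, P0, P1, P2, and the
L'L field discussed above occupies bits 6:5 of P2: values 0/1/2 select
128/256/512-bit vector length (or encode rounding control when EVEX.b is
set), while 3 is reserved. A minimal sketch of extracting the field -- not
part of the patch, assuming only a C99 compiler:

#include <stdint.h>
#include <stdio.h>

/* Return EVEX.L'L (bits 6:5 of P2, the prefix's fourth byte), or ~0u
 * if the buffer does not start with an EVEX prefix. */
static unsigned int evex_ll(const uint8_t *insn)
{
    return insn[0] == 0x62 ? (insn[3] >> 5) & 3 : ~0u;
}

int main(void)
{
    /* vrcp14sd %xmm0,%xmm0,%xmm0, hand-encoded with L'L forced to 3. */
    static const uint8_t bad[] = { 0x62, 0xf2, 0xfd, 0x68, 0x4d, 0xc0 };

    printf("EVEX.L'L = %u\n", evex_ll(bad)); /* prints 3 */
    return 0;
}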

Comments

Andrew Cooper May 23, 2019, 4:15 p.m. UTC | #1
On 15/03/2019 10:55, Jan Beulich wrote:
> [...]
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Seeing as I have some ER hardware, is there an easy way to get
GCC/binutils to emit a weird L'L field, or will this involve some manual
opcode generation to test?

~Andrew
Jan Beulich May 24, 2019, 6:43 a.m. UTC | #2
>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 10:55, Jan Beulich wrote:
>> [...]
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks, also for the others.

> Seeing as I have some ER hardware, is there an easy way to get
> GCC/binutils to emit a weird L'L field, or will this involve some manual
> opcode generation to test?

gcc does not provide any control at all, afaict. binutils allows "weird"
VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
I'm afraid this will involve using .byte.

Jan
Andrew Cooper May 24, 2019, 8:48 p.m. UTC | #3
On 24/05/2019 07:43, Jan Beulich wrote:
>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>> On 15/03/2019 10:55, Jan Beulich wrote:
>>> [...]
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Thanks, also for the others.
>
>> Seeing as I have some ER hardware, is there an easy way to get
>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>> opcode generation to test?
> gcc does not provide any control at all, afaict. binutils allows "weird"
> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
> I'm afraid this will involve using .byte.

Ok.  Given a test program of:

#include <stdio.h>

int main(void)
{
    printf("Real:\n");     /* assembler-encoded, EVEX P2 = 0x08 (L'L = 0) */
    asm volatile ("vrcp14sd %xmm0,%xmm0,%xmm0");

    printf("Bytes:\n");    /* the same insn, hand-encoded */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x08, 0x4d, 0xc0");

    printf("Bad 0x28:\n"); /* L'L = 1 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x28, 0x4d, 0xc0");

    printf("Bad 0x48:\n"); /* L'L = 2 */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x48, 0x4d, 0xc0");

    printf("Bad 0x68:\n"); /* L'L = 3, reserved */
    asm volatile (".byte 0x62, 0xf2, 0xfd, 0x68, 0x4d, 0xc0");
}

Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
KNL and KNM.

~Andrew
Jan Beulich May 27, 2019, 8:02 a.m. UTC | #4
>>> On 24.05.19 at 22:48, <andrew.cooper3@citrix.com> wrote:
> On 24/05/2019 07:43, Jan Beulich wrote:
>>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>>> On 15/03/2019 10:55, Jan Beulich wrote:
>>>> [...]
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Thanks, also for the others.
>>
>>> Seeing as I have some ER hardware, is there an easy way to get
>>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>>> opcode generation to test?
>> gcc does not provide any control at all, afaict. binutils allows "weird"
>> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
>> I'm afraid this will involve using .byte.
> 
> Ok.  Given a test program of:
> 
> [...]
> 
> Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
> KNL and KNM.

And by implication I take it that the L'L=1 and L'L=2 cases indeed do not
#UD there?

Thanks for having tried this out,
Jan
Andrew Cooper May 29, 2019, 10 a.m. UTC | #5
On 27/05/2019 09:02, Jan Beulich wrote:
>>>> On 24.05.19 at 22:48, <andrew.cooper3@citrix.com> wrote:
>> On 24/05/2019 07:43, Jan Beulich wrote:
>>>>>> On 23.05.19 at 18:15, <andrew.cooper3@citrix.com> wrote:
>>>> On 15/03/2019 10:55, Jan Beulich wrote:
>>>>> [...]
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Thanks, also for the others.
>>>
>>>> Seeing as I have some ER hardware, is there an easy way to get
>>>> GCC/binutils to emit a weird L'L field, or will this involve some manual
>>>> opcode generation to test?
>>> gcc does not provide any control at all, afaict. binutils allows "weird"
>>> VEX.L or EVEX.L'L only for insns it believes ignore that field. So yes,
>>> I'm afraid this will involve using .byte.
>> Ok.  Given a test program of:
>>
>> [...]
>>
>> Then the L'L = 3 case (0x68 at the end) does indeed take #UD for both
>> KNL and KNM.
> And by implication I take it that the L'L=1 and L'L=2 cases indeed do not
> #UD there?

Correct.  It would appear that, unhelpfully, the "L'L == 3 is reserved"
rule takes precedence over the "L'L is ignored" rule for this class
of instruction.

~Andrew
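The behaviour measured above matches the checks the patch adds below:
packed ER insns demand a 512-bit length (or a register source with SAE),
while scalar ones only reject the reserved L'L value. A standalone
paraphrase of that logic -- a hypothetical helper, not the emulator's
actual code:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical paraphrase (not the emulator's code) of the #UD
 * conditions added for AVX512ER encodings. */
static bool avx512er_should_ud(unsigned int ll, bool evex_b,
                               bool src_is_reg, bool scalar)
{
    if ( scalar )
        /* SAE requires a register source; otherwise only the
         * reserved L'L == 3 faults, as measured on KNL/KNM. */
        return evex_b ? !src_is_reg : ll == 3;

    /* Packed forms: 512-bit only, except register-source SAE. */
    return (!src_is_reg || !evex_b) && ll != 2;
}

int main(void)
{
    printf("scalar, L'L=3: %s\n",
           avx512er_should_ud(3, false, true, true) ? "#UD" : "ok");
    return 0;
}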

Patch

--- a/tools/tests/x86_emulator/Makefile
+++ b/tools/tests/x86_emulator/Makefile
@@ -16,7 +16,7 @@  vpath %.c $(XEN_ROOT)/xen/lib/x86
 
 CFLAGS += $(CFLAGS_xeninclude)
 
-SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq
+SIMD := 3dnow sse sse2 sse4 avx avx2 xop avx512f avx512bw avx512dq avx512er
 FMA := fma4 fma
 SG := avx2-sg
 TESTCASES := blowfish $(SIMD) $(FMA) $(SG)
@@ -72,6 +72,9 @@  avx512bw-flts :=
 avx512dq-vecs := $(avx512f-vecs)
 avx512dq-ints := $(avx512f-ints)
 avx512dq-flts := $(avx512f-flts)
+avx512er-vecs := 64
+avx512er-ints :=
+avx512er-flts := 4 8
 
 avx512f-opmask-vecs := 2
 avx512dq-opmask-vecs := 1 2
--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -278,10 +278,14 @@  static const struct test avx512f_all[] =
     INSN(punpckldq,    66,   0f, 62,    vl,      d, vl),
     INSN(punpcklqdq,   66,   0f, 6c,    vl,      q, vl),
     INSN(pxor,         66,   0f, ef,    vl,     dq, vl),
+    INSN(rcp14,        66, 0f38, 4c,    vl,     sd, vl),
+    INSN(rcp14,        66, 0f38, 4d,    el,     sd, el),
     INSN(rndscalepd,   66, 0f3a, 09,    vl,      q, vl),
     INSN(rndscaleps,   66, 0f3a, 08,    vl,      d, vl),
     INSN(rndscalesd,   66, 0f3a, 0b,    el,      q, el),
     INSN(rndscaless,   66, 0f3a, 0a,    el,      d, el),
+    INSN(rsqrt14,      66, 0f38, 4e,    vl,     sd, vl),
+    INSN(rsqrt14,      66, 0f38, 4f,    el,     sd, el),
     INSN_PFP(shuf,           0f, c6),
     INSN_FP(sqrt,            0f, 51),
     INSN_FP(sub,             0f, 5c),
@@ -477,6 +481,14 @@  static const struct test avx512dq_512[]
     INSN(inserti32x8,    66, 0f3a, 3a, el_8, d, vl),
 };
 
+static const struct test avx512er_512[] = {
+    INSN(exp2,    66, 0f38, c8, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, ca, vl, sd, vl),
+    INSN(rcp28,   66, 0f38, cb, el, sd, el),
+    INSN(rsqrt28, 66, 0f38, cc, vl, sd, vl),
+    INSN(rsqrt28, 66, 0f38, cd, el, sd, el),
+};
+
 static const struct test avx512_vbmi_all[] = {
     INSN(permb,         66, 0f38, 8d, vl, b, vl),
     INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -837,5 +849,6 @@  void evex_disp8_test(void *instr, struct
     RUN(avx512dq, 128);
     RUN(avx512dq, no128);
     RUN(avx512dq, 512);
+    RUN(avx512er, 512);
     RUN(avx512_vbmi, all);
 }
--- a/tools/tests/x86_emulator/simd.c
+++ b/tools/tests/x86_emulator/simd.c
@@ -210,9 +210,23 @@  static inline vec_t movlhps(vec_t x, vec
 })
 #elif defined(FLOAT_SIZE) && VEC_SIZE == FLOAT_SIZE && defined(__AVX512F__)
 # if FLOAT_SIZE == 4
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28ss %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14ss %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14ss %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtss %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscaless $0b1011, %[in], %[out], %[out]")
 # elif FLOAT_SIZE == 8
+#  ifdef __AVX512ER__
+#   define recip(x) scalar_1op(x, "vrcp28sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt28sd %[in], %[out], %[out]")
+#  else
+#   define recip(x) scalar_1op(x, "vrcp14sd %[in], %[out], %[out]")
+#   define rsqrt(x) scalar_1op(x, "vrsqrt14sd %[in], %[out], %[out]")
+#  endif
 #  define sqrt(x) scalar_1op(x, "vsqrtsd %[in], %[out], %[out]")
 #  define trunc(x) scalar_1op(x, "vrndscalesd $0b1011, %[in], %[out], %[out]")
 # endif
@@ -263,6 +277,13 @@  static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxps, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minps, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movaps, _mask, x, y, (0b0101010101010101 & ALL_TRUE))
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28ps, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14ps, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14ps, _mask, x, undef(), ~0)
+#  endif
 #  define shrink1(x) BR_(cvtpd2ps, _mask, (vdf_t)(x), (vsf_half_t){}, ~0)
 #  define sqrt(x) BR(sqrtps, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscaleps_, _mask, x, 0b1011, undef(), ~0)
@@ -318,6 +339,13 @@  static inline vec_t movlhps(vec_t x, vec
 #  define max(x, y) BR_(maxpd, _mask, x, y, undef(), ~0)
 #  define min(x, y) BR_(minpd, _mask, x, y, undef(), ~0)
 #  define mix(x, y) B(movapd, _mask, x, y, 0b01010101)
+#  if VEC_SIZE == 64 && defined(__AVX512ER__)
+#   define recip(x) BR(rcp28pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) BR(rsqrt28pd, _mask, x, undef(), ~0)
+#  else
+#   define recip(x) B(rcp14pd, _mask, x, undef(), ~0)
+#   define rsqrt(x) B(rsqrt14pd, _mask, x, undef(), ~0)
+#  endif
 #  define sqrt(x) BR(sqrtpd, _mask, x, undef(), ~0)
 #  define trunc(x) BR(rndscalepd_, _mask, x, 0b1011, undef(), ~0)
 #  if VEC_SIZE == 16
--- a/tools/tests/x86_emulator/simd.h
+++ b/tools/tests/x86_emulator/simd.h
@@ -178,14 +178,20 @@  DECL_OCTET(half);
 /* Sadly there are a few exceptions to the general naming rules. */
 # define __builtin_ia32_broadcastf32x4_512_mask __builtin_ia32_broadcastf32x4_512
 # define __builtin_ia32_broadcasti32x4_512_mask __builtin_ia32_broadcasti32x4_512
+# define __builtin_ia32_exp2pd512_mask __builtin_ia32_exp2pd_mask
+# define __builtin_ia32_exp2ps512_mask __builtin_ia32_exp2ps_mask
 # define __builtin_ia32_insertf32x4_512_mask __builtin_ia32_insertf32x4_mask
 # define __builtin_ia32_insertf32x8_512_mask __builtin_ia32_insertf32x8_mask
 # define __builtin_ia32_insertf64x4_512_mask __builtin_ia32_insertf64x4_mask
 # define __builtin_ia32_inserti32x4_512_mask __builtin_ia32_inserti32x4_mask
 # define __builtin_ia32_inserti32x8_512_mask __builtin_ia32_inserti32x8_mask
 # define __builtin_ia32_inserti64x4_512_mask __builtin_ia32_inserti64x4_mask
+# define __builtin_ia32_rcp28pd512_mask __builtin_ia32_rcp28pd_mask
+# define __builtin_ia32_rcp28ps512_mask __builtin_ia32_rcp28ps_mask
 # define __builtin_ia32_rndscalepd_512_mask __builtin_ia32_rndscalepd_mask
 # define __builtin_ia32_rndscaleps_512_mask __builtin_ia32_rndscaleps_mask
+# define __builtin_ia32_rsqrt28pd512_mask __builtin_ia32_rsqrt28pd_mask
+# define __builtin_ia32_rsqrt28ps512_mask __builtin_ia32_rsqrt28ps_mask
 # define __builtin_ia32_shuf_f32x4_512_mask __builtin_ia32_shuf_f32x4_mask
 # define __builtin_ia32_shuf_f64x2_512_mask __builtin_ia32_shuf_f64x2_mask
 # define __builtin_ia32_shuf_i32x4_512_mask __builtin_ia32_shuf_i32x4_mask
--- a/tools/tests/x86_emulator/test_x86_emulator.c
+++ b/tools/tests/x86_emulator/test_x86_emulator.c
@@ -24,6 +24,7 @@  asm ( ".pushsection .test, \"ax\", @prog
 #include "avx512f.h"
 #include "avx512bw.h"
 #include "avx512dq.h"
+#include "avx512er.h"
 
 #define verbose false /* Switch to true for far more logging. */
 
@@ -106,6 +107,11 @@  static bool simd_check_avx512dq_vl(void)
     return cpu_has_avx512dq && cpu_has_avx512vl;
 }
 
+static bool simd_check_avx512er(void)
+{
+    return cpu_has_avx512er;
+}
+
 static bool simd_check_avx512bw(void)
 {
     return cpu_has_avx512bw;
@@ -327,6 +333,10 @@  static const struct {
     AVX512VL(DQ+VL u64x2,    avx512dq,      16u8),
     AVX512VL(DQ+VL s64x4,    avx512dq,      32i8),
     AVX512VL(DQ+VL u64x4,    avx512dq,      32u8),
+    SIMD(AVX512ER f32 scalar,avx512er,        f4),
+    SIMD(AVX512ER f32x16,    avx512er,      64f4),
+    SIMD(AVX512ER f64 scalar,avx512er,        f8),
+    SIMD(AVX512ER f64x8,     avx512er,      64f8),
 #undef AVX512VL_
 #undef AVX512VL
 #undef SIMD_
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -134,6 +134,7 @@  static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_bmi2       cp.feat.bmi2
 #define cpu_has_avx512f   (cp.feat.avx512f  && xcr0_mask(0xe6))
 #define cpu_has_avx512dq  (cp.feat.avx512dq && xcr0_mask(0xe6))
+#define cpu_has_avx512er  (cp.feat.avx512er && xcr0_mask(0xe6))
 #define cpu_has_avx512bw  (cp.feat.avx512bw && xcr0_mask(0xe6))
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -471,6 +471,10 @@  static const struct ext0f38_table {
     [0x40] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x41] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0x45 ... 0x47] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
+    [0x4c] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
     [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
     [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -510,7 +514,12 @@  static const struct ext0f38_table {
     [0xbd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xbe] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
     [0xbf] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
-    [0xc8 ... 0xcd] = { .simd_size = simd_other },
+    [0xc8] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xc9] = { .simd_size = simd_other },
+    [0xca] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcb] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0xcc] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
+    [0xcd] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
     [0xdb] = { .simd_size = simd_packed_int, .two_op = 1 },
     [0xdc ... 0xdf] = { .simd_size = simd_packed_int },
     [0xf0] = { .two_op = 1 },
@@ -1873,6 +1882,7 @@  static bool vcpu_has(
 #define vcpu_has_smap()        vcpu_has(         7, EBX, 20, ctxt, ops)
 #define vcpu_has_clflushopt()  vcpu_has(         7, EBX, 23, ctxt, ops)
 #define vcpu_has_clwb()        vcpu_has(         7, EBX, 24, ctxt, ops)
+#define vcpu_has_avx512er()    vcpu_has(         7, EBX, 27, ctxt, ops)
 #define vcpu_has_sha()         vcpu_has(         7, EBX, 29, ctxt, ops)
 #define vcpu_has_avx512bw()    vcpu_has(         7, EBX, 30, ctxt, ops)
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
@@ -6168,6 +6178,8 @@  x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x45): /* vpsrlv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x46): /* vpsrav{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0x47): /* vpsllv{d,q} [xyz]mm/mem,[xyz]mm,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4c): /* vrcp14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4e): /* vrsqrt14p{s,d} [xyz]mm/mem,[xyz]mm{k} */
     avx512f_no_sae:
         host_and_vcpu_must_have(avx512f);
         generate_exception_if(ea.type != OP_MEM && evex.brs, EXC_UD);
@@ -8865,6 +8877,13 @@  x86_emulate(
         generate_exception_if(vex.w, EXC_UD);
         goto simd_0f_avx2;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4d): /* vrcp14s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x4f): /* vrsqrt14s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512f);
+        generate_exception_if(evex.brs, EXC_UD);
+        avx512_vlen_check(true);
+        goto simd_zmm;
+
     case X86EMUL_OPC_VEX_66(0x0f38, 0x5a): /* vbroadcasti128 m128,ymm */
         generate_exception_if(ea.type != OP_MEM || !vex.l || vex.w, EXC_UD);
         goto simd_0f_avx2;
@@ -9112,6 +9131,7 @@  x86_emulate(
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbd): /* vfnmadd231s{s,d} xmm/mem,xmm,xmm{k} */
     case X86EMUL_OPC_EVEX_66(0x0f38, 0xbf): /* vfnmsub231s{s,d} xmm/mem,xmm,xmm{k} */
         host_and_vcpu_must_have(avx512f);
+    simd_zmm_scalar_sae:
         generate_exception_if(ea.type != OP_REG && evex.brs, EXC_UD);
         if ( !evex.brs )
             avx512_vlen_check(true);
@@ -9127,6 +9147,19 @@  x86_emulate(
         op_bytes = 16;
         goto simd_0f38_common;
 
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xc8): /* vexp2p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xca): /* vrcp28p{s,d} zmm/mem,zmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcc): /* vrsqrt28p{s,d} zmm/mem,zmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        generate_exception_if((ea.type != OP_REG || !evex.brs) && evex.lr != 2,
+                              EXC_UD);
+        goto simd_zmm;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcb): /* vrcp28s{s,d} xmm/mem,xmm,xmm{k} */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0xcd): /* vrsqrt28s{s,d} xmm/mem,xmm,xmm{k} */
+        host_and_vcpu_must_have(avx512er);
+        goto simd_zmm_scalar_sae;
+
     case X86EMUL_OPC(0x0f38, 0xf0): /* movbe m,r */
     case X86EMUL_OPC(0x0f38, 0xf1): /* movbe r,m */
         vcpu_must_have(movbe);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -102,6 +102,7 @@ 
 #define cpu_has_avx512dq        boot_cpu_has(X86_FEATURE_AVX512DQ)
 #define cpu_has_rdseed          boot_cpu_has(X86_FEATURE_RDSEED)
 #define cpu_has_smap            boot_cpu_has(X86_FEATURE_SMAP)
+#define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
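
As a rough illustration of what the new simd.c recip() path exercises for
64-byte vectors when __AVX512ER__ is defined, here is a standalone sketch
using the corresponding intrinsic rather than the test harness's builtin
wrappers (build with -mavx512er; runs only on AVX512ER hardware such as
KNL/KNM):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d x = _mm512_set1_pd(4.0);
    __m512d r = _mm512_rcp28_pd(x); /* VRCP28PD: relative error < 2^-28 */
    double out[8];

    _mm512_storeu_pd(out, r);
    printf("%f\n", out[0]); /* approximately 0.25 */
    return 0;
}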