
[v8,44/50] x86emul: support VPCLMULQDQ insns

Message ID 5C8B86FB020000780021F31D@prv1-mh.provo.novell.com (mailing list archive)
State New, archived
Series x86emul: remaining AVX512 support

Commit Message

Jan Beulich March 15, 2019, 11:05 a.m. UTC
As to the feature dependency adjustment, while strictly speaking AVX is
a sufficient prereq (to have YMM registers), 256-bit vectors of integers
were fully introduced only with AVX2. Sadly gcc can't be used as a
reference here: it doesn't provide any AVX512-independent built-in at
all.

Along the lines of PCLMULQDQ, since the insns here and in particular
their memory access patterns follow the usual scheme, I didn't think it
was necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
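For reference, the Disp8 scaling being tested is EVEX's compressed displacement: for "full vector" tuple insns such as vpclmulqdq (d8s_vl in the tables below), the encoded 8-bit displacement is multiplied by the vector length in bytes. A minimal sketch of the rule (function name is illustrative, not from the patch):

```python
def scale_disp8(disp8_byte, evex_ll):
    """EVEX compressed displacement for the full-vector (d8s_vl) tuple.

    The scale factor is the vector length in bytes: 16/32/64 for
    EVEX.L'L = 0/1/2.  disp8_byte is the raw (unsigned) encoded byte.
    """
    n = 16 << evex_ll
    # disp8 is a signed byte; sign-extend before scaling.
    d = disp8_byte - 256 if disp8_byte >= 128 else disp8_byte
    return d * n
```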

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?
---
v8: No need to set fault_suppression to false.
v7: New.

Comments

Andrew Cooper June 21, 2019, 12:52 p.m. UTC | #1
On 15/03/2019 11:05, Jan Beulich wrote:
> As to the feature dependency adjustment, while strictly speaking AVX is
> a sufficient prereq (to have YMM registers), 256-bit vectors of integers
> were fully introduced only with AVX2. Sadly gcc can't be used as a
> reference here: it doesn't provide any AVX512-independent built-in at
> all.
>
> Along the lines of PCLMULQDQ, since the insns here and in particular
> their memory access patterns follow the usual scheme, I didn't think it
> was necessary to add a contrived test specifically for them, beyond the
> Disp8 scaling one.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?

I think so, yes.  These are all 64 by 64 multiplies with a 128 bit
result, and an imm8 to choose which quadwords get used, so both these
features will be using the silicon block in the vector pipeline.  The
only difference is whether it's wired through from the legacy SSE
instructions, or the [E]VEX instructions.

I certainly don't expect to ever see hardware with VPCLMULQDQ but
lacking PCLMULQDQ.
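The operation being described can be modelled in a few lines; assuming the SDM's operand selection (imm8 bit 0 picks the quadword of the first source, bit 4 that of the second), a per-128-bit-lane sketch looks like:

```python
MASK64 = (1 << 64) - 1

def clmul64(a, b):
    """Carry-less (XOR-accumulating) 64x64 multiply -> 128-bit result."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def pclmulqdq(src1, src2, imm8):
    """Model of one 128-bit lane of [V]PCLMULQDQ.

    imm8 bit 0 selects the low/high quadword of src1,
    imm8 bit 4 selects the low/high quadword of src2.
    """
    q1 = (src1 >> (64 * (imm8 & 1))) & MASK64
    q2 = (src2 >> (64 * ((imm8 >> 4) & 1))) & MASK64
    return clmul64(q1, q2)
```

Note the difference from an ordinary multiply: partial products are XORed rather than added, so e.g. 3 clmul 3 is 5, not 9.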

> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -255,8 +255,9 @@ def crunch_numbers(state):
>  
>          # This is just the dependency between AVX512 and AVX2 of XSTATE
>          # feature flags.  If want to use AVX512, AVX2 must be supported and
> -        # enabled.
> -        AVX2: [AVX512F],
> +        # enabled.  Certain later extensions, acting on 256-bit vectors of
> +        # integers, better depend on AVX2 than AVX.
> +        AVX2: [AVX512F, VPCLMULQDQ],

Hmm - this is awkward, because in practice, there won't be any hardware
in existence with VPCLMULQDQ and AVX2 but lacking AVX512.

However, the VEX encoding is legitimate in the absence of the EVEX
encoding, and doesn't depend on the AVX512 XCR0 state.
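That distinction is visible in the test harness's XCR0 checks below: the VEX form needs only the SSE and YMM state components enabled (mask 6), whereas the AVX512 features also require the opmask and ZMM components (mask 0xe6). A toy version of the check (mask values taken from the x86-emulate.h hunk):

```python
def xcr0_covers(xcr0, mask):
    """True iff every state component in `mask` is enabled in XCR0."""
    return (xcr0 & mask) == mask

XCR0_AVX_ONLY = 0x7    # x87 + SSE + YMM: enough for VEX vpclmulqdq,
                       # not for any EVEX-encoded AVX512 insn (0xe6).
```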

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich June 21, 2019, 1:44 p.m. UTC | #2
>>> On 21.06.19 at 14:52, <andrew.cooper3@citrix.com> wrote:
> On 15/03/2019 11:05, Jan Beulich wrote:
>> As to the feature dependency adjustment, while strictly speaking AVX is
>> a sufficient prereq (to have YMM registers), 256-bit vectors of integers
>> were fully introduced only with AVX2. Sadly gcc can't be used as a
>> reference here: it doesn't provide any AVX512-independent built-in at
>> all.
>>
>> Along the lines of PCLMULQDQ, since the insns here and in particular
>> their memory access patterns follow the usual scheme, I didn't think it
>> was necessary to add a contrived test specifically for them, beyond the
>> Disp8 scaling one.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> TBD: Should VPCLMULQDQ also depend on PCLMULQDQ?
> 
> I think so, yes.  These are all 64 by 64 multiplies with a 128 bit
> result, and an imm8 to choose which quadwords get used, so both these
> features will be using the silicon block in the vector pipeline.  The
> only difference is whether it's wired through from the legacy SSE
> instructions, or the [E]VEX instructions.
> 
> I certainly don't expect to ever see hardware with VPCLMULQDQ but
> lacking PCLMULQDQ.

Okay, will do, and I'll assume that ...

>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -255,8 +255,9 @@ def crunch_numbers(state):
>>  
>>          # This is just the dependency between AVX512 and AVX2 of XSTATE
>>          # feature flags.  If want to use AVX512, AVX2 must be supported and
>> -        # enabled.
>> -        AVX2: [AVX512F],
>> +        # enabled.  Certain later extensions, acting on 256-bit vectors of
>> +        # integers, better depend on AVX2 than AVX.
>> +        AVX2: [AVX512F, VPCLMULQDQ],
> 
> Hmm - this is awkward, because in practice, there won't be any hardware
> in existence with VPCLMULQDQ and AVX2 but lacking AVX512.
> 
> However, the VEX encoding is legitimate in the absence of the EVEX
> encoding, and doesn't depend on the AVX512 XCR0 state.
> 
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

... this is meant for a patch with that addition to the dependency tree.

Btw - for GFNI and its VEX/EVEX forms this gets even more interesting.
It didn't even occur to me that there could be new hardware supporting
GFNI but no AVX* whatsoever, but I've been told the current split
between SDM and ISA extensions doc is to reflect exactly this.

Jan
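The effect of the gen-cpuid.py hunk under discussion can be illustrated with a toy transitive closure over a dependency map (feature names abridged; the real script computes this over the full tree): with VPCLMULQDQ listed under AVX2, disabling AVX2 for a guest automatically disables VPCLMULQDQ as well.

```python
# Miniature of gen-cpuid.py's dependency handling: each key maps to the
# features that require it, so clearing a feature must also clear
# everything transitively below it.
deps = {
    "AVX": ["AVX2"],
    "AVX2": ["AVX512F", "VPCLMULQDQ"],
    "AVX512F": ["AVX512VL"],
}

def deep_deps(feature):
    """Transitive closure of features depending on `feature`."""
    out = set()
    for child in deps.get(feature, []):
        out.add(child)
        out |= deep_deps(child)
    return out
```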

Patch

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -591,6 +591,10 @@  static const struct test avx512_vpopcntd
     INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
 };
 
+static const struct test vpclmulqdq_all[] = {
+    INSN(pclmulqdq, 66, 0f3a, 44, vl, q_nb, vl)
+};
+
 static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
 static const unsigned char vl_128[] = { VL_128 };
 static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -968,4 +972,9 @@  void evex_disp8_test(void *instr, struct
     RUN(avx512_vbmi2, all);
     RUN(avx512_vnni, all);
     RUN(avx512_vpopcntdq, all);
+
+    if ( cpu_has_avx512f )
+    {
+        RUN(vpclmulqdq, all);
+    }
 }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -144,6 +144,7 @@  static inline bool xcr0_mask(uint64_t ma
 #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
 #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_vpclmulqdq (cp.feat.vpclmulqdq && xcr0_mask(6))
 #define cpu_has_avx512_vnni (cp.feat.avx512_vnni && xcr0_mask(0xe6))
 #define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
 #define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -594,7 +594,7 @@  static const struct ext0f3a_table {
     [0x3e ... 0x3f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x40 ... 0x41] = { .simd_size = simd_packed_fp },
     [0x42 ... 0x43] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
-    [0x44] = { .simd_size = simd_packed_int },
+    [0x44] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
     [0x46] = { .simd_size = simd_packed_int },
     [0x48 ... 0x49] = { .simd_size = simd_packed_fp, .four_op = 1 },
     [0x4a ... 0x4b] = { .simd_size = simd_packed_fp, .four_op = 1 },
@@ -1922,6 +1922,7 @@  static bool vcpu_has(
 #define vcpu_has_avx512vl()    vcpu_has(         7, EBX, 31, ctxt, ops)
 #define vcpu_has_avx512_vbmi() vcpu_has(         7, ECX,  1, ctxt, ops)
 #define vcpu_has_avx512_vbmi2() vcpu_has(        7, ECX,  6, ctxt, ops)
+#define vcpu_has_vpclmulqdq()  vcpu_has(         7, ECX, 10, ctxt, ops)
 #define vcpu_has_avx512_vnni() vcpu_has(         7, ECX, 11, ctxt, ops)
 #define vcpu_has_avx512_bitalg() vcpu_has(       7, ECX, 12, ctxt, ops)
 #define vcpu_has_avx512_vpopcntdq() vcpu_has(    7, ECX, 14, ctxt, ops)
@@ -10219,13 +10220,19 @@  x86_emulate(
         goto opmask_shift_imm;
 
     case X86EMUL_OPC_66(0x0f3a, 0x44):     /* pclmulqdq $imm8,xmm/m128,xmm */
-    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,xmm/m128,xmm,xmm */
+    case X86EMUL_OPC_VEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         host_and_vcpu_must_have(pclmulqdq);
         if ( vex.opcx == vex_none )
             goto simd_0f3a_common;
-        generate_exception_if(vex.l, EXC_UD);
+        if ( vex.l )
+            host_and_vcpu_must_have(vpclmulqdq);
         goto simd_0f_imm8_avx;
 
+    case X86EMUL_OPC_EVEX_66(0x0f3a, 0x44): /* vpclmulqdq $imm8,[xyz]mm/mem,[xyz]mm,[xyz]mm */
+        host_and_vcpu_must_have(vpclmulqdq);
+        generate_exception_if(evex.brs || evex.opmsk, EXC_UD);
+        goto avx512f_imm8_no_sae;
+
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4a): /* vblendvps {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
     case X86EMUL_OPC_VEX_66(0x0f3a, 0x4b): /* vblendvpd {x,y}mm,{x,y}mm/mem,{x,y}mm,{x,y}mm */
         generate_exception_if(vex.w, EXC_UD);
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -112,6 +112,7 @@ 
 /* CPUID level 0x00000007:0.ecx */
 #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
 #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_vpclmulqdq      boot_cpu_has(X86_FEATURE_VPCLMULQDQ)
 #define cpu_has_avx512_vnni     boot_cpu_has(X86_FEATURE_AVX512_VNNI)
 #define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
 #define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -121,7 +121,7 @@  XEN_CPUFEATURE(PBE,           0*32+31) /
 
 /* Intel-defined CPU features, CPUID level 0x00000001.ecx, word 1 */
 XEN_CPUFEATURE(SSE3,          1*32+ 0) /*A  Streaming SIMD Extensions-3 */
-XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less mulitplication */
+XEN_CPUFEATURE(PCLMULQDQ,     1*32+ 1) /*A  Carry-less multiplication */
 XEN_CPUFEATURE(DTES64,        1*32+ 2) /*   64-bit Debug Store */
 XEN_CPUFEATURE(MONITOR,       1*32+ 3) /*   Monitor/Mwait support */
 XEN_CPUFEATURE(DSCPL,         1*32+ 4) /*   CPL Qualified Debug Store */
@@ -229,6 +229,7 @@  XEN_CPUFEATURE(UMIP,          6*32+ 2) /
 XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
 XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
 XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(VPCLMULQDQ,    6*32+10) /*A  Vector Carry-less Multiplication Instrs */
 XEN_CPUFEATURE(AVX512_VNNI,   6*32+11) /*A  Vector Neural Network Instrs */
 XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -255,8 +255,9 @@  def crunch_numbers(state):
 
         # This is just the dependency between AVX512 and AVX2 of XSTATE
         # feature flags.  If want to use AVX512, AVX2 must be supported and
-        # enabled.
-        AVX2: [AVX512F],
+        # enabled.  Certain later extensions, acting on 256-bit vectors of
+        # integers, better depend on AVX2 than AVX.
+        AVX2: [AVX512F, VPCLMULQDQ],
 
         # AVX512F is taken to mean hardware support for 512bit registers
         # (which in practice depends on the EVEX prefix to encode) as well