diff mbox series

[v9,11/23] x86emul: support of AVX512* population count insns

Message ID 745c685a-a614-6067-946d-c89fe98cb589@suse.com
State Superseded
Series x86emul: remaining AVX512 support

Commit Message

Jan Beulich July 1, 2019, 11:22 a.m. UTC
Plus the only other AVX512_BITALG one.

As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v9: Re-base.
v7: Re-base.
v6: New.
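[Editorial note: the Disp8 scaling test mentioned above exercises EVEX compressed-displacement handling. As a rough reference for readers, the scaling rule for the "full vector" tuple type can be sketched as follows; this is an illustrative model only, not the test harness's code, and the no-broadcast/broadcast split is the general EVEX rule (the byte/word forms here UD on broadcast, per the patch below).]

```python
# Minimal model of EVEX Disp8*N compressed-displacement scaling for the
# "full vector" tuple type. Assumption: without embedded broadcast, N is
# the full vector length in bytes; with broadcast, N is the element size.

def disp8_scale(vector_bytes, elem_bytes, broadcast=False):
    """Return the factor N by which the encoded disp8 is multiplied."""
    return elem_bytes if broadcast else vector_bytes

def effective_disp(disp8, vector_bytes, elem_bytes, broadcast=False):
    """Effective memory displacement for an encoded 8-bit displacement."""
    return disp8 * disp8_scale(vector_bytes, elem_bytes, broadcast)

# A disp8 of 1 on a 64-byte (zmm) operand addresses base + 64 ...
print(effective_disp(1, 64, 4))        # -> 64
# ... but only base + 4 when a 32-bit element is broadcast.
print(effective_disp(1, 64, 4, broadcast=True))  # -> 4
```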

Comments

Andrew Cooper July 4, 2019, 2:47 p.m. UTC | #1
On 01/07/2019 12:22, Jan Beulich wrote:
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -268,7 +268,7 @@ def crunch_numbers(state):
>           # AVX512 extensions acting on vectors of bytes/words are made
>           # dependents of AVX512BW (as to requiring wider than 16-bit mask
>           # registers), despite the SDM not formally making this connection.
> -        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
> +        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],

BITALG should be after VBMI2, because everything in this table is
ordered by bit number.

With this fixed, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich July 4, 2019, 2:54 p.m. UTC | #2
On 04.07.2019 16:47, Andrew Cooper wrote:
> On 01/07/2019 12:22, Jan Beulich wrote:
>> --- a/xen/tools/gen-cpuid.py
>> +++ b/xen/tools/gen-cpuid.py
>> @@ -268,7 +268,7 @@ def crunch_numbers(state):
>>            # AVX512 extensions acting on vectors of bytes/words are made
>>            # dependents of AVX512BW (as to requiring wider than 16-bit mask
>>            # registers), despite the SDM not formally making this connection.
>> -        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
>> +        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],
> 
> BITALG should be after VBMI2, because everything in this table is
> ordered by bit number.

As said before - there's no ordering by bit number possible here.
The individual features may live on different (sub)leaves. By
what you say BF16 shouldn't be first. The list here clearly is
sorted alphabetically, and imo that's the only future proof sorting
possible (and also for AVX512F, where I had previously offered to
put together a patch to switch to alphabetical ordering, if only we
could agree on that).

> With this fixed, Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

As per above I'm not going to apply this without hearing back from
you.

Jan
Andrew Cooper July 4, 2019, 6:38 p.m. UTC | #3
On 04/07/2019 15:54, Jan Beulich wrote:
> On 04.07.2019 16:47, Andrew Cooper wrote:
>> On 01/07/2019 12:22, Jan Beulich wrote:
>>> --- a/xen/tools/gen-cpuid.py
>>> +++ b/xen/tools/gen-cpuid.py
>>> @@ -268,7 +268,7 @@ def crunch_numbers(state):
>>>            # AVX512 extensions acting on vectors of bytes/words are made
>>>            # dependents of AVX512BW (as to requiring wider than 16-bit mask
>>>            # registers), despite the SDM not formally making this connection.
>>> -        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
>>> +        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],
>> BITALG should be after VBMI2, because everything in this table is
>> ordered by bit number.
> As said before - there's no ordering by bit number possible here.

It's perfectly easy.  Each feature has a unique number.

> The individual features may live on different (sub)leaves. By
> what you say BF16 shouldn't be first. The list here clearly is
> sorted alphabetically, and imo that's the only future proof sorting
> possible (and also for AVX512F, where I had previously offered to
> put together a patch to switch to alphabetical ordering, if only we
> could agree on that).

In which case I missed it during review.

This feature matrix is deliberately sorted by feature number in an
effort to preserve chronology, which is a much more useful way of
reasoning about feature dependencies.

~Andrew
Jan Beulich July 5, 2019, 8:10 a.m. UTC | #4
On 04.07.2019 20:38, Andrew Cooper wrote:
> On 04/07/2019 15:54, Jan Beulich wrote:
>> On 04.07.2019 16:47, Andrew Cooper wrote:
>>> On 01/07/2019 12:22, Jan Beulich wrote:
>>>> --- a/xen/tools/gen-cpuid.py
>>>> +++ b/xen/tools/gen-cpuid.py
>>>> @@ -268,7 +268,7 @@ def crunch_numbers(state):
>>>>             # AVX512 extensions acting on vectors of bytes/words are made
>>>>             # dependents of AVX512BW (as to requiring wider than 16-bit mask
>>>>             # registers), despite the SDM not formally making this connection.
>>>> -        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
>>>> +        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],
>>> BITALG should be after VBMI2, because everything in this table is
>>> ordered by bit number.
>> As said before - there's no ordering by bit number possible here.
> 
> It's perfectly easy.  Each feature has a unique number.

Well, okay, for sub-leaves of the same main leaf I can accept
this. But what sorting do you suggest between basic and extended
leaves?

>> The individual features may live on different (sub)leaves. By
>> what you say BF16 shouldn't be first. The list here clearly is
>> sorted alphabetically, and imo that's the only future proof sorting
>> possible (and also for AVX512F, where I had previously offered to
>> put together a patch to switch to alphabetical ordering, if only we
>> could agree on that).
> 
> In which case I missed it during review.
> 
> This feature matrix is deliberately sorted by feature number in an
> effort to preserve chronology, which is a much more useful way of
> reasoning about feature dependencies.

Except that bit numbers are often, but not always, an indication of
chronological order.

While I clearly disagree, for there to be progress here, do you
expect me to re-arrange the dependency list above, i.e. going
beyond your initial request? I certainly object to doing _just_
the originally requested adjustment ...

Jan
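[Editorial note: for readers unfamiliar with the instructions being emulated in the patch below, their per-element semantics can be modelled as a lane-wise population count with the usual AVX-512 merge/zero write-masking. This is an illustrative reference model, not the emulator's implementation.]

```python
def vpopcnt(src, elem_bits, mask=None, zero_masking=False, dst=None):
    """Model of VPOPCNT{B,W,D,Q}: per-element population count.

    src/dst are lists of elem_bits-wide integers. When a write-mask is
    given, a clear mask bit either preserves the old destination element
    (merge masking) or zeroes it (zero masking).
    """
    if dst is None:
        dst = [0] * len(src)
    out = []
    for i, v in enumerate(src):
        if mask is None or (mask >> i) & 1:
            # Count set bits within the element's width.
            out.append(bin(v & ((1 << elem_bits) - 1)).count("1"))
        else:
            out.append(0 if zero_masking else dst[i])
    return out

print(vpopcnt([0xFF, 0x0F, 0x00, 0x80], 8))                    # [8, 4, 0, 1]
print(vpopcnt([0xFF, 0x0F], 8, mask=0b01, zero_masking=True))  # [8, 0]
```

The byte/word forms (opcode 0x54) belong to AVX512_BITALG, while the dword/qword forms (0x55) belong to AVX512_VPOPCNTDQ, which is why the patch gates the two cases on different feature checks.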
Patch

--- a/tools/tests/x86_emulator/evex-disp8.c
+++ b/tools/tests/x86_emulator/evex-disp8.c
@@ -538,6 +538,11 @@  static const struct test avx512pf_512[]
      INSNX(scatterpf1q, 66, 0f38, c7, 6, vl, sd, el),
  };
  
+static const struct test avx512_bitalg_all[] = {
+    INSN(popcnt,      66, 0f38, 54, vl, bw, vl),
+    INSN(pshufbitqmb, 66, 0f38, 8f, vl,  b, vl),
+};
+
  static const struct test avx512_vbmi_all[] = {
      INSN(permb,         66, 0f38, 8d, vl, b, vl),
      INSN(permi2b,       66, 0f38, 75, vl, b, vl),
@@ -550,6 +555,10 @@  static const struct test avx512_vbmi2_al
      INSN(pexpand,   66, 0f38, 62, vl, bw, el),
  };
  
+static const struct test avx512_vpopcntdq_all[] = {
+    INSN(popcnt, 66, 0f38, 55, vl, dq, vl)
+};
+
  static const unsigned char vl_all[] = { VL_512, VL_128, VL_256 };
  static const unsigned char vl_128[] = { VL_128 };
  static const unsigned char vl_no128[] = { VL_512, VL_256 };
@@ -919,6 +928,8 @@  void evex_disp8_test(void *instr, struct
      RUN(avx512er, 512);
  #define cpu_has_avx512pf cpu_has_avx512f
      RUN(avx512pf, 512);
+    RUN(avx512_bitalg, all);
      RUN(avx512_vbmi, all);
      RUN(avx512_vbmi2, all);
+    RUN(avx512_vpopcntdq, all);
  }
--- a/tools/tests/x86_emulator/x86-emulate.h
+++ b/tools/tests/x86_emulator/x86-emulate.h
@@ -143,6 +143,8 @@  static inline bool xcr0_mask(uint64_t ma
  #define cpu_has_avx512vl  (cp.feat.avx512vl && xcr0_mask(0xe6))
  #define cpu_has_avx512_vbmi (cp.feat.avx512_vbmi && xcr0_mask(0xe6))
  #define cpu_has_avx512_vbmi2 (cp.feat.avx512_vbmi2 && xcr0_mask(0xe6))
+#define cpu_has_avx512_bitalg (cp.feat.avx512_bitalg && xcr0_mask(0xe6))
+#define cpu_has_avx512_vpopcntdq (cp.feat.avx512_vpopcntdq && xcr0_mask(0xe6))
  
  #define cpu_has_xgetbv1   (cpu_has_xsave && cp.xstate.xgetbv1)
  
--- a/xen/arch/x86/x86_emulate/x86_emulate.c
+++ b/xen/arch/x86/x86_emulate/x86_emulate.c
@@ -479,6 +479,7 @@  static const struct ext0f38_table {
      [0x4d] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
      [0x4e] = { .simd_size = simd_packed_fp, .two_op = 1, .d8s = d8s_vl },
      [0x4f] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
+    [0x54 ... 0x55] = { .simd_size = simd_packed_int, .two_op = 1, .d8s = d8s_vl },
      [0x58] = { .simd_size = simd_other, .two_op = 1, .d8s = 2 },
      [0x59] = { .simd_size = simd_other, .two_op = 1, .d8s = 3 },
      [0x5a] = { .simd_size = simd_128, .two_op = 1, .d8s = 4 },
@@ -501,6 +502,7 @@  static const struct ext0f38_table {
      [0x8c] = { .simd_size = simd_packed_int },
      [0x8d] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
      [0x8e] = { .simd_size = simd_packed_int, .to_mem = 1 },
+    [0x8f] = { .simd_size = simd_packed_int, .d8s = d8s_vl },
      [0x90 ... 0x93] = { .simd_size = simd_other, .vsib = 1, .d8s = d8s_dq },
      [0x96 ... 0x98] = { .simd_size = simd_packed_fp, .d8s = d8s_vl },
      [0x99] = { .simd_size = simd_scalar_vexw, .d8s = d8s_dq },
@@ -1883,6 +1885,8 @@  in_protmode(
  #define vcpu_has_avx512vl()    (ctxt->cpuid->feat.avx512vl)
  #define vcpu_has_avx512_vbmi() (ctxt->cpuid->feat.avx512_vbmi)
  #define vcpu_has_avx512_vbmi2() (ctxt->cpuid->feat.avx512_vbmi2)
+#define vcpu_has_avx512_bitalg() (ctxt->cpuid->feat.avx512_bitalg)
+#define vcpu_has_avx512_vpopcntdq() (ctxt->cpuid->feat.avx512_vpopcntdq)
  #define vcpu_has_rdpid()       (ctxt->cpuid->feat.rdpid)
  
  #define vcpu_must_have(feat) \
@@ -8899,6 +8903,19 @@  x86_emulate(
          generate_exception_if(vex.l, EXC_UD);
          goto simd_0f_avx;
  
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x8f): /* vpshufbitqmb [xyz]mm/mem,[xyz]mm,k{k} */
+        generate_exception_if(evex.w || !evex.r || !evex.R || evex.z, EXC_UD);
+        /* fall through */
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x54): /* vpopcnt{b,w} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_bitalg);
+        generate_exception_if(evex.brs, EXC_UD);
+        elem_bytes = 1 << evex.w;
+        goto avx512f_no_sae;
+
+    case X86EMUL_OPC_EVEX_66(0x0f38, 0x55): /* vpopcnt{d,q} [xyz]mm/mem,[xyz]mm{k} */
+        host_and_vcpu_must_have(avx512_vpopcntdq);
+        goto avx512f_no_sae;
+
      case X86EMUL_OPC_VEX_66(0x0f38, 0x58): /* vpbroadcastd xmm/m32,{x,y}mm */
      case X86EMUL_OPC_VEX_66(0x0f38, 0x59): /* vpbroadcastq xmm/m64,{x,y}mm */
      case X86EMUL_OPC_VEX_66(0x0f38, 0x78): /* vpbroadcastb xmm/m8,{x,y}mm */
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -110,6 +110,8 @@ 
  /* CPUID level 0x00000007:0.ecx */
  #define cpu_has_avx512_vbmi     boot_cpu_has(X86_FEATURE_AVX512_VBMI)
  #define cpu_has_avx512_vbmi2    boot_cpu_has(X86_FEATURE_AVX512_VBMI2)
+#define cpu_has_avx512_bitalg   boot_cpu_has(X86_FEATURE_AVX512_BITALG)
+#define cpu_has_avx512_vpopcntdq boot_cpu_has(X86_FEATURE_AVX512_VPOPCNTDQ)
  #define cpu_has_rdpid           boot_cpu_has(X86_FEATURE_RDPID)
  
  /* CPUID level 0x80000007.edx */
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -229,6 +229,7 @@  XEN_CPUFEATURE(UMIP,          6*32+ 2) /
  XEN_CPUFEATURE(PKU,           6*32+ 3) /*H  Protection Keys for Userspace */
  XEN_CPUFEATURE(OSPKE,         6*32+ 4) /*!  OS Protection Keys Enable */
  XEN_CPUFEATURE(AVX512_VBMI2,  6*32+ 6) /*A  Additional AVX-512 Vector Byte Manipulation Instrs */
+XEN_CPUFEATURE(AVX512_BITALG, 6*32+12) /*A  Support for VPOPCNT[B,W] and VPSHUFBITQMB */
  XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A  POPCNT for vectors of DW/QW */
  XEN_CPUFEATURE(RDPID,         6*32+22) /*A  RDPID instruction */
  
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -268,7 +268,7 @@  def crunch_numbers(state):
          # AVX512 extensions acting on vectors of bytes/words are made
          # dependents of AVX512BW (as to requiring wider than 16-bit mask
          # registers), despite the SDM not formally making this connection.
-        AVX512BW: [AVX512_BF16, AVX512_VBMI, AVX512_VBMI2],
+        AVX512BW: [AVX512_BF16, AVX512_BITALG, AVX512_VBMI, AVX512_VBMI2],
  
          # The features:
          #   * Single Thread Indirect Branch Predictors
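[Editorial note: independent of the ordering debate above, the deps table edited by this hunk is consumed by gen-cpuid.py to compute which features must be disabled alongside a parent feature. A simplified sketch of that transitive-closure computation follows; function names are hypothetical and the real script differs in detail.]

```python
def transitive_deps(deps):
    """Invert a {feature: [direct dependents]} map into
    {feature: set of all transitive dependents}."""
    closure = {}

    def collect(feat):
        if feat not in closure:
            closure[feat] = set()
            for child in deps.get(feat, []):
                closure[feat].add(child)
                closure[feat] |= collect(child)
        return closure[feat]

    for f in list(deps):
        collect(f)
    return closure

# Disabling AVX512BW must also disable everything that depends on it,
# transitively; the list ordering in the table does not affect the result.
deps = {"AVX512F": ["AVX512BW"],
        "AVX512BW": ["AVX512_BF16", "AVX512_BITALG"]}
c = transitive_deps(deps)
print(sorted(c["AVX512F"]))  # ['AVX512BW', 'AVX512_BF16', 'AVX512_BITALG']
```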