diff mbox series

[7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions

Message ID 1559816130-17113-8-git-send-email-stefan.brankovic@rt-rk.com (mailing list archive)
State New, archived
Headers show
Series Optimize emulation of ten Altivec instructions: lvsl,

Commit Message

Stefan Brankovic June 6, 2019, 10:15 a.m. UTC
Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
This instruction counts the number of leading zeros of each halfword element
in the source register and places the result in the appropriate halfword
element of the destination register.

In each iteration of the outer for loop, we perform the count operation on
one doubleword element of source register vB. In the first iteration, we
place the higher doubleword element of vB in the variable avr, then we
perform the count for every halfword element using tcg_gen_clzi_i64. Since
it counts leading zeros over a 64-bit length, we have to move the i-th
halfword element to the highest 16 bits of tmp, OR it with a mask (so we
get all ones in the lowest 48 bits), then perform tcg_gen_clzi_i64 and
move its result into the appropriate halfword element of the result. We do
this in the inner for loop. After the operation is finished, we save the
result in the appropriate doubleword element of destination register vD.
We repeat this once again for the lower doubleword element of vB.

Optimize Altivec instruction vclzb (Vector Count Leading Zeros Byte).
This instruction counts the number of leading zeros of each byte element
in the source register and places the result in the appropriate byte
element of the destination register.

In each iteration of the outer for loop, we perform the count operation on
one doubleword element of source register vB. In the first iteration, we
place the higher doubleword element of vB in the variable avr, then we
perform the count for every byte element using tcg_gen_clzi_i64. Since it
counts leading zeros over a 64-bit length, we have to move the i-th byte
element to the highest 8 bits of the variable tmp, OR it with a mask (so
we get all ones in the lowest 56 bits), then perform tcg_gen_clzi_i64 and
move its result into the appropriate byte element of the result. We do
this in the inner for loop. After the operation is finished, we save the
result in the appropriate doubleword element of destination register vD.
We repeat this once again for the lower doubleword element of vB.
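The per-element trick described above for both vclzh and vclzb can be
sketched in Python. This is a hypothetical scalar model of the TCG
sequence (the function name vector_clz and its parameters are
illustrative, not part of the patch):

```python
def vector_clz(doubleword, elem_bits):
    """Model of the per-element clz trick: shift each element to the top
    of a 64-bit value, OR in a mask of all ones below it (so the count
    can never exceed elem_bits), then take a 64-bit clz."""
    nelem = 64 // elem_bits
    # All ones in the lowest 64 - elem_bits bits (e.g. 56 ones for bytes).
    mask = (1 << (64 - elem_bits)) - 1
    result = 0
    for i in range(nelem):
        elem_shift = i * elem_bits
        # Move the i-th element to the highest elem_bits of tmp.
        tmp = (doubleword << (64 - elem_bits - elem_shift)) & 0xFFFFFFFFFFFFFFFF
        tmp |= mask
        clz = 64 - tmp.bit_length()  # 64-bit count-leading-zeros
        # Deposit the count into the matching element of the result.
        result |= clz << elem_shift
    return result
```

A zero element leaves tmp equal to the mask alone, so the 64-bit clz
returns exactly elem_bits, matching the architectural result for a
zero byte or halfword.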

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 122 +++++++++++++++++++++++++++++++++++-
 1 file changed, 120 insertions(+), 2 deletions(-)

Comments

Richard Henderson June 6, 2019, 8:38 p.m. UTC | #1
On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
> This instruction counts the number of leading zeros of each halfword element
> in the source register and places the result in the appropriate halfword
> element of the destination register.
For halfword, you're generating 32 operations.  A loop over the halfwords,
similar to the word loop I suggested for the last patch, does not reduce this
total, since one has to adjust the clz32 result.

For byte, you're generating 64 operations.

These expansions are so big that without host vector support it's probably best
to leave them out-of-line.

I can imagine a byte clz expansion like

	t0 = input >> 4;
	t1 = input << 4;
	cmp = input == 0 ? -1 : 0;
	input = cmp ? t1 : input;
	output = cmp & 4;

	t0 = input >> 6;
	t1 = input << 2;
	cmp = input == 0 ? -1 : 0;
	input = cmp ? t1 : input;
	t0 = cmp & 2;
	output += t0;

	t1 = input << 1;
	cmp = input >= 0 ? -1 : 0;
	output -= cmp;

	cmp = input == 0 ? -1 : 0;
	output -= cmp;

which would expand to 20 x86_64 vector instructions.  A halfword expansion
would require one more round and thus 25 instructions.
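Read literally, the first two `input == 0` tests above would compare the
unshifted input; they appear to be shorthand for testing the shifted-out
high bits (t0). Under that reading, the staged expansion can be modelled
and checked exhaustively in Python (a hypothetical scalar model, not
vector code):

```python
def clz8(x):
    """Scalar model of the staged byte-clz expansion: a binary search
    for the leading-zero count using only shifts, compares producing
    0/-1 masks, selects, and adds, as a vector unit would."""
    # Round 1: if the high nibble is zero, shift left 4 and add 4.
    t0, t1 = x >> 4, (x << 4) & 0xFF
    cmp = -1 if t0 == 0 else 0
    x = t1 if cmp else x
    out = cmp & 4
    # Round 2: if the (new) top two bits are zero, shift left 2 and add 2.
    t0, t1 = x >> 6, (x << 2) & 0xFF
    cmp = -1 if t0 == 0 else 0
    x = t1 if cmp else x
    out += cmp & 2
    # Round 3: a clear top bit ("input >= 0" as a signed byte) adds 1.
    cmp = -1 if x < 0x80 else 0
    out -= cmp
    # A zero input also picks up this final +1, giving clz(0) = 8.
    cmp = -1 if x == 0 else 0
    out -= cmp
    return out
```

Checking all 256 byte values against 8 - x.bit_length() confirms the
expansion is exact under this interpretation.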

I'll also note that ARM, Power8, and S390 all support this as a native vector
operation; only x86_64 would require the above expansion.  It probably makes
sense to add this operation to tcg.


r~
Stefan Brankovic June 17, 2019, 11:42 a.m. UTC | #2
On 6.6.19. 22:38, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
>> This instruction counts the number of leading zeros of each halfword element
>> in the source register and places the result in the appropriate halfword
>> element of the destination register.
> For halfword, you're generating 32 operations.  A loop over the halfwords,
> similar to the word loop I suggested for the last patch, does not reduce this
> total, since one has to adjust the clz32 result.
>
> For byte, you're generating 64 operations.
>
> These expansions are so big that without host vector support it's probably best
> to leave them out-of-line.
>
> I can imagine a byte clz expansion like
>
> 	t0 = input >> 4;
> 	t1 = input << 4;
> 	cmp = input == 0 ? -1 : 0;
> 	input = cmp ? t1 : input;
> 	output = cmp & 4;
>
> 	t0 = input >> 6;
> 	t1 = input << 2;
> 	cmp = input == 0 ? -1 : 0;
> 	input = cmp ? t1 : input;
> 	t0 = cmp & 2;
> 	output += t0;
>
> 	t1 = input << 1;
> 	cmp = input >= 0 ? -1 : 0;
> 	output -= cmp;
>
> 	cmp = input == 0 ? -1 : 0;
> 	output -= cmp;
>
> which would expand to 20 x86_64 vector instructions.  A halfword expansion
> would require one more round and thus 25 instructions.

I based this patch on performance results, and my measurements say that 
the tcg implementation is still significantly faster than the helper 
implementation, despite the somewhat large number of instructions.

I can attach both the performance measurement results and the disassembly 
of the helper and tcg implementations, if you want me to.

>
> I'll also note that ARM, Power8, and S390 all support this as a native vector
> operation; only x86_64 would require the above expansion.  It probably makes
> sense to add this operation to tcg.

I agree with this, but currently we don't have this implemented in tcg, 
so I worked with what I have.

Kind Regards,

Stefan

> r~

Patch

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 7689739..8535a31 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -878,6 +878,124 @@  static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzb VRT,VRB - Vector Count Leading Zeros Byte
+ *
+ * Count the number of leading zero bits of each byte element in the source
+ * register and place the result in the appropriate byte element of the
+ * destination register.
+ */
+static void trans_vclzb(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every byte element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over a 64-bit length, we move the
+         * i-th byte element to the highest 8 bits of tmp, OR it with a mask
+         * (so we get all ones in the lowest 56 bits), then perform
+         * tcg_gen_clzi_i64 and move its result into the appropriate byte
+         */
+        tcg_gen_shli_i64(tmp, avr, 56);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 7; j++) {
+            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
+ * vclzh VRT,VRB - Vector Count Leading Zeros Halfword
+ *
+ * Count the number of leading zero bits of each halfword element in the
+ * source register and place the result in the appropriate halfword element
+ * of the destination register.
+ */
+static void trans_vclzh(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every halfword element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over a 64-bit length, we move the
+         * i-th halfword element to the highest 16 bits of tmp, OR it with a
+         * mask (so we get all ones in the lowest 48 bits), then perform
+         * tcg_gen_clzi_i64 and move its result into the appropriate halfword
+         */
+        tcg_gen_shli_i64(tmp, avr, 48);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 3; j++) {
+            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzw VRT,VRB - Vector Count Leading Zeros Word
  *
  * Counting the number of leading zero bits of each word element in source
@@ -1466,8 +1584,8 @@  GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
 GEN_VAFORM_PAIRED(vsel, vperm, 21)
 GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
-GEN_VXFORM_NOA(vclzb, 1, 28)
-GEN_VXFORM_NOA(vclzh, 1, 29)
+GEN_VXFORM_TRANS(vclzb, 1, 28)
+GEN_VXFORM_TRANS(vclzh, 1, 29)
 GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)