diff mbox series

Add support for a helper with 7 arguments

Message ID 1580942510-2820-1-git-send-email-tsimpson@quicinc.com (mailing list archive)
State New, archived
Headers show
Series Add support for a helper with 7 arguments | expand

Commit Message

Taylor Simpson Feb. 5, 2020, 10:41 p.m. UTC
Currently, helpers can only take up to 6 arguments.  This patch adds the
capability for up to 7 arguments.  I have tested it with the Hexagon port
that I am preparing for submission.

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 include/exec/helper-gen.h   | 13 +++++++++++++
 include/exec/helper-head.h  |  2 ++
 include/exec/helper-proto.h |  6 ++++++
 include/exec/helper-tcg.h   |  7 +++++++
 4 files changed, 28 insertions(+)

Comments

Richard Henderson Feb. 6, 2020, 6:02 a.m. UTC | #1
On 2/5/20 10:41 PM, Taylor Simpson wrote:
> Currently, helpers can only take up to 6 arguments.  This patch adds the
> capability for up to 7 arguments.  I have tested it with the Hexagon port
> that I am preparing for submission.

This is not safe, in general, without other changes.

>From include/tcg/tcg.h:

> /* While we limit helpers to 6 arguments, for 32-bit hosts, with padding,
>    this imples a max of 6*2 (64-bit in) + 2 (64-bit out) = 14 operands.
>    There are never more than 2 outputs, which means that we can store all
>    dead + sync data within 16 bits.  */
> #define DEAD_ARG  4
> #define SYNC_ARG  1
> typedef uint16_t TCGLifeData;

Thus 7 uint64_t inputs, on a 32-bit host, will overflow TCGLifeData.

What are you doing that requires so many arguments?


r~
Richard Henderson Feb. 6, 2020, 10:28 a.m. UTC | #2
On 2/6/20 6:02 AM, Richard Henderson wrote:
> On 2/5/20 10:41 PM, Taylor Simpson wrote:
>> Currently, helpers can only take up to 6 arguments.  This patch adds the
>> capability for up to 7 arguments.  I have tested it with the Hexagon port
>> that I am preparing for submission.
> 
> This is not safe, in general, without other changes.
> 
>>From include/tcg/tcg.h:
> 
>> /* While we limit helpers to 6 arguments, for 32-bit hosts, with padding,
>>    this imples a max of 6*2 (64-bit in) + 2 (64-bit out) = 14 operands.
>>    There are never more than 2 outputs, which means that we can store all
>>    dead + sync data within 16 bits.  */
>> #define DEAD_ARG  4
>> #define SYNC_ARG  1
>> typedef uint16_t TCGLifeData;
> 
> Thus 7 uint64_t inputs, on a 32-bit host, will overflow TCGLifeData.

My bad, no it won't.  For some reason I had two outputs in my head, but they'll
both be uint32_t not two uint64_t.  7 uint64_t inputs with 1 uint64_t output
will *just* fit.


> What are you doing that requires so many arguments?

But I'd still like to know why you need so many.


r~
Taylor Simpson Feb. 6, 2020, 2:03 p.m. UTC | #3
Some of the more complex instructions need a lot of operands.  Here's an example
    if (Pv4) memb(Rs32 + Ru32 << #u2) = Rt32
This is a predicated store with 5 operands:
    Pv4predicate
    Rs32, Ru32, u2used to compute the effective address
    Rt32value to store
In addition, every helper gets an env argument, and predicated instructions get a "slot" argument.  The slot argument refers to the VLIW slot where the instruction is located within the packet.  It is used for predicated instructions to communicate to the end-of-packet handling to determine whether the instruction should commit.

So, the signature for the helper for this instruction is
    void HELPER(S4_pstorerbt_rr)(CPUHexagonState *env, int32_t PvV, int32_t RsV, int32_t RuV, int32_t RtV, int32_t uiV, uint32_t slot)

HTH,
Taylor

> -----Original Message-----
> From: Richard Henderson <rth7680@gmail.com> On Behalf Of Richard
> Henderson
> Sent: Thursday, February 6, 2020 4:29 AM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [PATCH] Add support for a helper with 7 arguments
>
> -------------------------------------------------------------------------
> CAUTION: This email originated from outside of the organization.
> -------------------------------------------------------------------------
>
> On 2/6/20 6:02 AM, Richard Henderson wrote:
> > On 2/5/20 10:41 PM, Taylor Simpson wrote:
> >> Currently, helpers can only take up to 6 arguments.  This patch adds the
> >> capability for up to 7 arguments.  I have tested it with the Hexagon port
> >> that I am preparing for submission.
> >
> > This is not safe, in general, without other changes.
> >
> >>From include/tcg/tcg.h:
> >
> >> /* While we limit helpers to 6 arguments, for 32-bit hosts, with padding,
> >>    this imples a max of 6*2 (64-bit in) + 2 (64-bit out) = 14 operands.
> >>    There are never more than 2 outputs, which means that we can store all
> >>    dead + sync data within 16 bits.  */
> >> #define DEAD_ARG  4
> >> #define SYNC_ARG  1
> >> typedef uint16_t TCGLifeData;
> >
> > Thus 7 uint64_t inputs, on a 32-bit host, will overflow TCGLifeData.
>
> My bad, no it won't.  For some reason I had two outputs in my head, but
> they'll
> both be uint32_t not two uint64_t.  7 uint64_t inputs with 1 uint64_t output
> will *just* fit.
>
>
> > What are you doing that requires so many arguments?
>
> But I'd still like to know why you need so many.
>
>
> r~
Richard Henderson Feb. 6, 2020, 3:35 p.m. UTC | #4
On 2/6/20 2:03 PM, Taylor Simpson wrote:
> Some of the more complex instructions need a lot of operands.  Here's an example
>     if (Pv4) memb(Rs32 + Ru32 << #u2) = Rt32
> This is a predicated store with 5 operands:
>     Pv4predicate
>     Rs32, Ru32, u2used to compute the effective address
>     Rt32value to store
> In addition, every helper gets an env argument, and predicated instructions get a "slot" argument.  The slot argument refers to the VLIW slot where the instruction is located within the packet.  It is used for predicated instructions to communicate to the end-of-packet handling to determine whether the instruction should commit.
> 
> So, the signature for the helper for this instruction is
>     void HELPER(S4_pstorerbt_rr)(CPUHexagonState *env, int32_t PvV, int32_t RsV, int32_t RuV, int32_t RtV, int32_t uiV, uint32_t slot)


I think this is quite ugly.  I know you've been talking about auto-generating
everything but we ought to do better than this.

You should be passing values not regnos if you can possibly do so.  You should
be passing full virtual addresses not N separate components of an address.
Predicates should be evaluated earlier so that the helper isn't even called if
it's false.

Combine that with 3.3.1 Packet execution semantics, "dual stores, new-value
stores, and slot1 store with slot0 loads have non-parallel execution
semantics", and you need no special helper at all:

	and	t0, pv, 1
	brcondi	t0, 0, over

	shli	t0, ru, u2
	add	t0, t0, rs
	qemu_st	rt, t0, mmu_idx, MO_UB
 over:

But suppose this were something more complicated than a bare store, and the
point still stands about pre-computing full addresses.


r~
Taylor Simpson Feb. 6, 2020, 5:52 p.m. UTC | #5
> -----Original Message-----
> From: Richard Henderson <rth7680@gmail.com> On Behalf Of Richard
> Henderson
> Sent: Thursday, February 6, 2020 9:35 AM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [PATCH] Add support for a helper with 7 arguments
>
>
> I think this is quite ugly.  I know you've been talking about auto-generating
> everything but we ought to do better than this.
>
> You should be passing values not regnos if you can possibly do so.  You
> should
> be passing full virtual addresses not N separate components of an address.
> Predicates should be evaluated earlier so that the helper isn't even called if
> it's false.

We are passing values, not reg numbers.  The generator doesn't know anything about the semantics of the instruction.  It only knows which operands are read, written, or both.  So, there's no way to combine the 3 operands into a single effective address until we are inside the helper.  Also, there's no way to know if the instruction is predicated or if it just has a predicate as an operand.  Also, there are instructions where the predicate is used in the false sense.

>
> Combine that with 3.3.1 Packet execution semantics, "dual stores, new-value
> stores, and slot1 store with slot0 loads have non-parallel execution
> semantics", and you need no special helper at all:
>
> andt0, pv, 1
> brcondit0, 0, over
>
> shlit0, ru, u2
> addt0, t0, rs
> qemu_strt, t0, mmu_idx, MO_UB
>  over:
>

We can't actually do the store here.  We have to record it in a side data structure in the env and perform the store when the packet commits.

There is a mechanism to override the call to the helper (fWRAP_<tag>).  If we did override this instruction, the TCG would be similar to what you have.  However, all instructions still get a helper because the generator doesn't know which instructions are overridden.

Currently, there are a total of 35 instructions that need a helper with 7 arguments.  32 of them are predicated stores, but there are also 3 vscatter instructions.  For example,
    if (Qs4) vscatter(Rt32, Mu2, Vv32.w).w = Vw32
In this case, the predicate is a vector used as a mask for each element of the store.  Even if we weren't using a generator, I would do a helper for these because of their complexity.

> But suppose this were something more complicated than a bare store, and
> the
> point still stands about pre-computing full addresses.
>
>
> r~
Richard Henderson Feb. 7, 2020, 12:27 a.m. UTC | #6
On 2/6/20 5:52 PM, Taylor Simpson wrote:
> 
> 
>> -----Original Message-----
>> From: Richard Henderson <rth7680@gmail.com> On Behalf Of Richard
>> Henderson
>> Sent: Thursday, February 6, 2020 9:35 AM
>> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Subject: Re: [PATCH] Add support for a helper with 7 arguments
>>
>>
>> I think this is quite ugly.  I know you've been talking about auto-generating
>> everything but we ought to do better than this.
>>
>> You should be passing values not regnos if you can possibly do so.  You
>> should
>> be passing full virtual addresses not N separate components of an address.
>> Predicates should be evaluated earlier so that the helper isn't even called if
>> it's false.
> 
> We are passing values, not reg numbers.  The generator doesn't know anything about the semantics of the instruction...

And this, I claim, is wrong.

If you can generate C for the operation out of line, then you can interpret it
inline as well.  Or, make a reasonable decision about what bit to be out of
line and what bit to be a helper.

> It only knows which operands are read, written, or both.  So, there's no way
> to combine the 3 operands into a single effective address until we are
> inside the helper.  Also, there's no way to know if the instruction is
> predicated or if it just has a predicate as an operand.  Also, there are
> instructions where the predicate is used in the false sense.
There certainly is a way to know what the effective address is.  From the
pseudo-code that I browsed today, "EA = foo" is a big clue.

There certainly is a way to know if the instruction is predicated.  From the
pseudo-code that I browsed today,

  if (pred) {
    operation;
  } else {
    NOP;
  }

is the clue, vs

  Rd = mux(Pu, Rs, Rt).

Of course there are insns where !Pd is used; that doesn't change anything.

>> Combine that with 3.3.1 Packet execution semantics, "dual stores, new-value
>> stores, and slot1 store with slot0 loads have non-parallel execution
>> semantics", and you need no special helper at all:
>>
>> andt0, pv, 1
>> brcondit0, 0, over
>>
>> shlit0, ru, u2
>> addt0, t0, rs
>> qemu_strt, t0, mmu_idx, MO_UB
>>  over:
>>
> 
> We can't actually do the store here.  We have to record it in a side data structure in the env and perform the store when the packet commits.

I think that we can do the store immediately -- I give specifics above.  Do you
have a counter-example?  Admittedly I'm new to browsing the architecture, but I
don't see a legal packet for which you can't just Store Now.

> Currently, there are a total of 35 instructions that need a helper with 7 arguments.  32 of them are predicated stores, but there are also 3 vscatter instructions.  For example,
>     if (Qs4) vscatter(Rt32, Mu2, Vv32.w).w = Vw32

Ok, vscatter is indeed complex.  But I still count 6 operands.  Is it the
"slot" one that you're concerned is the 7th?  I don't see that as strictly
necessary, since this can only be in slot 0.


r~
Taylor Simpson Feb. 7, 2020, 4:46 a.m. UTC | #7
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Thursday, February 6, 2020 6:28 PM
> To: Taylor Simpson <tsimpson@quicinc.com>; Richard Henderson
> <rth@twiddle.net>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [PATCH] Add support for a helper with 7 arguments
>
> On 2/6/20 5:52 PM, Taylor Simpson wrote:
> >
> >
> >> -----Original Message-----
> >> From: Richard Henderson <rth7680@gmail.com> On Behalf Of Richard
> >> Henderson
> >> Sent: Thursday, February 6, 2020 9:35 AM
> >> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Subject: Re: [PATCH] Add support for a helper with 7 arguments
> >>
> >>
> >> I think this is quite ugly.  I know you've been talking about auto-
> generating
> >> everything but we ought to do better than this.
> >>
> >> You should be passing values not regnos if you can possibly do so.  You
> >> should
> >> be passing full virtual addresses not N separate components of an
> address.
> >> Predicates should be evaluated earlier so that the helper isn't even called
> if
> >> it's false.
> >
> > We are passing values, not reg numbers.  The generator doesn't know
> anything about the semantics of the instruction...
>
> And this, I claim, is wrong.
>
> If you can generate C for the operation out of line, then you can interpret it
> inline as well.  Or, make a reasonable decision about what bit to be out of
> line and what bit to be a helper.
>

The C code for the operation is not generated - it comes as the semantics of the instruction.  The generator only knows how to set up the call to the helper based on the signature of the instruction, and it pastes the C code into the body of the helper.

> > It only knows which operands are read, written, or both.  So, there's no
> way
> > to combine the 3 operands into a single effective address until we are
> > inside the helper.  Also, there's no way to know if the instruction is
> > predicated or if it just has a predicate as an operand.  Also, there are
> > instructions where the predicate is used in the false sense.
> There certainly is a way to know what the effective address is.  From the
> pseudo-code that I browsed today, "EA = foo" is a big clue.
>
> There certainly is a way to know if the instruction is predicated.  From the
> pseudo-code that I browsed today,
>
>   if (pred) {
>     operation;
>   } else {
>     NOP;
>   }
>

I don't see the value of having the generator generate TCG for part of the semantics and using a helper for another part.  We already have a mechanism for overriding the entire instruction.  For instructions where there's a reason not to use the helper, we replace it completely.

> is the clue, vs
>
>   Rd = mux(Pu, Rs, Rt).
>
> Of course there are insns where !Pd is used; that doesn't change anything.
>
> >> Combine that with 3.3.1 Packet execution semantics, "dual stores, new-
> value
> >> stores, and slot1 store with slot0 loads have non-parallel execution
> >> semantics", and you need no special helper at all:
> >>
> >> andt0, pv, 1
> >> brcondit0, 0, over
> >>
> >> shlit0, ru, u2
> >> addt0, t0, rs
> >> qemu_strt, t0, mmu_idx, MO_UB
> >>  over:
> >>
> >
> > We can't actually do the store here.  We have to record it in a side data
> structure in the env and perform the store when the packet commits.
>
> I think that we can do the store immediately -- I give specifics above.  Do you
> have a counter-example?  Admittedly I'm new to browsing the architecture,
> but I
> don't see a legal packet for which you can't just Store Now.

You can have two stores in a packet, and the second one could fault.  If anything in the packet faults, none of the instructions commit.

>
> > Currently, there are a total of 35 instructions that need a helper with 7
> arguments.  32 of them are predicated stores, but there are also 3 vscatter
> instructions.  For example,
> >     if (Qs4) vscatter(Rt32, Mu2, Vv32.w).w = Vw32
>
> Ok, vscatter is indeed complex.  But I still count 6 operands.  Is it the
> "slot" one that you're concerned is the 7th?  I don't see that as strictly
> necessary, since this can only be in slot 0.

Yes, the 7th operand is the slot.  Today, these instructions can only be in slot 0.  We could hard code that, but what happens down the road when the architects change it?  That's a bad tradeoff to make just to avoid allowing 7 arguments to a helper.

>
>
> r~
Richard Henderson Feb. 7, 2020, 8:53 a.m. UTC | #8
On 2/7/20 4:46 AM, Taylor Simpson wrote:
>> I think that we can do the store immediately -- I give specifics above.  Do you
>> have a counter-example?  Admittedly I'm new to browsing the architecture,
>> but I
>> don't see a legal packet for which you can't just Store Now.
> 
> You can have two stores in a packet, and the second one could fault.  If anything in the packet faults, none of the instructions commit.

Then what does the manual mean when it says "dual stores have non-parallel
semantics"?  Is that solely about the semantics of the bytes in memory?


r~
Taylor Simpson Feb. 7, 2020, 11:59 a.m. UTC | #9
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Friday, February 7, 2020 2:53 AM
> To: Taylor Simpson <tsimpson@quicinc.com>; Richard Henderson
> <rth@twiddle.net>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [PATCH] Add support for a helper with 7 arguments
>
> On 2/7/20 4:46 AM, Taylor Simpson wrote:
> >> I think that we can do the store immediately -- I give specifics above.  Do
> you
> >> have a counter-example?  Admittedly I'm new to browsing the
> architecture,
> >> but I
> >> don't see a legal packet for which you can't just Store Now.
> >
> > You can have two stores in a packet, and the second one could fault.  If
> anything in the packet faults, none of the instructions commit.
>
> Then what does the manual mean when it says "dual stores have non-
> parallel
> semantics"?  Is that solely about the semantics of the bytes in memory?

Correct.  For example, this packet
    {
        memw(r5) = r6
        memb(r5) = r7
    }
Will store the word in memory with r6 and then overwrite the first byte with the byte from r7.
Richard Henderson Feb. 7, 2020, 12:14 p.m. UTC | #10
On 2/5/20 10:41 PM, Taylor Simpson wrote:
> Currently, helpers can only take up to 6 arguments.  This patch adds the
> capability for up to 7 arguments.  I have tested it with the Hexagon port
> that I am preparing for submission.
> 
> Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
> ---
>  include/exec/helper-gen.h   | 13 +++++++++++++
>  include/exec/helper-head.h  |  2 ++
>  include/exec/helper-proto.h |  6 ++++++
>  include/exec/helper-tcg.h   |  7 +++++++
>  4 files changed, 28 insertions(+)

Applied to tcg-next, since this is correct, and it does work.

But I encourage you to re-think your purely mechanical approach to the hexagon
port.  It seems to me that you should be doing much more during the translation
phase so that you can minimize the number of helpers that you require.


r~
Taylor Simpson Feb. 7, 2020, 12:43 p.m. UTC | #11
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
>
> But I encourage you to re-think your purely mechanical approach to the
> hexagon
> port.  It seems to me that you should be doing much more during the
> translation
> phase so that you can minimize the number of helpers that you require.
>

There are a couple of things we could do
- Short term: Add #ifdef's to the generated code so that the helper isn't compiled when there is a fWRAP_<tag> defined.  There are currently ~500 instructions where this is the case.
- Long term: Integrate rev.ng's approach that uses flex/bison to parse the semantics and generate TCG code.
- Long long term: A much more general approach will be to turn the C semantics code into LLVM IR and generate TCG from the IR.

Also, note that being able to use either a helper or TCG code is extremely useful for debugging qemu-hexagon.  When a test is failing and you suspect a bug in TCG generation, you can switch to the helper version and see if the test passes.  There have been cases where I used binary search over the fWRAP definitions to figure out which one of them has the bug.  Also, it can be very handy to be able to set a breakpoint inside the helper and examine the CPU state in the middle of TCG execution.

Taylor
Richard Henderson Feb. 7, 2020, 3:49 p.m. UTC | #12
On 2/7/20 12:43 PM, Taylor Simpson wrote:
> 
>> -----Original Message-----
>> From: Richard Henderson <richard.henderson@linaro.org>
>>
>> But I encourage you to re-think your purely mechanical approach to the 
>> hexagon port.  It seems to me that you should be doing much more during
>> the translation phase so that you can minimize the number of helpers that
>> you require.
> 
> There are a couple of things we could do
> - Short term: Add #ifdef's to the generated code so that the helper isn't
>   compiled when there is a fWRAP_<tag> defined.  There are currently ~500
>   instructions where this is the case.

Definitely.


> - Long term: Integrate rev.ng's approach that uses flex/bison to parse the
> semantics and generate TCG code.
There is perhaps an intermediate step that merely special-cases the load/store
insns.  With rare exceptions (hah!) these are the cases that will most often
raise an exception.  Moreover, they are the *only* cases that can raise an
exception without requiring a helper call anyway.

There are a number of cases that I can think of:

	{
	  r6 = memb(r1)
	  r7 = memb(r2)
	}

	qemu_ld   t0, r1, MO_UB, mmu_idx
	qemu_ld   t1, r2, MO_UB, mmu_idx
	mov       r6, t0
	mov       r7, t1

	{
	  r6 = memb(r1)
	  memb(r2) = r7
	}

	qemu_ld   t0, r1, MO_UB, mmu_idx
	qemu_st   r7, r2, MO_UB, mmu_idx
	mov       r6, t0

These being the "normal" case wherein the memops are unconditional, and can
simply use a temp for semantics.  Similarly for MEMOP, NV, or SYSTEM insns in
slot0.

	{
	  r6 = memb(r1)
	  if (p0) r7 = memb(r7)
	}

	qemu_ld   l0, r1, MO_UB, mmu_idx
	andi      t1, p0, 1
        brcondi   t1, 0, L1
        qemu_ld   r7, r2, MO_UB, mmu_idx
 L1:
	mov       r6, l0

For a conditional load in slot 0, we can load directly into the final
destination register and skip the temporary.

Because TCG doesn't do global register allocation, any temporary crossing a
basic block boundary gets flushed to stack.  So this avoids sending the r7
value through an unnecessary round trip.

This works because (obviously) nothing can raise an exception after slot0, and
the only thing that comes after is the commit phase.  This can be extended to a
conditional load in slot1, when we notice that the insn in slot0 cannot raise
an exception.

	{
	  memb(r1) = r3
	  memb(r2) = r4
	}

	call     helper_probe_access, r1, MMU_DATA_STORE, 1
	call     helper_probe_access, r2, MMU_DATA_STORE, 1
	qemu_st  r3, r1, MO_UB, mmu_idx
	qemu_st  r4, r2, MO_UB, mmu_idx

	{
	  memb(r1) = r3
	  r4 = memb(r2)
	}

	call     helper_probe_access, r1, MMU_DATA_STORE, 1
	call     helper_probe_access, r2, MMU_DATA_LOAD, 1
	qemu_st  r3, r1, MO_UB, mmu_idx
	qemu_ld  r4, r2, MO_UB, mmu_idx

These cases with a store in slot1 are irritating, because I see that (1) all
exceptions must be recognized before anything commits, and (2) slot1 exceptions
must have preference over slot0 exceptions.  But we can probe them easily enough.


> - Long long term: A much more general approach will be to turn the C
> semantics code into LLVM IR and generate TCG from the IR.
Why would you imagine this to be more interesting than using flex/bison?


r~
Taylor Simpson Feb. 9, 2020, 5:08 a.m. UTC | #13
> On 2/7/20 12:43 PM, Taylor Simpson wrote:
> >
> >> -----Original Message-----
> >> From: Richard Henderson <richard.henderson@linaro.org>
> >>
> >> But I encourage you to re-think your purely mechanical approach to the
> >> hexagon port.  It seems to me that you should be doing much more
> during
> >> the translation phase so that you can minimize the number of helpers that
> >> you require.
> >
> > There are a couple of things we could do
> > - Short term: Add #ifdef's to the generated code so that the helper isn't
> >   compiled when there is a fWRAP_<tag> defined.  There are currently ~500
> >   instructions where this is the case.
>
> Definitely.

OK, done.

>
> > - Long term: Integrate rev.ng's approach that uses flex/bison to parse the
> > semantics and generate TCG code.
> There is perhaps an intermediate step that merely special-cases the
> load/store
> insns.  With rare exceptions (hah!) these are the cases that will most often
> raise an exception.  Moreover, they are the *only* cases that can raise an
> exception without requiring a helper call anyway.
>
> There are a number of cases that I can think of:
>
> {
>   r6 = memb(r1)
>   r7 = memb(r2)
> }
>
> qemu_ld   t0, r1, MO_UB, mmu_idx
> qemu_ld   t1, r2, MO_UB, mmu_idx
> mov       r6, t0
> mov       r7, t1
>

Here is the TCG we generate currently.
 movi_i32 tmp0,$0x0
 add_i32 loc2,r1,tmp0
 qemu_ld_i32 loc3,loc2,sb,0
 mov_i32 new_value,loc3
 movi_i32 tmp0,$0x0
 add_i32 loc2,r2,tmp0
 qemu_ld_i32 loc3,loc2,sb,0
 mov_i32 new_value,loc3
 mov_i32 r6,new_value
 mov_i32 r7,new_value
I could work on eliminating the add of zero and the extra copies.  Is TCG able to optimize these before emitting the host code?


> {
>   r6 = memb(r1)
>   memb(r2) = r7
> }
>
> qemu_ld   t0, r1, MO_UB, mmu_idx
> qemu_st   r7, r2, MO_UB, mmu_idx
> mov       r6, t0
>
> These being the "normal" case wherein the memops are unconditional, and
> can
> simply use a temp for semantics.  Similarly for MEMOP, NV, or SYSTEM insns
> in
> slot0.
>
> {
>   r6 = memb(r1)
>   if (p0) r7 = memb(r7)
> }
>
> qemu_ld   l0, r1, MO_UB, mmu_idx
> andi      t1, p0, 1
>         brcondi   t1, 0, L1
>         qemu_ld   r7, r2, MO_UB, mmu_idx
>  L1:
> mov       r6, l0
>
> For a conditional load in slot 0, we can load directly into the final
> destination register and skip the temporary.

In general, there will be lots of checks we would need to perform before concluding that an instruction can write directly into the destination.  For example, we have to make sure no other instruction later in the packet reads r7.  Also, we have to check if there are any .new references to r7 - these will read from new_value[7].
Richard Henderson Feb. 9, 2020, 6:17 p.m. UTC | #14
On 2/8/20 5:08 AM, Taylor Simpson wrote:
>> {
>>   r6 = memb(r1)
>>   r7 = memb(r2)
>> }
>>
>> qemu_ld   t0, r1, MO_UB, mmu_idx
>> qemu_ld   t1, r2, MO_UB, mmu_idx
>> mov       r6, t0
>> mov       r7, t1
>>
> 
> Here is the TCG we generate currently.
>  movi_i32 tmp0,$0x0
>  add_i32 loc2,r1,tmp0
>  qemu_ld_i32 loc3,loc2,sb,0
>  mov_i32 new_value,loc3
>  movi_i32 tmp0,$0x0
>  add_i32 loc2,r2,tmp0
>  qemu_ld_i32 loc3,loc2,sb,0
>  mov_i32 new_value,loc3
>  mov_i32 r6,new_value
>  mov_i32 r7,new_value
> I could work on eliminating the add of zero and the extra copies.  Is TCG
> able to optimize these before emitting the host code?
We can optimize them.

However, you should prefer to use tcg_gen_addi_* over tcg_gen_add_* (etc) when
you know that one operand is constant.  This will optimize away the add zero
immediately as opposed to allocating memory and walking the data structures to
eliminate it later.

Why are you using a local temporary for EA?  That should be dead immediately
after this slot is complete.

What's with two temporaries both named "new_value"?

>> For a conditional load in slot 0, we can load directly into the final
>> destination register and skip the temporary.
> 
> In general, there will be lots of checks we would need to perform before
> concluding that an instruction can write directly into the destination.  For
> example, we have to make sure no other instruction later in the packet reads
> r7.

Which is of course all trivial for slot 0, being last.


r~
Taylor Simpson Feb. 9, 2020, 8:51 p.m. UTC | #15
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Sunday, February 9, 2020 12:18 PM
> To: Taylor Simpson <tsimpson@quicinc.com>; Richard Henderson
> <rth@twiddle.net>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [PATCH] Add support for a helper with 7 arguments
> We can optimize them.
>
> However, you should prefer to use tcg_gen_addi_* over tcg_gen_add_*
> (etc) when
> you know that one operand is constant.  This will optimize away the add zero
> immediately as opposed to allocating memory and walking the data
> structures to
> eliminate it later.

OK, will work on this.

>
> Why are you using a local temporary for EA?  That should be dead
> immediately
> after this slot is complete.

The declaration of EA is added by the generator.  It's declared as a local temp to be conservative in case there is control flow.  I'll work on making the generator smarter.  I think it will work to check if the instruction is predicated and use a temp if it isn't.


>
> What's with two temporaries both named "new_value"?
>

It's actually an array that parallels the GPRs.  I'm just passing the same string to each call to tcg_global_mem_new.  I'll change it to be a unique string for each.

> >> For a conditional load in slot 0, we can load directly into the final
> >> destination register and skip the temporary.
> >
> > In general, there will be lots of checks we would need to perform before
> > concluding that an instruction can write directly into the destination.  For
> > example, we have to make sure no other instruction later in the packet
> reads
> > r7.
>
> Which is of course all trivial for slot 0, being last.

Slot 0 might be last in the encoding, but that doesn't mean it is the last one to execute.  Remember that the packet gets reordered before TCG generation so that .new definitions are before their uses.  So, if the result of the slot 0 instruction is used by a .new reference, it won't be last.
Taylor Simpson Feb. 10, 2020, 4:54 a.m. UTC | #16
> -----Original Message-----
> From: Taylor Simpson
> Sent: Sunday, February 9, 2020 2:51 PM
> To: Richard Henderson <richard.henderson@linaro.org>; Richard Henderson
> <rth@twiddle.net>; qemu-devel@nongnu.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Subject: RE: [PATCH] Add support for a helper with 7 arguments
>
> > However, you should prefer to use tcg_gen_addi_* over tcg_gen_add_*
> > (etc) when
> > you know that one operand is constant.  This will optimize away the add
> zero
> > immediately as opposed to allocating memory and walking the data
> > structures to
> > eliminate it later.
>
> OK, will work on this.
>

One question I have from implementing this is
- Is there a way to pass a constant value to gen_helper_XXX?
It would be great if this would be possible instead of calling tcg_const_tl() and passing the TCGv.

AFAICT, the DEF_HELPER macros don't have an operand type that takes a constant.  I think s32, i32, and int all expect TCGv at the call site.

Thanks,
Taylor
Richard Henderson Feb. 10, 2020, 4:33 p.m. UTC | #17
On 2/9/20 4:54 AM, Taylor Simpson wrote:
> One question I have from implementing this is
> - Is there a way to pass a constant value to gen_helper_XXX?

No.

> It would be great if this would be possible instead of calling
> tcg_const_tl() and passing the TCGv.
You have to use tcg_const_{i32,i64,tl}.


r~
diff mbox series

Patch

diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 236ff40..29c02f8 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -66,6 +66,18 @@  static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
   tcg_gen_callN(HELPER(name), dh_retvar(ret), 6, args);                 \
 }
 
+#define DEF_HELPER_FLAGS_7(name, flags, ret, t1, t2, t3, t4, t5, t6, t7)\
+static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
+    dh_arg_decl(t1, 1),  dh_arg_decl(t2, 2), dh_arg_decl(t3, 3),        \
+    dh_arg_decl(t4, 4), dh_arg_decl(t5, 5), dh_arg_decl(t6, 6),         \
+    dh_arg_decl(t7, 7))                                                 \
+{                                                                       \
+  TCGTemp *args[7] = { dh_arg(t1, 1), dh_arg(t2, 2), dh_arg(t3, 3),     \
+                     dh_arg(t4, 4), dh_arg(t5, 5), dh_arg(t6, 6),       \
+                     dh_arg(t7, 7) };                                   \
+  tcg_gen_callN(HELPER(name), dh_retvar(ret), 7, args);                 \
+}
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
@@ -79,6 +91,7 @@  static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
 #undef DEF_HELPER_FLAGS_6
+#undef DEF_HELPER_FLAGS_7
 #undef GEN_HELPER
 
 #endif /* HELPER_GEN_H */
diff --git a/include/exec/helper-head.h b/include/exec/helper-head.h
index f2519c9..3094c79 100644
--- a/include/exec/helper-head.h
+++ b/include/exec/helper-head.h
@@ -148,6 +148,8 @@ 
     DEF_HELPER_FLAGS_5(name, 0, ret, t1, t2, t3, t4, t5)
 #define DEF_HELPER_6(name, ret, t1, t2, t3, t4, t5, t6) \
     DEF_HELPER_FLAGS_6(name, 0, ret, t1, t2, t3, t4, t5, t6)
+#define DEF_HELPER_7(name, ret, t1, t2, t3, t4, t5, t6, t7) \
+    DEF_HELPER_FLAGS_7(name, 0, ret, t1, t2, t3, t4, t5, t6, t7)
 
 /* MAX_OPC_PARAM_IARGS must be set to n if last entry is DEF_HELPER_FLAGS_n. */
 
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index 1c4ba9b..a0a8d9a 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -30,6 +30,11 @@  dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
 dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
                             dh_ctype(t4), dh_ctype(t5), dh_ctype(t6));
 
+#define DEF_HELPER_FLAGS_7(name, flags, ret, t1, t2, t3, t4, t5, t6, t7) \
+dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
+                            dh_ctype(t4), dh_ctype(t5), dh_ctype(t6), \
+                            dh_ctype(t7));
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
@@ -42,5 +47,6 @@  dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
 #undef DEF_HELPER_FLAGS_6
+#undef DEF_HELPER_FLAGS_7
 
 #endif /* HELPER_PROTO_H */
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 573c2ce..2787050 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -52,6 +52,12 @@ 
     | dh_sizemask(t2, 2) | dh_sizemask(t3, 3) | dh_sizemask(t4, 4) \
     | dh_sizemask(t5, 5) | dh_sizemask(t6, 6) },
 
+#define DEF_HELPER_FLAGS_7(NAME, FLAGS, ret, t1, t2, t3, t4, t5, t6, t7) \
+  { .func = HELPER(NAME), .name = str(NAME), .flags = FLAGS, \
+    .sizemask = dh_sizemask(ret, 0) | dh_sizemask(t1, 1) \
+    | dh_sizemask(t2, 2) | dh_sizemask(t3, 3) | dh_sizemask(t4, 4) \
+    | dh_sizemask(t5, 5) | dh_sizemask(t6, 6) | dh_sizemask(t7, 7) },
+
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
@@ -65,5 +71,6 @@ 
 #undef DEF_HELPER_FLAGS_4
 #undef DEF_HELPER_FLAGS_5
 #undef DEF_HELPER_FLAGS_6
+#undef DEF_HELPER_FLAGS_7
 
 #endif /* HELPER_TCG_H */