[RFC,v2,01/13] Introduce TCGOpcode for memory barrier

Message ID	20160531183928.29406-2-bobby.prani@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: Pranith Kumar <bobby.prani@gmail.com> To: Richard Henderson <rth@twiddle.net>, qemu-devel@nongnu.org (open list:All patches CC here) Date: Tue, 31 May 2016 14:39:16 -0400 Message-Id: <20160531183928.29406-2-bobby.prani@gmail.com> In-Reply-To: <20160531183928.29406-1-bobby.prani@gmail.com> References: <20160531183928.29406-1-bobby.prani@gmail.com> Subject: [Qemu-devel] [RFC v2 PATCH 01/13] Introduce TCGOpcode for memory barrier Precedence: list Cc: serge.fdrv@linaro.org, alex.bennee@linaro.org, qemu-devel@nongnu.org Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Pranith Kumar May 31, 2016, 6:39 p.m. UTC

This commit introduces the TCGOpcode for memory barrier instruction.

This opcode takes an argument which is the type of memory barrier
which should be generated.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/README    | 17 +++++++++++++++++
 tcg/tcg-op.c  |  6 ++++++
 tcg/tcg-op.h  |  2 ++
 tcg/tcg-opc.h |  2 ++
 tcg/tcg.h     |  8 ++++++++
 5 files changed, 35 insertions(+)

Richard Henderson May 31, 2016, 8:24 p.m. UTC | #1

On 05/31/2016 11:39 AM, Pranith Kumar wrote:
> +********* Memory Barrier support
> +
> +* mb <$arg>

Document what $arg should be.

> +Generate a target memory barrier instruction to ensure memory ordering as being
> +enforced by a corresponding guest memory barrier instruction. The ordering
> +enforced by the backend may be stricter than the ordering required by the guest.
> +It cannot be weaker. This opcode takes an optional constant argument if required
> +to generate the appropriate barrier instruction. The backend should take care to

The argument is *not* optional.

> +void tcg_gen_mb(TCGArg a)
> +{
> +    /* ??? Enable only when MTTCG is enabled.  */
> +    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);

Pass A to tcg_gen_op1, not 0.

> +/* TCGOpmb args */
> +#define TCG_MB_FULL             ((TCGArg)(0))
> +#define TCG_MB_READ             ((TCGArg)(1))
> +#define TCG_MB_WRITE            ((TCGArg)(2))
> +#define TCG_MB_ACQUIRE          ((TCGArg)(3))
> +#define TCG_MB_RELEASE          ((TCGArg)(4))

This is, IMO, confused.  Either we should use the C++11 barrier types, or the 
Linux barrier types, but not both.


r~

Pranith Kumar June 1, 2016, 6:43 p.m. UTC | #2

Hi Richard,

Thanks for the review. I will make the changes you pointed out. One point below:

On Tue, May 31, 2016 at 4:24 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/31/2016 11:39 AM, Pranith Kumar wrote:

>> +/* TCGOpmb args */
>> +#define TCG_MB_FULL             ((TCGArg)(0))
>> +#define TCG_MB_READ             ((TCGArg)(1))
>> +#define TCG_MB_WRITE            ((TCGArg)(2))
>> +#define TCG_MB_ACQUIRE          ((TCGArg)(3))
>> +#define TCG_MB_RELEASE          ((TCGArg)(4))
>
>
> This is, IMO, confused.  Either we should use the C++11 barrier types, or
> the Linux barrier types, but not both.

This part of the design is still being fleshed out. The above listing
is all the kinds of barriers we can encounter during translation. May
be I am missing something, but C++11 and Linux barrier types have
higher level semantics which is difficult to infer at the instruction
level.

All we want to do here is map a barrier instruction from guest to a
barrier instruction on hist. This mapping is 1:1 if the host has
barrier instructions with matching semantics. Otherwise we generate a
multi-op instruction sequence. Some examples are: load acquire(ldar)
on ARM64 guest will map to dmb+load on ARMv7 target, store
fence(sfence) on x86 guest will map to dmb on ARMv7 target.

So we need to support all kinds of barrier types we expect the guest
to have, then map them to the host.

Thanks,

Richard Henderson June 1, 2016, 9:35 p.m. UTC | #3

On 06/01/2016 11:43 AM, Pranith Kumar wrote:
>> This is, IMO, confused.  Either we should use the C++11 barrier types, or
>> the Linux barrier types, but not both.
>
> This part of the design is still being fleshed out. The above listing
> is all the kinds of barriers we can encounter during translation. May
> be I am missing something, but C++11 and Linux barrier types have
> higher level semantics which is difficult to infer at the instruction
> level.

Fair enough.

> All we want to do here is map a barrier instruction from guest to a
> barrier instruction on hist. This mapping is 1:1 if the host has
> barrier instructions with matching semantics. Otherwise we generate a
> multi-op instruction sequence. Some examples are: load acquire(ldar)
> on ARM64 guest will map to dmb+load on ARMv7 target, store
> fence(sfence) on x86 guest will map to dmb on ARMv7 target.

Perhaps we should model this like the Sparc membar instruction, with individual 
bits for all combinations of Load x Store.  One can then describe exactly what 
the target semantics are for each barrier.


r~

Sergey Fedorov June 2, 2016, 4:18 p.m. UTC | #4

On 02/06/16 00:35, Richard Henderson wrote:
> On 06/01/2016 11:43 AM, Pranith Kumar wrote:
>> All we want to do here is map a barrier instruction from guest to a
>> barrier instruction on hist. This mapping is 1:1 if the host has
>> barrier instructions with matching semantics. Otherwise we generate a
>> multi-op instruction sequence. Some examples are: load acquire(ldar)
>> on ARM64 guest will map to dmb+load on ARMv7 target, store
>> fence(sfence) on x86 guest will map to dmb on ARMv7 target.
>
> Perhaps we should model this like the Sparc membar instruction, with
> individual bits for all combinations of Load x Store.  One can then
> describe exactly what the target semantics are for each barrier.

Seconded. Given that we don't support Alpha host we can ignore its
unique "Dependent loads reordered" feature. I suppose we're not going to
add Alpha host support :)

Kind regards,
Sergey

Sergey Fedorov June 2, 2016, 4:30 p.m. UTC | #5

On 31/05/16 21:39, Pranith Kumar wrote:
> diff --git a/tcg/README b/tcg/README
> index f4a8ac1..cfe79d7 100644
> --- a/tcg/README
> +++ b/tcg/README
> @@ -402,6 +402,23 @@ double-word product T0.  The later is returned in two single-word outputs.
>  
>  Similar to mulu2, except the two inputs T1 and T2 are signed.
>  
> +********* Memory Barrier support
> +
> +* mb <$arg>
> +
> +Generate a target memory barrier instruction to ensure memory ordering as being
> +enforced by a corresponding guest memory barrier instruction. The ordering
> +enforced by the backend may be stricter than the ordering required by the guest.
> +It cannot be weaker. This opcode takes an optional constant argument if required
> +to generate the appropriate barrier instruction. The backend should take care to
> +emit the target barrier instruction only when necessary i.e., for SMP guests and
> +when MTTCG is enabled.
> +
> +The guest translators should generate this opcode for all guest instructions
> +which have ordering side effects.

I think we need to extend TCG load/store instruction attributes to
provide information about guest ordering requirements and leave this TCG
operation only for explicit barrier instruction translation.

> +
> +Please see docs/atomics.txt for more information on memory barriers.
> +
>  ********* 64-bit guest on 32-bit host support
>  
>  The following opcodes are internal to TCG.  Thus they are to be implemented by
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index f554b86..a6f01a7 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -143,6 +143,12 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
>      tcg_emit_op(ctx, opc, pi);
>  }
>  
> +void tcg_gen_mb(TCGArg a)
> +{
> +    /* ??? Enable only when MTTCG is enabled.  */
> +    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);

Yes, this could be a right place to check for MTTCG enabled and number
of CPUs emulated. If we do it here, then we should adjust the
documentation stating that the backend should take care of it.

Kind regards,
Sergey

Pranith Kumar June 2, 2016, 6:42 p.m. UTC | #6

Sergey Fedorov writes:

> On 31/05/16 21:39, Pranith Kumar wrote:
>> diff --git a/tcg/README b/tcg/README
>> index f4a8ac1..cfe79d7 100644
>> --- a/tcg/README
>> +++ b/tcg/README
>> @@ -402,6 +402,23 @@ double-word product T0.  The later is returned in two single-word outputs.
>>  
>>  Similar to mulu2, except the two inputs T1 and T2 are signed.
>>  
>> +********* Memory Barrier support
>> +
>> +* mb <$arg>
>> +
>> +Generate a target memory barrier instruction to ensure memory ordering as being
>> +enforced by a corresponding guest memory barrier instruction. The ordering
>> +enforced by the backend may be stricter than the ordering required by the guest.
>> +It cannot be weaker. This opcode takes an optional constant argument if required
>> +to generate the appropriate barrier instruction. The backend should take care to
>> +emit the target barrier instruction only when necessary i.e., for SMP guests and
>> +when MTTCG is enabled.
>> +
>> +The guest translators should generate this opcode for all guest instructions
>> +which have ordering side effects.
>
> I think we need to extend TCG load/store instruction attributes to
> provide information about guest ordering requirements and leave this TCG
> operation only for explicit barrier instruction translation.
>

Yes, I am working on adding flag argument to the TCG MemOp which indicates
this.

>> +
>> +Please see docs/atomics.txt for more information on memory barriers.
>> +
>>  ********* 64-bit guest on 32-bit host support
>>  
>>  The following opcodes are internal to TCG.  Thus they are to be implemented by
>> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
>> index f554b86..a6f01a7 100644
>> --- a/tcg/tcg-op.c
>> +++ b/tcg/tcg-op.c
>> @@ -143,6 +143,12 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
>>      tcg_emit_op(ctx, opc, pi);
>>  }
>>  
>> +void tcg_gen_mb(TCGArg a)
>> +{
>> +    /* ??? Enable only when MTTCG is enabled.  */
>> +    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);
>
> Yes, this could be a right place to check for MTTCG enabled and number
> of CPUs emulated. If we do it here, then we should adjust the
> documentation stating that the backend should take care of it.
>

I added the check in Patch 13. I will update the documentation to reflect this.

Thanks,

Richard Henderson June 2, 2016, 8:36 p.m. UTC | #7

On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
> I think we need to extend TCG load/store instruction attributes to
> provide information about guest ordering requirements and leave this TCG
> operation only for explicit barrier instruction translation.

I do not agree.  I think separate barriers are much cleaner and easier to 
manage and reason with.


r~

Richard Henderson June 2, 2016, 8:36 p.m. UTC | #8

On 06/02/2016 11:42 AM, Pranith Kumar wrote:
> Yes, I am working on adding flag argument to the TCG MemOp which indicates
> this.

Please don't.  I don't think this is the right way to go.


r~

Sergey Fedorov June 2, 2016, 8:38 p.m. UTC | #9

On 02/06/16 23:36, Richard Henderson wrote:
> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>> I think we need to extend TCG load/store instruction attributes to
>> provide information about guest ordering requirements and leave this TCG
>> operation only for explicit barrier instruction translation.
>
> I do not agree.  I think separate barriers are much cleaner and easier
> to manage and reason with.
>

How are we going to emulate strongly-ordered guests on weakly-ordered
hosts then? I think if every load/store operation must specify which
ordering it implies then this task would be quite simple.

Kind regards,
Sergey

Richard Henderson June 2, 2016, 9:18 p.m. UTC | #10

On 06/02/2016 01:38 PM, Sergey Fedorov wrote:
> On 02/06/16 23:36, Richard Henderson wrote:
>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>>> I think we need to extend TCG load/store instruction attributes to
>>> provide information about guest ordering requirements and leave this TCG
>>> operation only for explicit barrier instruction translation.
>>
>> I do not agree.  I think separate barriers are much cleaner and easier
>> to manage and reason with.
>>
>
> How are we going to emulate strongly-ordered guests on weakly-ordered
> hosts then? I think if every load/store operation must specify which
> ordering it implies then this task would be quite simple.

Hum.  That does seem helpful-ish.  But I'm not certain how helpful it is to 
complicate the helper functions even further.

What if we have tcg_canonicalize_memop (or some such) split off the barriers 
into separate opcodes.  E.g.

MO_BAR_LD_B = 32	// prevent earlier loads from crossing current op
MO_BAR_ST_B = 64	// prevent earlier stores from crossing current op
MO_BAR_LD_A = 128	// prevent later loads from crossing current op
MO_BAR_ST_A = 256	// prevent later stores from crossing current op
MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A

// Match Sparc MEMBAR as the most flexible host.
TCG_BAR_LD_LD = 1	// #LoadLoad barrier
TCG_BAR_ST_LD = 2	// #StoreLoad barrier
TCG_BAR_LD_ST = 4	// #LoadStore barrier
TCG_BAR_ST_ST = 8	// #StoreStore barrier
TCG_BAR_SYNC  = 64	// SEQ_CST barrier

where

   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)

emits

   mb		TCG_BAR_LD_LD
   qemu_ld_i32	x, y, i, m
   mb		TCG_BAR_LD_ST

We can then add an optimization pass which folds barriers with no memory 
operations in between, so that duplicates are eliminated.


r~

Sergey Fedorov June 2, 2016, 9:37 p.m. UTC | #11

On 03/06/16 00:18, Richard Henderson wrote:
> On 06/02/2016 01:38 PM, Sergey Fedorov wrote:
>> On 02/06/16 23:36, Richard Henderson wrote:
>>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>>>> I think we need to extend TCG load/store instruction attributes to
>>>> provide information about guest ordering requirements and leave
>>>> this TCG
>>>> operation only for explicit barrier instruction translation.
>>>
>>> I do not agree.  I think separate barriers are much cleaner and easier
>>> to manage and reason with.
>>>
>>
>> How are we going to emulate strongly-ordered guests on weakly-ordered
>> hosts then? I think if every load/store operation must specify which
>> ordering it implies then this task would be quite simple.
>
> Hum.  That does seem helpful-ish.  But I'm not certain how helpful it
> is to complicate the helper functions even further.
>
> What if we have tcg_canonicalize_memop (or some such) split off the
> barriers into separate opcodes.  E.g.
>
> MO_BAR_LD_B = 32    // prevent earlier loads from crossing current op
> MO_BAR_ST_B = 64    // prevent earlier stores from crossing current op
> MO_BAR_LD_A = 128    // prevent later loads from crossing current op
> MO_BAR_ST_A = 256    // prevent later stores from crossing current op
> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>
> // Match Sparc MEMBAR as the most flexible host.
> TCG_BAR_LD_LD = 1    // #LoadLoad barrier
> TCG_BAR_ST_LD = 2    // #StoreLoad barrier
> TCG_BAR_LD_ST = 4    // #LoadStore barrier
> TCG_BAR_ST_ST = 8    // #StoreStore barrier
> TCG_BAR_SYNC  = 64    // SEQ_CST barrier
>
> where
>
>   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)
>
> emits
>
>   mb        TCG_BAR_LD_LD
>   qemu_ld_i32    x, y, i, m
>   mb        TCG_BAR_LD_ST
>
> We can then add an optimization pass which folds barriers with no
> memory operations in between, so that duplicates are eliminated.

It would give us three TCG operations for each memory operation instead
of one. But then we might like to combine these barrier operations back
with memory operations in each backend. If we propagate memory ordering
semantics up to the backend, it can decide itself what instructions are
best to generate.

So I would just focus on translating only explicit memory barrier
operations for now.

Kind regards,
Sergey

Richard Henderson June 3, 2016, 1:08 a.m. UTC | #12

On 06/02/2016 02:37 PM, Sergey Fedorov wrote:
> On 03/06/16 00:18, Richard Henderson wrote:
>> On 06/02/2016 01:38 PM, Sergey Fedorov wrote:
>>> On 02/06/16 23:36, Richard Henderson wrote:
>>>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>>>>> I think we need to extend TCG load/store instruction attributes to
>>>>> provide information about guest ordering requirements and leave
>>>>> this TCG
>>>>> operation only for explicit barrier instruction translation.
>>>>
>>>> I do not agree.  I think separate barriers are much cleaner and easier
>>>> to manage and reason with.
>>>>
>>>
>>> How are we going to emulate strongly-ordered guests on weakly-ordered
>>> hosts then? I think if every load/store operation must specify which
>>> ordering it implies then this task would be quite simple.
>>
>> Hum.  That does seem helpful-ish.  But I'm not certain how helpful it
>> is to complicate the helper functions even further.
>>
>> What if we have tcg_canonicalize_memop (or some such) split off the
>> barriers into separate opcodes.  E.g.
>>
>> MO_BAR_LD_B = 32    // prevent earlier loads from crossing current op
>> MO_BAR_ST_B = 64    // prevent earlier stores from crossing current op
>> MO_BAR_LD_A = 128    // prevent later loads from crossing current op
>> MO_BAR_ST_A = 256    // prevent later stores from crossing current op
>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>
>> // Match Sparc MEMBAR as the most flexible host.
>> TCG_BAR_LD_LD = 1    // #LoadLoad barrier
>> TCG_BAR_ST_LD = 2    // #StoreLoad barrier
>> TCG_BAR_LD_ST = 4    // #LoadStore barrier
>> TCG_BAR_ST_ST = 8    // #StoreStore barrier
>> TCG_BAR_SYNC  = 64    // SEQ_CST barrier
>>
>> where
>>
>>   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)
>>
>> emits
>>
>>   mb        TCG_BAR_LD_LD
>>   qemu_ld_i32    x, y, i, m
>>   mb        TCG_BAR_LD_ST
>>
>> We can then add an optimization pass which folds barriers with no
>> memory operations in between, so that duplicates are eliminated.
>
> It would give us three TCG operations for each memory operation instead
> of one. But then we might like to combine these barrier operations back
> with memory operations in each backend. If we propagate memory ordering
> semantics up to the backend, it can decide itself what instructions are
> best to generate.

A strongly ordered target would generally only set BEFORE bits or AFTER bits, 
but not both (and I suggest we canonicalize on AFTER for all such targets). 
Thus a strongly ordered target would produce only 2 opcodes per memory op.

I supplied both to make it easier to handle a weakly ordered target with 
acquire/release bits.

I would *not* combine the barrier operations back with memory operations in the 
backend.  Only armv8 and ia64 can do that, and given the optimization level at 
which we generate code, I doubt it would really make much difference above 
separate barriers.

> So I would just focus on translating only explicit memory barrier
> operations for now.

Then why did you bring it up?


r~

Sergey Fedorov June 3, 2016, 3:16 p.m. UTC | #13

On 03/06/16 04:08, Richard Henderson wrote:
> On 06/02/2016 02:37 PM, Sergey Fedorov wrote:
>> On 03/06/16 00:18, Richard Henderson wrote:
>>> On 06/02/2016 01:38 PM, Sergey Fedorov wrote:
>>>> On 02/06/16 23:36, Richard Henderson wrote:
>>>>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>>>>>> I think we need to extend TCG load/store instruction attributes to
>>>>>> provide information about guest ordering requirements and leave
>>>>>> this TCG
>>>>>> operation only for explicit barrier instruction translation.
>>>>>
>>>>> I do not agree.  I think separate barriers are much cleaner and
>>>>> easier
>>>>> to manage and reason with.
>>>>>
>>>>
>>>> How are we going to emulate strongly-ordered guests on weakly-ordered
>>>> hosts then? I think if every load/store operation must specify which
>>>> ordering it implies then this task would be quite simple.
>>>
>>> Hum.  That does seem helpful-ish.  But I'm not certain how helpful it
>>> is to complicate the helper functions even further.
>>>
>>> What if we have tcg_canonicalize_memop (or some such) split off the
>>> barriers into separate opcodes.  E.g.
>>>
>>> MO_BAR_LD_B = 32    // prevent earlier loads from crossing current op
>>> MO_BAR_ST_B = 64    // prevent earlier stores from crossing current op
>>> MO_BAR_LD_A = 128    // prevent later loads from crossing current op
>>> MO_BAR_ST_A = 256    // prevent later stores from crossing current op
>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>
>>> // Match Sparc MEMBAR as the most flexible host.
>>> TCG_BAR_LD_LD = 1    // #LoadLoad barrier
>>> TCG_BAR_ST_LD = 2    // #StoreLoad barrier
>>> TCG_BAR_LD_ST = 4    // #LoadStore barrier
>>> TCG_BAR_ST_ST = 8    // #StoreStore barrier
>>> TCG_BAR_SYNC  = 64    // SEQ_CST barrier
>>>
>>> where
>>>
>>>   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)
>>>
>>> emits
>>>
>>>   mb        TCG_BAR_LD_LD
>>>   qemu_ld_i32    x, y, i, m
>>>   mb        TCG_BAR_LD_ST
>>>
>>> We can then add an optimization pass which folds barriers with no
>>> memory operations in between, so that duplicates are eliminated.
>>
>> It would give us three TCG operations for each memory operation instead
>> of one. But then we might like to combine these barrier operations back
>> with memory operations in each backend. If we propagate memory ordering
>> semantics up to the backend, it can decide itself what instructions are
>> best to generate.
>
> A strongly ordered target would generally only set BEFORE bits or
> AFTER bits, but not both (and I suggest we canonicalize on AFTER for
> all such targets). Thus a strongly ordered target would produce only 2
> opcodes per memory op.
>
> I supplied both to make it easier to handle a weakly ordered target
> with acquire/release bits.
>
> I would *not* combine the barrier operations back with memory
> operations in the backend.  Only armv8 and ia64 can do that, and given
> the optimization level at which we generate code, I doubt it would
> really make much difference above separate barriers.

So your suggestion is to generate different TCG opcode sequences
depending on the underlying target architecture? And you are against
forwarding this task further, to the backend code?

>
>> So I would just focus on translating only explicit memory barrier
>> operations for now.
>
> Then why did you bring it up?

I'm not sure I got the question right. I suggested to avoid using this
TCG operation to emulate guest's memory ordering requirements for
loads/stores that can be supplied with memory ordering requirement
information which each backend can decide how to translate together with
the load/store (possible just ignore it as it is the case for
strongly-ordered hosts). I think we just need to translate explicit
memory barrier instructions.

For example, emulating ARM guest on x86 host requires ARM dmb
instruction to be translated to x86 mfence instruction to prevent
store-after-load reordering. At the same time, we don't have to generate
anything special for loads/stores since x86 is a strongly-ordered
architecture.

Kind regards,
Sergey

Richard Henderson June 3, 2016, 3:45 p.m. UTC | #14

On 06/03/2016 08:16 AM, Sergey Fedorov wrote:
> On 03/06/16 04:08, Richard Henderson wrote:
> So your suggestion is to generate different TCG opcode sequences
> depending on the underlying target architecture? And you are against
> forwarding this task further, to the backend code?

Yes, I would prefer to have, in the opcode stream, a separate opcode for 
barriers.  This aids both the common case, where most of our hosts require 
separate barriers, as well as simplicity.

I am not opposed to letting the translators describe the memory model with 
barrier data along with memory operations, but I'd really prefer that those be 
split apart during initial opcode generation.

>>> So I would just focus on translating only explicit memory barrier
>>> operations for now.
>>
>> Then why did you bring it up?
>
> I'm not sure I got the question right. I suggested to avoid using this
> TCG operation to emulate guest's memory ordering requirements for
> loads/stores that can be supplied with memory ordering requirement
> information which each backend can decide how to translate together with
> the load/store (possible just ignore it as it is the case for
> strongly-ordered hosts). I think we just need to translate explicit
> memory barrier instructions.
>
> For example, emulating ARM guest on x86 host requires ARM dmb
> instruction to be translated to x86 mfence instruction to prevent
> store-after-load reordering. At the same time, we don't have to generate
> anything special for loads/stores since x86 is a strongly-ordered
> architecture.

Ah, so you'd prefer that we not think about optimizing barriers at the moment. 
Fine, but I'd prefer to think about *how* they might be optimized now, so that 
we *can* later.

r~

Sergey Fedorov June 3, 2016, 4:06 p.m. UTC | #15

On 03/06/16 18:45, Richard Henderson wrote:
> On 06/03/2016 08:16 AM, Sergey Fedorov wrote:
>> On 03/06/16 04:08, Richard Henderson wrote:
>> So your suggestion is to generate different TCG opcode sequences
>> depending on the underlying target architecture? And you are against
>> forwarding this task further, to the backend code?
>
> Yes, I would prefer to have, in the opcode stream, a separate opcode
> for barriers.  This aids both the common case, where most of our hosts
> require separate barriers, as well as simplicity.
>
> I am not opposed to letting the translators describe the memory model
> with barrier data along with memory operations, but I'd really prefer
> that those be split apart during initial opcode generation.
>
>>>> So I would just focus on translating only explicit memory barrier
>>>> operations for now.
>>>
>>> Then why did you bring it up?
>>
>> I'm not sure I got the question right. I suggested to avoid using this
>> TCG operation to emulate guest's memory ordering requirements for
>> loads/stores that can be supplied with memory ordering requirement
>> information which each backend can decide how to translate together with
>> the load/store (possible just ignore it as it is the case for
>> strongly-ordered hosts). I think we just need to translate explicit
>> memory barrier instructions.
>>
>> For example, emulating ARM guest on x86 host requires ARM dmb
>> instruction to be translated to x86 mfence instruction to prevent
>> store-after-load reordering. At the same time, we don't have to generate
>> anything special for loads/stores since x86 is a strongly-ordered
>> architecture.
>
> Ah, so you'd prefer that we not think about optimizing barriers at the
> moment. Fine, but I'd prefer to think about *how* they might be
> optimized now, so that we *can* later.

Not exactly. We need to have a TCG operation for various types of
explicit barriers in order to translate guest explicit barrier
instructions. I like your idea to follow Sparc's way to specify membar
instruction attributes which can be used by the backed for generating
optimal instructions. I think we also need to associate memory ordering
attributes with load/store TCG operations. I'm not sure how would be
best to handle load/store implicit memory ordering requirements, but it
is probably out of the scope of this series. I'd propagate this
attribute up to the backend and let it decide what kind of instructions
to generate. I'd prefer to see only explicit barrier operations in the
TCG opcode stream.

Kind regards,
Sergey

Pranith Kumar June 3, 2016, 6:27 p.m. UTC | #16

On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>
> Hum.  That does seem helpful-ish.  But I'm not certain how helpful it is to
> complicate the helper functions even further.
>
> What if we have tcg_canonicalize_memop (or some such) split off the barriers
> into separate opcodes.  E.g.
>
> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>
> // Match Sparc MEMBAR as the most flexible host.
> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
> TCG_BAR_LD_ST = 4       // #LoadStore barrier
> TCG_BAR_ST_ST = 8       // #StoreStore barrier
> TCG_BAR_SYNC  = 64      // SEQ_CST barrier

I really like this format. I would also like to add to the frontend:

MO_BAR_ACQUIRE
MO_BAR_RELEASE

and the following to the backend:

TCG_BAR_ACQUIRE
TCG_BAR_RELEASE

since these are one-way barriers and the previous barrier types do not
cover them.

>
> where
>
>   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)
>
> emits
>
>   mb            TCG_BAR_LD_LD
>   qemu_ld_i32   x, y, i, m
>   mb            TCG_BAR_LD_ST
>
> We can then add an optimization pass which folds barriers with no memory
> operations in between, so that duplicates are eliminated.
>

Yes, folding/eliding these barriers in an optimization pass sounds
like a good idea.

Thanks,

Pranith Kumar June 3, 2016, 6:30 p.m. UTC | #17

On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/02/2016 02:37 PM, Sergey Fedorov wrote:
>>
>>
>> It would give us three TCG operations for each memory operation instead
>> of one. But then we might like to combine these barrier operations back
>> with memory operations in each backend. If we propagate memory ordering
>> semantics up to the backend, it can decide itself what instructions are
>> best to generate.
>
>
> A strongly ordered target would generally only set BEFORE bits or AFTER
> bits, but not both (and I suggest we canonicalize on AFTER for all such
> targets). Thus a strongly ordered target would produce only 2 opcodes per
> memory op.
>
> I supplied both to make it easier to handle a weakly ordered target with
> acquire/release bits.
>
> I would *not* combine the barrier operations back with memory operations in
> the backend.  Only armv8 and ia64 can do that, and given the optimization
> level at which we generate code, I doubt it would really make much
> difference above separate barriers.
>

On armv8, using load_acquire/store_release instructions makes a
significant difference in performance when compared to plain
dmb+memory instruction sequence. So I would really like to keep the
option of generating acq/rel instructions(by combining barrier and
memory or some other way) open.

Thanks,

Sergey Fedorov June 3, 2016, 7:49 p.m. UTC | #18

On 03/06/16 21:30, Pranith Kumar wrote:
> On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote:
>> On 06/02/2016 02:37 PM, Sergey Fedorov wrote:
>>>
>>> It would give us three TCG operations for each memory operation instead
>>> of one. But then we might like to combine these barrier operations back
>>> with memory operations in each backend. If we propagate memory ordering
>>> semantics up to the backend, it can decide itself what instructions are
>>> best to generate.
>>
>> A strongly ordered target would generally only set BEFORE bits or AFTER
>> bits, but not both (and I suggest we canonicalize on AFTER for all such
>> targets). Thus a strongly ordered target would produce only 2 opcodes per
>> memory op.
>>
>> I supplied both to make it easier to handle a weakly ordered target with
>> acquire/release bits.
>>
>> I would *not* combine the barrier operations back with memory operations in
>> the backend.  Only armv8 and ia64 can do that, and given the optimization
>> level at which we generate code, I doubt it would really make much
>> difference above separate barriers.
>>
> On armv8, using load_acquire/store_release instructions makes a
> significant difference in performance when compared to plain
> dmb+memory instruction sequence. So I would really like to keep the
> option of generating acq/rel instructions(by combining barrier and
> memory or some other way) open.

I'm not so sure about acq/rel flags. Is there any architecture which has
explicit acq/rel barriers? I suppose acq/rel memory access instructions
are always load-link and store-conditional and thus rely on exclusive
memory monitor to support that "conditional" behaviour. To emulate this
behaviour we need something more special like "Slow-path for atomic
instruction translation" [1].

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/407419

Kind regards,
Sergey

Sergey Fedorov June 3, 2016, 7:52 p.m. UTC | #19

On 03/06/16 21:27, Pranith Kumar wrote:
> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>> >
>> > Hum.  That does seem helpful-ish.  But I'm not certain how helpful it is to
>> > complicate the helper functions even further.
>> >
>> > What if we have tcg_canonicalize_memop (or some such) split off the barriers
>> > into separate opcodes.  E.g.
>> >
>> > MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>> > MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>> > MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>> > MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>> > MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>> > MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>> > MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>> >
>> > // Match Sparc MEMBAR as the most flexible host.
>> > TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>> > TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>> > TCG_BAR_LD_ST = 4       // #LoadStore barrier
>> > TCG_BAR_ST_ST = 8       // #StoreStore barrier
>> > TCG_BAR_SYNC  = 64      // SEQ_CST barrier
> I really like this format. I would also like to add to the frontend:
>
> MO_BAR_ACQUIRE
> MO_BAR_RELEASE
>
> and the following to the backend:
>
> TCG_BAR_ACQUIRE
> TCG_BAR_RELEASE
>
> since these are one-way barriers and the previous barrier types do not
> cover them.
>

Well, my last comment about acq/rel barriers should really go here
[Message-Id: 5751DF2D.5040709@gmail.com]

Regards,
Sergey

Peter Maydell June 3, 2016, 8:43 p.m. UTC | #20

On 3 June 2016 at 20:49, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> I'm not so sure about acq/rel flags. Is there any architecture which has
> explicit acq/rel barriers? I suppose acq/rel memory access instructions
> are always load-link and store-conditional and thus rely on exclusive
> memory monitor to support that "conditional" behaviour.

This doesn't sound right (at least not for ARM). You can have
load-acquire and store-release insns which aren't exclusives,
you can have exclusives which aren't acq/rel, and you can
have accesses which are both exclusives and acq/rel (and
just to complete the set, obviously there are accesses which
are neither). The exclusive semantics require the monitor, but
acq/rel is just an ordering constraint (sort of like an implicit
barrier, but not quite).

thanks
-- PMM

Sergey Fedorov June 3, 2016, 9:33 p.m. UTC | #21

On 03/06/16 23:43, Peter Maydell wrote:
> On 3 June 2016 at 20:49, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> I'm not so sure about acq/rel flags. Is there any architecture which has
>> explicit acq/rel barriers? I suppose acq/rel memory access instructions
>> are always load-link and store-conditional and thus rely on exclusive
>> memory monitor to support that "conditional" behaviour.
> This doesn't sound right (at least not for ARM). You can have
> load-acquire and store-release insns which aren't exclusives,
> you can have exclusives which aren't acq/rel, and you can
> have accesses which are both exclusives and acq/rel (and
> just to complete the set, obviously there are accesses which
> are neither). The exclusive semantics require the monitor, but
> acq/rel is just an ordering constraint (sort of like an implicit
> barrier, but not quite).

Thanks for clarifying this, Peter.

Regards,
Sergey

Sergey Fedorov June 6, 2016, 3:44 p.m. UTC | #22

On 03/06/16 21:27, Pranith Kumar wrote:
> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>> Hum.  That does seem helpful-ish.  But I'm not certain how helpful it is to
>> complicate the helper functions even further.
>>
>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>> into separate opcodes.  E.g.
>>
>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>
>> // Match Sparc MEMBAR as the most flexible host.
>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
> I really like this format. I would also like to add to the frontend:
>
> MO_BAR_ACQUIRE
> MO_BAR_RELEASE
>
> and the following to the backend:
>
> TCG_BAR_ACQUIRE
> TCG_BAR_RELEASE
>
> since these are one-way barriers and the previous barrier types do not
> cover them.

Actually, the acquire barrier is a combined load-load + load-store
barrier; and the release barrier is a combo of load-store + store-store
barriers.

Kind regards,
Sergey

>
>> where
>>
>>   tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)
>>
>> emits
>>
>>   mb            TCG_BAR_LD_LD
>>   qemu_ld_i32   x, y, i, m
>>   mb            TCG_BAR_LD_ST
>>
>> We can then add an optimization pass which folds barriers with no memory
>> operations in between, so that duplicates are eliminated.
>>
> Yes, folding/eliding these barriers in an optimization pass sounds
> like a good idea.
>
> Thanks,

Pranith Kumar June 6, 2016, 3:47 p.m. UTC | #23

On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 03/06/16 21:27, Pranith Kumar wrote:
>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>
>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>>> into separate opcodes.  E.g.
>>>
>>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>
>>> // Match Sparc MEMBAR as the most flexible host.
>>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
>> I really like this format. I would also like to add to the frontend:
>>
>
> Actually, the acquire barrier is a combined load-load + load-store
> barrier; and the release barrier is a combo of load-store + store-store
> barriers.
>

All the above are two-way barriers. Acquire/Release are one-way
barriers. So we cannot combine the above to get acquire/release
semantics without being conservative.

Thanks,

Sergey Fedorov June 6, 2016, 3:49 p.m. UTC | #24

On 06/06/16 18:47, Pranith Kumar wrote:
> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 03/06/16 21:27, Pranith Kumar wrote:
>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>>>> into separate opcodes.  E.g.
>>>>
>>>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>>>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>>>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>>>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>>
>>>> // Match Sparc MEMBAR as the most flexible host.
>>>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>>>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>>>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>>>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>>>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
>>> I really like this format. I would also like to add to the frontend:
>>>
>> Actually, the acquire barrier is a combined load-load + load-store
>> barrier; and the release barrier is a combo of load-store + store-store
>> barriers.
>>
> All the above are two-way barriers. Acquire/Release are one-way
> barriers. So we cannot combine the above to get acquire/release
> semantics without being conservative.

Do you mean *barriers* or *memory access* operations implying memory
ordering?

Regards,
Sergey

Pranith Kumar June 6, 2016, 3:58 p.m. UTC | #25

On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 06/06/16 18:47, Pranith Kumar wrote:
>> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>> On 03/06/16 21:27, Pranith Kumar wrote:
>>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>>>>> into separate opcodes.  E.g.
>>>>>
>>>>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>>>>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>>>>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>>>>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>>>
>>>>> // Match Sparc MEMBAR as the most flexible host.
>>>>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>>>>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>>>>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>>>>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>>>>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
>>>> I really like this format. I would also like to add to the frontend:
>>>>
>>> Actually, the acquire barrier is a combined load-load + load-store
>>> barrier; and the release barrier is a combo of load-store + store-store
>>> barriers.
>>>
>> All the above are two-way barriers. Acquire/Release are one-way
>> barriers. So we cannot combine the above to get acquire/release
>> semantics without being conservative.
>
> Do you mean *barriers* or *memory access* operations implying memory
> ordering?

I meant the latter. I know no arch which has acquire/release barriers.
Sorry for the confusion.

Thanks,

Sergey Fedorov June 6, 2016, 4:14 p.m. UTC | #26

On 06/06/16 18:58, Pranith Kumar wrote:
> On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 06/06/16 18:47, Pranith Kumar wrote:
>>> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>>> On 03/06/16 21:27, Pranith Kumar wrote:
>>>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>>>>>> into separate opcodes.  E.g.
>>>>>>
>>>>>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>>>>>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>>>>>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>>>>>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>>>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>>>>
>>>>>> // Match Sparc MEMBAR as the most flexible host.
>>>>>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>>>>>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>>>>>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>>>>>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>>>>>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
>>>>> I really like this format. I would also like to add to the frontend:
>>>>>
>>>> Actually, the acquire barrier is a combined load-load + load-store
>>>> barrier; and the release barrier is a combo of load-store + store-store
>>>> barriers.
>>>>
>>> All the above are two-way barriers. Acquire/Release are one-way
>>> barriers. So we cannot combine the above to get acquire/release
>>> semantics without being conservative.
>> Do you mean *barriers* or *memory access* operations implying memory
>> ordering?
> I meant the latter. I know no arch which has acquire/release barriers.
> Sorry for the confusion.

So I am. By the way, what's the difference between sequential consistent
*barrier* and a combination of all the other barriers?

Kind regards,
Sergey

Alex Bennée June 6, 2016, 4:19 p.m. UTC | #27

Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 03/06/16 21:30, Pranith Kumar wrote:
>> On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote:
>>> On 06/02/2016 02:37 PM, Sergey Fedorov wrote:
>>>>
>>>> It would give us three TCG operations for each memory operation instead
>>>> of one. But then we might like to combine these barrier operations back
>>>> with memory operations in each backend. If we propagate memory ordering
>>>> semantics up to the backend, it can decide itself what instructions are
>>>> best to generate.
>>>
>>> A strongly ordered target would generally only set BEFORE bits or AFTER
>>> bits, but not both (and I suggest we canonicalize on AFTER for all such
>>> targets). Thus a strongly ordered target would produce only 2 opcodes per
>>> memory op.
>>>
>>> I supplied both to make it easier to handle a weakly ordered target with
>>> acquire/release bits.
>>>
>>> I would *not* combine the barrier operations back with memory operations in
>>> the backend.  Only armv8 and ia64 can do that, and given the optimization
>>> level at which we generate code, I doubt it would really make much
>>> difference above separate barriers.
>>>
>> On armv8, using load_acquire/store_release instructions makes a
>> significant difference in performance when compared to plain
>> dmb+memory instruction sequence. So I would really like to keep the
>> option of generating acq/rel instructions(by combining barrier and
>> memory or some other way) open.
>
> I'm not so sure about acq/rel flags. Is there any architecture which has
> explicit acq/rel barriers? I suppose acq/rel memory access instructions
> are always load-link and store-conditional and thus rely on exclusive
> memory monitor to support that "conditional" behaviour.

Nope, you can have acq/rel memory operations without exclusive
operations (see ARMv8 ldar and stlr). The exclusive operations also have
ordered and non-ordered variants (ldxr, strx).

> To emulate this
> behaviour we need something more special like "Slow-path for atomic
> instruction translation" [1].
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/407419
>
> Kind regards,
> Sergey


--
Alex Bennée

Pranith Kumar June 6, 2016, 5:11 p.m. UTC | #28

On Mon, Jun 6, 2016 at 12:14 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 06/06/16 18:58, Pranith Kumar wrote:
>> On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>> On 06/06/16 18:47, Pranith Kumar wrote:
>>>> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>>>>> On 03/06/16 21:27, Pranith Kumar wrote:
>>>>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers
>>>>>>> into separate opcodes.  E.g.
>>>>>>>
>>>>>>> MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
>>>>>>> MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
>>>>>>> MO_BAR_LD_A = 128       // prevent later loads from crossing current op
>>>>>>> MO_BAR_ST_A = 256       // prevent later stores from crossing current op
>>>>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
>>>>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
>>>>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A
>>>>>>>
>>>>>>> // Match Sparc MEMBAR as the most flexible host.
>>>>>>> TCG_BAR_LD_LD = 1       // #LoadLoad barrier
>>>>>>> TCG_BAR_ST_LD = 2       // #StoreLoad barrier
>>>>>>> TCG_BAR_LD_ST = 4       // #LoadStore barrier
>>>>>>> TCG_BAR_ST_ST = 8       // #StoreStore barrier
>>>>>>> TCG_BAR_SYNC  = 64      // SEQ_CST barrier
>>>>>> I really like this format. I would also like to add to the frontend:
>>>>>>
>>>>> Actually, the acquire barrier is a combined load-load + load-store
>>>>> barrier; and the release barrier is a combo of load-store + store-store
>>>>> barriers.
>>>>>
>>>> All the above are two-way barriers. Acquire/Release are one-way
>>>> barriers. So we cannot combine the above to get acquire/release
>>>> semantics without being conservative.
>>> Do you mean *barriers* or *memory access* operations implying memory
>>> ordering?
>> I meant the latter. I know no arch which has acquire/release barriers.
>> Sorry for the confusion.
>
> So I am. By the way, what's the difference between sequential consistent
> *barrier* and a combination of all the other barriers?


If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
other four barriers. I am not sure if we can just construct SYNC like
this or if we need to define it explicitly though.

Richard Henderson June 6, 2016, 7:23 p.m. UTC | #29

On 06/06/2016 10:11 AM, Pranith Kumar wrote:
> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
> other four barriers. I am not sure if we can just construct SYNC like
> this or if we need to define it explicitly though.

AFAICS, sparc membar #sync is stronger.

Compare PowerPC hwsync and lwsync.  I would define lwsync as the OR of the 
other 4 barriers, but hwsync as TCG_BAR_SYNC.

r~

Pranith Kumar June 6, 2016, 7:28 p.m. UTC | #30

On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/06/2016 10:11 AM, Pranith Kumar wrote:
>>
>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
>> other four barriers. I am not sure if we can just construct SYNC like
>> this or if we need to define it explicitly though.
>
>
> AFAICS, sparc membar #sync is stronger.

I tried looking it up but it's not clear. How is it stronger? And do
we need those strong guarantees in our front-end/back-end?

Thanks,

Sergey Fedorov June 6, 2016, 8:30 p.m. UTC | #31

On 06/06/16 22:28, Pranith Kumar wrote:
> On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote:
>> On 06/06/2016 10:11 AM, Pranith Kumar wrote:
>>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
>>> other four barriers. I am not sure if we can just construct SYNC like
>>> this or if we need to define it explicitly though.
>>
>> AFAICS, sparc membar #sync is stronger.
> I tried looking it up but it's not clear. How is it stronger? And do
> we need those strong guarantees in our front-end/back-end?

That is not clear for me either :( AFAIU, PPC's lwsync does allow stores
to be reordered after loads but hwsync - not. I suspect Sparc's membar
#Sync is used to ensure that some system operations are complete before
proceeding with execution. I'm not sure we need to introduce this into
TCG. It needs to be clear what is it and how to use it.

Kind regards,
Sergey

Peter Maydell June 6, 2016, 9 p.m. UTC | #32

On 6 June 2016 at 21:30, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> On 06/06/16 22:28, Pranith Kumar wrote:
>> On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote:
>>> On 06/06/2016 10:11 AM, Pranith Kumar wrote:
>>>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
>>>> other four barriers. I am not sure if we can just construct SYNC like
>>>> this or if we need to define it explicitly though.
>>>
>>> AFAICS, sparc membar #sync is stronger.
>> I tried looking it up but it's not clear. How is it stronger? And do
>> we need those strong guarantees in our front-end/back-end?
>
> That is not clear for me either :( AFAIU, PPC's lwsync does allow stores
> to be reordered after loads but hwsync - not.

Yes, from the PoV of the other CPU. That is, for write-then-read by
CPU 0, CPU 0 will always read what it wrote, but other CPUs don't
necessarily see the write before the read is satisfied.
https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
describes the difference in sections 3.2 and 3.3, and has an
example in section 6 of a situation which requires a full
(hw)sync and an lwsync is not sufficient.

> I suspect Sparc's membar
> #Sync is used to ensure that some system operations are complete before
> proceeding with execution. I'm not sure we need to introduce this into
> TCG. It needs to be clear what is it and how to use it.

My reading of the manual is that the SPARC "membar #Sync" is like ARM
ISB -- it enforces that any instruction (whether a memory access or not)
before it must finish before anything after it can start. It only
affects the CPU that issues it (assuming you didn't also specify
any of the bits requesting memory barriers!) Since TCG doesn't attempt
to reorder instructions, we likely don't need to do anything except
maybe end the current TB. Also if we're still trying to do TLB
operations on other CPUs asynchronously we need to wait for them to
finish; I forget what the conclusion was on that idea.
PPC equivalent insn is isync I think.

thanks
-- PMM

Sergey Fedorov June 6, 2016, 9:49 p.m. UTC | #33

On 07/06/16 00:00, Peter Maydell wrote:
> On 6 June 2016 at 21:30, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 06/06/16 22:28, Pranith Kumar wrote:
>>> On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>> On 06/06/2016 10:11 AM, Pranith Kumar wrote:
>>>>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the
>>>>> other four barriers. I am not sure if we can just construct SYNC like
>>>>> this or if we need to define it explicitly though.
>>>> AFAICS, sparc membar #sync is stronger.
>>> I tried looking it up but it's not clear. How is it stronger? And do
>>> we need those strong guarantees in our front-end/back-end?
>> That is not clear for me either :( AFAIU, PPC's lwsync does allow stores
>> to be reordered after loads but hwsync - not.
> Yes, from the PoV of the other CPU. That is, for write-then-read by
> CPU 0, CPU 0 will always read what it wrote, but other CPUs don't
> necessarily see the write before the read is satisfied.
> https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
> describes the difference in sections 3.2 and 3.3, and has an
> example in section 6 of a situation which requires a full
> (hw)sync and an lwsync is not sufficient.
>
>> I suspect Sparc's membar
>> #Sync is used to ensure that some system operations are complete before
>> proceeding with execution. I'm not sure we need to introduce this into
>> TCG. It needs to be clear what is it and how to use it.
> My reading of the manual is that the SPARC "membar #Sync" is like ARM
> ISB -- it enforces that any instruction (whether a memory access or not)
> before it must finish before anything after it can start. It only
> affects the CPU that issues it (assuming you didn't also specify
> any of the bits requesting memory barriers!) Since TCG doesn't attempt
> to reorder instructions, we likely don't need to do anything except
> maybe end the current TB. Also if we're still trying to do TLB
> operations on other CPUs asynchronously we need to wait for them to
> finish; I forget what the conclusion was on that idea.
> PPC equivalent insn is isync I think.
>

Thanks for commenting this, Peter. AFAIU, a sequential consistency
barrier is stronger than a release-aquire barrier because it provides
"transitivity/commutativity" [1]. This is what general barrier
guarantees in Linux [2]. I especially like this piece of description
from [2]:

    ... if this example runs on a system where CPUs 1 and 2 share a
    store buffer or a level of cache, CPU 2 might have early access to
    CPU 1's writes. General barriers are therefore required to ensure
    that all CPUs agree on the combined order of CPU 1's and CPU 2's
    accesses.

Current Linux kernel implements Sparc's smp_mb()/__smp_mb()/mb() with
"membar #StoreLoad" [3]. So we'll probably be fine with just RR, RW, WR,
and WW bits in the TCG memory barrier operation attribute.

Kind regards,
Sergey

[1] https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Overall_Summary
[2]
http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt#L1268
[3]
http://lxr.free-electrons.com/source/arch/sparc/include/asm/barrier_64.h#L36

[RFC,v2,01/13] Introduce TCGOpcode for memory barrier

Commit Message

Comments

Patch