Message ID: 20160531183928.29406-2-bobby.prani@gmail.com
State: New, archived
On 05/31/2016 11:39 AM, Pranith Kumar wrote:
> +********* Memory Barrier support
> +
> +* mb <$arg>

Document what $arg should be.

> +Generate a target memory barrier instruction to ensure memory ordering as being
> +enforced by a corresponding guest memory barrier instruction. The ordering
> +enforced by the backend may be stricter than the ordering required by the guest.
> +It cannot be weaker. This opcode takes an optional constant argument if required
> +to generate the appropriate barrier instruction. The backend should take care to

The argument is *not* optional.

> +void tcg_gen_mb(TCGArg a)
> +{
> +    /* ??? Enable only when MTTCG is enabled. */
> +    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);

Pass A to tcg_gen_op1, not 0.

> +/* TCGOpmb args */
> +#define TCG_MB_FULL ((TCGArg)(0))
> +#define TCG_MB_READ ((TCGArg)(1))
> +#define TCG_MB_WRITE ((TCGArg)(2))
> +#define TCG_MB_ACQUIRE ((TCGArg)(3))
> +#define TCG_MB_RELEASE ((TCGArg)(4))

This is, IMO, confused.  Either we should use the C++11 barrier types, or
the Linux barrier types, but not both.


r~
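For reference, a minimal sketch of the generator with both review points addressed (the argument documented and actually passed through); illustrative only, not the revised patch:

    /* Sketch only: emit the mb opcode with its barrier-kind argument.
     * "a" is the constant describing the kind of barrier requested by the
     * guest (the TCG_MB_* values in this patch); it must always be given. */
    void tcg_gen_mb(TCGArg a)
    {
        /* ??? Only emit for SMP guests when MTTCG is enabled. */
        tcg_gen_op1(&tcg_ctx, INDEX_op_mb, a);
    }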
Hi Richard,

Thanks for the review. I will make the changes you pointed out. One point below:

On Tue, May 31, 2016 at 4:24 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 05/31/2016 11:39 AM, Pranith Kumar wrote:
>> +/* TCGOpmb args */
>> +#define TCG_MB_FULL ((TCGArg)(0))
>> +#define TCG_MB_READ ((TCGArg)(1))
>> +#define TCG_MB_WRITE ((TCGArg)(2))
>> +#define TCG_MB_ACQUIRE ((TCGArg)(3))
>> +#define TCG_MB_RELEASE ((TCGArg)(4))
>
> This is, IMO, confused.  Either we should use the C++11 barrier types, or
> the Linux barrier types, but not both.

This part of the design is still being fleshed out. The above listing
is all the kinds of barriers we can encounter during translation. Maybe
I am missing something, but C++11 and Linux barrier types have
higher level semantics which is difficult to infer at the instruction
level.

All we want to do here is map a barrier instruction from guest to a
barrier instruction on host. This mapping is 1:1 if the host has
barrier instructions with matching semantics. Otherwise we generate a
multi-op instruction sequence. Some examples are: load acquire(ldar)
on ARM64 guest will map to dmb+load on ARMv7 target, store
fence(sfence) on x86 guest will map to dmb on ARMv7 target.

So we need to support all kinds of barrier types we expect the guest to
have, then map them to the host.

Thanks,
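To make the guest-to-host mapping concrete, a rough sketch of how a frontend might emit the new opcode for a guest fence instruction, using the TCG_MB_* constants from this patch. The helper name and the meaning assumed for TCG_MB_WRITE are hypothetical, not the actual target-i386 code:

    /* Hypothetical frontend helper: translate a guest x86 SFENCE.  The
     * frontend only records the ordering the guest asked for; the backend
     * later decides which host instruction (if any) is needed, e.g. dmb on
     * an ARMv7 host, or nothing at all on a strongly ordered x86 host. */
    static void gen_sfence(void)
    {
        tcg_gen_mb(TCG_MB_WRITE);   /* store-store ordering requested by the guest */
    }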
On 06/01/2016 11:43 AM, Pranith Kumar wrote: >> This is, IMO, confused. Either we should use the C++11 barrier types, or >> the Linux barrier types, but not both. > > This part of the design is still being fleshed out. The above listing > is all the kinds of barriers we can encounter during translation. May > be I am missing something, but C++11 and Linux barrier types have > higher level semantics which is difficult to infer at the instruction > level. Fair enough. > All we want to do here is map a barrier instruction from guest to a > barrier instruction on hist. This mapping is 1:1 if the host has > barrier instructions with matching semantics. Otherwise we generate a > multi-op instruction sequence. Some examples are: load acquire(ldar) > on ARM64 guest will map to dmb+load on ARMv7 target, store > fence(sfence) on x86 guest will map to dmb on ARMv7 target. Perhaps we should model this like the Sparc membar instruction, with individual bits for all combinations of Load x Store. One can then describe exactly what the target semantics are for each barrier. r~
On 02/06/16 00:35, Richard Henderson wrote:
> On 06/01/2016 11:43 AM, Pranith Kumar wrote:
>> All we want to do here is map a barrier instruction from guest to a
>> barrier instruction on host. This mapping is 1:1 if the host has
>> barrier instructions with matching semantics. Otherwise we generate a
>> multi-op instruction sequence. Some examples are: load acquire(ldar)
>> on ARM64 guest will map to dmb+load on ARMv7 target, store
>> fence(sfence) on x86 guest will map to dmb on ARMv7 target.
>
> Perhaps we should model this like the Sparc membar instruction, with
> individual bits for all combinations of Load x Store. One can then
> describe exactly what the target semantics are for each barrier.

Seconded. Given that we don't support an Alpha host, we can ignore its
unique "Dependent loads reordered" feature. I suppose we're not going to
add Alpha host support :)

Kind regards,
Sergey
On 31/05/16 21:39, Pranith Kumar wrote:
> diff --git a/tcg/README b/tcg/README
> index f4a8ac1..cfe79d7 100644
> --- a/tcg/README
> +++ b/tcg/README
> @@ -402,6 +402,23 @@ double-word product T0. The later is returned in two single-word outputs.
>
>  Similar to mulu2, except the two inputs T1 and T2 are signed.
>
> +********* Memory Barrier support
> +
> +* mb <$arg>
> +
> +Generate a target memory barrier instruction to ensure memory ordering as being
> +enforced by a corresponding guest memory barrier instruction. The ordering
> +enforced by the backend may be stricter than the ordering required by the guest.
> +It cannot be weaker. This opcode takes an optional constant argument if required
> +to generate the appropriate barrier instruction. The backend should take care to
> +emit the target barrier instruction only when necessary i.e., for SMP guests and
> +when MTTCG is enabled.
> +
> +The guest translators should generate this opcode for all guest instructions
> +which have ordering side effects.

I think we need to extend TCG load/store instruction attributes to
provide information about guest ordering requirements and leave this TCG
operation only for explicit barrier instruction translation.

> +
> +Please see docs/atomics.txt for more information on memory barriers.
> +
>  ********* 64-bit guest on 32-bit host support
>
>  The following opcodes are internal to TCG. Thus they are to be implemented by
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index f554b86..a6f01a7 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -143,6 +143,12 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
>      tcg_emit_op(ctx, opc, pi);
>  }
>
> +void tcg_gen_mb(TCGArg a)
> +{
> +    /* ??? Enable only when MTTCG is enabled. */
> +    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);

Yes, this could be a right place to check for MTTCG enabled and number
of CPUs emulated. If we do it here, then we should adjust the
documentation stating that the backend should take care of it.

Kind regards,
Sergey
Sergey Fedorov writes: > On 31/05/16 21:39, Pranith Kumar wrote: >> diff --git a/tcg/README b/tcg/README >> index f4a8ac1..cfe79d7 100644 >> --- a/tcg/README >> +++ b/tcg/README >> @@ -402,6 +402,23 @@ double-word product T0. The later is returned in two single-word outputs. >> >> Similar to mulu2, except the two inputs T1 and T2 are signed. >> >> +********* Memory Barrier support >> + >> +* mb <$arg> >> + >> +Generate a target memory barrier instruction to ensure memory ordering as being >> +enforced by a corresponding guest memory barrier instruction. The ordering >> +enforced by the backend may be stricter than the ordering required by the guest. >> +It cannot be weaker. This opcode takes an optional constant argument if required >> +to generate the appropriate barrier instruction. The backend should take care to >> +emit the target barrier instruction only when necessary i.e., for SMP guests and >> +when MTTCG is enabled. >> + >> +The guest translators should generate this opcode for all guest instructions >> +which have ordering side effects. > > I think we need to extend TCG load/store instruction attributes to > provide information about guest ordering requirements and leave this TCG > operation only for explicit barrier instruction translation. > Yes, I am working on adding flag argument to the TCG MemOp which indicates this. >> + >> +Please see docs/atomics.txt for more information on memory barriers. >> + >> ********* 64-bit guest on 32-bit host support >> >> The following opcodes are internal to TCG. Thus they are to be implemented by >> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c >> index f554b86..a6f01a7 100644 >> --- a/tcg/tcg-op.c >> +++ b/tcg/tcg-op.c >> @@ -143,6 +143,12 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2, >> tcg_emit_op(ctx, opc, pi); >> } >> >> +void tcg_gen_mb(TCGArg a) >> +{ >> + /* ??? Enable only when MTTCG is enabled. */ >> + tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0); > > Yes, this could be a right place to check for MTTCG enabled and number > of CPUs emulated. If we do it here, then we should adjust the > documentation stating that the backend should take care of it. > I added the check in Patch 13. I will update the documentation to reflect this. Thanks,
On 06/02/2016 09:30 AM, Sergey Fedorov wrote: > I think we need to extend TCG load/store instruction attributes to > provide information about guest ordering requirements and leave this TCG > operation only for explicit barrier instruction translation. I do not agree. I think separate barriers are much cleaner and easier to manage and reason with. r~
On 06/02/2016 11:42 AM, Pranith Kumar wrote: > Yes, I am working on adding flag argument to the TCG MemOp which indicates > this. Please don't. I don't think this is the right way to go. r~
On 02/06/16 23:36, Richard Henderson wrote: > On 06/02/2016 09:30 AM, Sergey Fedorov wrote: >> I think we need to extend TCG load/store instruction attributes to >> provide information about guest ordering requirements and leave this TCG >> operation only for explicit barrier instruction translation. > > I do not agree. I think separate barriers are much cleaner and easier > to manage and reason with. > How are we going to emulate strongly-ordered guests on weakly-ordered hosts then? I think if every load/store operation must specify which ordering it implies then this task would be quite simple. Kind regards, Sergey
On 06/02/2016 01:38 PM, Sergey Fedorov wrote:
> On 02/06/16 23:36, Richard Henderson wrote:
>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote:
>>> I think we need to extend TCG load/store instruction attributes to
>>> provide information about guest ordering requirements and leave this TCG
>>> operation only for explicit barrier instruction translation.
>>
>> I do not agree.  I think separate barriers are much cleaner and easier
>> to manage and reason with.
>>
>
> How are we going to emulate strongly-ordered guests on weakly-ordered
> hosts then? I think if every load/store operation must specify which
> ordering it implies then this task would be quite simple.

Hum.  That does seem helpful-ish.  But I'm not certain how helpful it
is to complicate the helper functions even further.

What if we have tcg_canonicalize_memop (or some such) split off the
barriers into separate opcodes.  E.g.

MO_BAR_LD_B = 32        // prevent earlier loads from crossing current op
MO_BAR_ST_B = 64        // prevent earlier stores from crossing current op
MO_BAR_LD_A = 128       // prevent later loads from crossing current op
MO_BAR_ST_A = 256       // prevent later stores from crossing current op
MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B
MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A
MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A

// Match Sparc MEMBAR as the most flexible host.
TCG_BAR_LD_LD = 1       // #LoadLoad barrier
TCG_BAR_ST_LD = 2       // #StoreLoad barrier
TCG_BAR_LD_ST = 4       // #LoadStore barrier
TCG_BAR_ST_ST = 8       // #StoreStore barrier
TCG_BAR_SYNC  = 64      // SEQ_CST barrier

where

  tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER)

emits

  mb            TCG_BAR_LD_LD
  qemu_ld_i32   x, y, i, m
  mb            TCG_BAR_LD_ST

We can then add an optimization pass which folds barriers with no
memory operations in between, so that duplicates are eliminated.


r~
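To make the proposed split concrete, here is a rough sketch of a helper that strips the barrier bits off a 32-bit load and emits mb opcodes around it. The helper name is a placeholder, and it assumes the MO_BAR_*/TCG_BAR_* values listed above plus a tcg_gen_mb() taking the TCG_BAR_* mask; it is not existing QEMU API:

    /* Sketch only: split barrier bits off a guest 32-bit load. */
    static void gen_ld_i32_split_barriers(TCGv_i32 val, TCGv addr,
                                          TCGArg idx, TCGMemOp memop)
    {
        TCGArg before = 0, after = 0;

        if (memop & MO_BAR_LD_B) {
            before |= TCG_BAR_LD_LD;    /* earlier loads must stay before us */
        }
        if (memop & MO_BAR_ST_B) {
            before |= TCG_BAR_ST_LD;    /* earlier stores must stay before us */
        }
        if (memop & MO_BAR_LD_A) {
            after |= TCG_BAR_LD_LD;     /* later loads must stay after us */
        }
        if (memop & MO_BAR_ST_A) {
            after |= TCG_BAR_LD_ST;     /* later stores must stay after us */
        }

        if (before) {
            tcg_gen_mb(before);
        }
        tcg_gen_qemu_ld_i32(val, addr, idx, memop & ~MO_BAR_MASK);
        if (after) {
            tcg_gen_mb(after);
        }
    }

An analogous store helper would use the store-side bits; the folding pass mentioned above could then merge neighbouring mb ops.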
On 03/06/16 00:18, Richard Henderson wrote: > On 06/02/2016 01:38 PM, Sergey Fedorov wrote: >> On 02/06/16 23:36, Richard Henderson wrote: >>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote: >>>> I think we need to extend TCG load/store instruction attributes to >>>> provide information about guest ordering requirements and leave >>>> this TCG >>>> operation only for explicit barrier instruction translation. >>> >>> I do not agree. I think separate barriers are much cleaner and easier >>> to manage and reason with. >>> >> >> How are we going to emulate strongly-ordered guests on weakly-ordered >> hosts then? I think if every load/store operation must specify which >> ordering it implies then this task would be quite simple. > > Hum. That does seem helpful-ish. But I'm not certain how helpful it > is to complicate the helper functions even further. > > What if we have tcg_canonicalize_memop (or some such) split off the > barriers into separate opcodes. E.g. > > MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op > MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op > MO_BAR_LD_A = 128 // prevent later loads from crossing current op > MO_BAR_ST_A = 256 // prevent later stores from crossing current op > MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B > MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A > MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A > > // Match Sparc MEMBAR as the most flexible host. > TCG_BAR_LD_LD = 1 // #LoadLoad barrier > TCG_BAR_ST_LD = 2 // #StoreLoad barrier > TCG_BAR_LD_ST = 4 // #LoadStore barrier > TCG_BAR_ST_ST = 8 // #StoreStore barrier > TCG_BAR_SYNC = 64 // SEQ_CST barrier > > where > > tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER) > > emits > > mb TCG_BAR_LD_LD > qemu_ld_i32 x, y, i, m > mb TCG_BAR_LD_ST > > We can then add an optimization pass which folds barriers with no > memory operations in between, so that duplicates are eliminated. It would give us three TCG operations for each memory operation instead of one. But then we might like to combine these barrier operations back with memory operations in each backend. If we propagate memory ordering semantics up to the backend, it can decide itself what instructions are best to generate. So I would just focus on translating only explicit memory barrier operations for now. Kind regards, Sergey
On 06/02/2016 02:37 PM, Sergey Fedorov wrote: > On 03/06/16 00:18, Richard Henderson wrote: >> On 06/02/2016 01:38 PM, Sergey Fedorov wrote: >>> On 02/06/16 23:36, Richard Henderson wrote: >>>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote: >>>>> I think we need to extend TCG load/store instruction attributes to >>>>> provide information about guest ordering requirements and leave >>>>> this TCG >>>>> operation only for explicit barrier instruction translation. >>>> >>>> I do not agree. I think separate barriers are much cleaner and easier >>>> to manage and reason with. >>>> >>> >>> How are we going to emulate strongly-ordered guests on weakly-ordered >>> hosts then? I think if every load/store operation must specify which >>> ordering it implies then this task would be quite simple. >> >> Hum. That does seem helpful-ish. But I'm not certain how helpful it >> is to complicate the helper functions even further. >> >> What if we have tcg_canonicalize_memop (or some such) split off the >> barriers into separate opcodes. E.g. >> >> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >> >> // Match Sparc MEMBAR as the most flexible host. >> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >> TCG_BAR_LD_ST = 4 // #LoadStore barrier >> TCG_BAR_ST_ST = 8 // #StoreStore barrier >> TCG_BAR_SYNC = 64 // SEQ_CST barrier >> >> where >> >> tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER) >> >> emits >> >> mb TCG_BAR_LD_LD >> qemu_ld_i32 x, y, i, m >> mb TCG_BAR_LD_ST >> >> We can then add an optimization pass which folds barriers with no >> memory operations in between, so that duplicates are eliminated. > > It would give us three TCG operations for each memory operation instead > of one. But then we might like to combine these barrier operations back > with memory operations in each backend. If we propagate memory ordering > semantics up to the backend, it can decide itself what instructions are > best to generate. A strongly ordered target would generally only set BEFORE bits or AFTER bits, but not both (and I suggest we canonicalize on AFTER for all such targets). Thus a strongly ordered target would produce only 2 opcodes per memory op. I supplied both to make it easier to handle a weakly ordered target with acquire/release bits. I would *not* combine the barrier operations back with memory operations in the backend. Only armv8 and ia64 can do that, and given the optimization level at which we generate code, I doubt it would really make much difference above separate barriers. > So I would just focus on translating only explicit memory barrier > operations for now. Then why did you bring it up? r~
On 03/06/16 04:08, Richard Henderson wrote: > On 06/02/2016 02:37 PM, Sergey Fedorov wrote: >> On 03/06/16 00:18, Richard Henderson wrote: >>> On 06/02/2016 01:38 PM, Sergey Fedorov wrote: >>>> On 02/06/16 23:36, Richard Henderson wrote: >>>>> On 06/02/2016 09:30 AM, Sergey Fedorov wrote: >>>>>> I think we need to extend TCG load/store instruction attributes to >>>>>> provide information about guest ordering requirements and leave >>>>>> this TCG >>>>>> operation only for explicit barrier instruction translation. >>>>> >>>>> I do not agree. I think separate barriers are much cleaner and >>>>> easier >>>>> to manage and reason with. >>>>> >>>> >>>> How are we going to emulate strongly-ordered guests on weakly-ordered >>>> hosts then? I think if every load/store operation must specify which >>>> ordering it implies then this task would be quite simple. >>> >>> Hum. That does seem helpful-ish. But I'm not certain how helpful it >>> is to complicate the helper functions even further. >>> >>> What if we have tcg_canonicalize_memop (or some such) split off the >>> barriers into separate opcodes. E.g. >>> >>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>> >>> // Match Sparc MEMBAR as the most flexible host. >>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >>> >>> where >>> >>> tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER) >>> >>> emits >>> >>> mb TCG_BAR_LD_LD >>> qemu_ld_i32 x, y, i, m >>> mb TCG_BAR_LD_ST >>> >>> We can then add an optimization pass which folds barriers with no >>> memory operations in between, so that duplicates are eliminated. >> >> It would give us three TCG operations for each memory operation instead >> of one. But then we might like to combine these barrier operations back >> with memory operations in each backend. If we propagate memory ordering >> semantics up to the backend, it can decide itself what instructions are >> best to generate. > > A strongly ordered target would generally only set BEFORE bits or > AFTER bits, but not both (and I suggest we canonicalize on AFTER for > all such targets). Thus a strongly ordered target would produce only 2 > opcodes per memory op. > > I supplied both to make it easier to handle a weakly ordered target > with acquire/release bits. > > I would *not* combine the barrier operations back with memory > operations in the backend. Only armv8 and ia64 can do that, and given > the optimization level at which we generate code, I doubt it would > really make much difference above separate barriers. So your suggestion is to generate different TCG opcode sequences depending on the underlying target architecture? And you are against forwarding this task further, to the backend code? > >> So I would just focus on translating only explicit memory barrier >> operations for now. > > Then why did you bring it up? I'm not sure I got the question right. 
I suggested to avoid using this TCG operation to emulate guest's memory ordering requirements for loads/stores that can be supplied with memory ordering requirement information which each backend can decide how to translate together with the load/store (possible just ignore it as it is the case for strongly-ordered hosts). I think we just need to translate explicit memory barrier instructions. For example, emulating ARM guest on x86 host requires ARM dmb instruction to be translated to x86 mfence instruction to prevent store-after-load reordering. At the same time, we don't have to generate anything special for loads/stores since x86 is a strongly-ordered architecture. Kind regards, Sergey
On 06/03/2016 08:16 AM, Sergey Fedorov wrote: > On 03/06/16 04:08, Richard Henderson wrote: > So your suggestion is to generate different TCG opcode sequences > depending on the underlying target architecture? And you are against > forwarding this task further, to the backend code? Yes, I would prefer to have, in the opcode stream, a separate opcode for barriers. This aids both the common case, where most of our hosts require separate barriers, as well as simplicity. I am not opposed to letting the translators describe the memory model with barrier data along with memory operations, but I'd really prefer that those be split apart during initial opcode generation. >>> So I would just focus on translating only explicit memory barrier >>> operations for now. >> >> Then why did you bring it up? > > I'm not sure I got the question right. I suggested to avoid using this > TCG operation to emulate guest's memory ordering requirements for > loads/stores that can be supplied with memory ordering requirement > information which each backend can decide how to translate together with > the load/store (possible just ignore it as it is the case for > strongly-ordered hosts). I think we just need to translate explicit > memory barrier instructions. > > For example, emulating ARM guest on x86 host requires ARM dmb > instruction to be translated to x86 mfence instruction to prevent > store-after-load reordering. At the same time, we don't have to generate > anything special for loads/stores since x86 is a strongly-ordered > architecture. Ah, so you'd prefer that we not think about optimizing barriers at the moment. Fine, but I'd prefer to think about *how* they might be optimized now, so that we *can* later. r~
On 03/06/16 18:45, Richard Henderson wrote:
> On 06/03/2016 08:16 AM, Sergey Fedorov wrote:
>> On 03/06/16 04:08, Richard Henderson wrote:
>> So your suggestion is to generate different TCG opcode sequences
>> depending on the underlying target architecture? And you are against
>> forwarding this task further, to the backend code?
>
> Yes, I would prefer to have, in the opcode stream, a separate opcode
> for barriers.  This aids both the common case, where most of our hosts
> require separate barriers, as well as simplicity.
>
> I am not opposed to letting the translators describe the memory model
> with barrier data along with memory operations, but I'd really prefer
> that those be split apart during initial opcode generation.
>
>>>> So I would just focus on translating only explicit memory barrier
>>>> operations for now.
>>>
>>> Then why did you bring it up?
>>
>> I'm not sure I got the question right. I suggested to avoid using this
>> TCG operation to emulate guest's memory ordering requirements for
>> loads/stores that can be supplied with memory ordering requirement
>> information which each backend can decide how to translate together with
>> the load/store (possible just ignore it as it is the case for
>> strongly-ordered hosts). I think we just need to translate explicit
>> memory barrier instructions.
>>
>> For example, emulating ARM guest on x86 host requires ARM dmb
>> instruction to be translated to x86 mfence instruction to prevent
>> store-after-load reordering. At the same time, we don't have to generate
>> anything special for loads/stores since x86 is a strongly-ordered
>> architecture.
>
> Ah, so you'd prefer that we not think about optimizing barriers at the
> moment.  Fine, but I'd prefer to think about *how* they might be
> optimized now, so that we *can* later.

Not exactly. We need to have a TCG operation for various types of
explicit barriers in order to translate guest explicit barrier
instructions. I like your idea of following Sparc's way of specifying
membar instruction attributes, which the backend can use to generate
optimal instructions.

I think we also need to associate memory ordering attributes with
load/store TCG operations. I'm not sure how it would be best to handle
the implicit memory ordering requirements of loads/stores, but that is
probably out of the scope of this series. I'd propagate this attribute
up to the backend and let it decide what kind of instructions to
generate.

I'd prefer to see only explicit barrier operations in the TCG opcode
stream.

Kind regards,
Sergey
On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: > > Hum. That does seem helpful-ish. But I'm not certain how helpful it is to > complicate the helper functions even further. > > What if we have tcg_canonicalize_memop (or some such) split off the barriers > into separate opcodes. E.g. > > MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op > MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op > MO_BAR_LD_A = 128 // prevent later loads from crossing current op > MO_BAR_ST_A = 256 // prevent later stores from crossing current op > MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B > MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A > MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A > > // Match Sparc MEMBAR as the most flexible host. > TCG_BAR_LD_LD = 1 // #LoadLoad barrier > TCG_BAR_ST_LD = 2 // #StoreLoad barrier > TCG_BAR_LD_ST = 4 // #LoadStore barrier > TCG_BAR_ST_ST = 8 // #StoreStore barrier > TCG_BAR_SYNC = 64 // SEQ_CST barrier I really like this format. I would also like to add to the frontend: MO_BAR_ACQUIRE MO_BAR_RELEASE and the following to the backend: TCG_BAR_ACQUIRE TCG_BAR_RELEASE since these are one-way barriers and the previous barrier types do not cover them. > > where > > tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER) > > emits > > mb TCG_BAR_LD_LD > qemu_ld_i32 x, y, i, m > mb TCG_BAR_LD_ST > > We can then add an optimization pass which folds barriers with no memory > operations in between, so that duplicates are eliminated. > Yes, folding/eliding these barriers in an optimization pass sounds like a good idea. Thanks,
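As a sketch of the folding idea both of you mention, here is a toy model that merges adjacent mb ops when no memory access separates them. It deliberately does not use QEMU's real TCGOp list API; the types are stand-ins for illustration:

    /* Toy model of the barrier-folding pass: two mb ops with no memory
     * access between them are merged by OR-ing their barrier bits, so a
     * sequence like "mb; mb; qemu_ld" collapses into a single mb. */
    #include <stdbool.h>
    #include <stddef.h>

    enum toy_opc { TOY_MB, TOY_LD, TOY_ST, TOY_OTHER };

    struct toy_op {
        enum toy_opc opc;
        unsigned bar;       /* TCG_BAR_* bits, valid only for TOY_MB */
        bool dead;          /* marked for removal */
    };

    static void fold_barriers(struct toy_op *ops, size_t n)
    {
        struct toy_op *prev_mb = NULL;

        for (size_t i = 0; i < n; i++) {
            switch (ops[i].opc) {
            case TOY_MB:
                if (prev_mb) {
                    /* No memory op since the last barrier: merge into it. */
                    prev_mb->bar |= ops[i].bar;
                    ops[i].dead = true;
                } else {
                    prev_mb = &ops[i];
                }
                break;
            case TOY_LD:
            case TOY_ST:
                /* A memory access separates barriers; stop merging across it. */
                prev_mb = NULL;
                break;
            default:
                /* Other ops are assumed not to touch memory in this model. */
                break;
            }
        }
    }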
On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote: > On 06/02/2016 02:37 PM, Sergey Fedorov wrote: >> >> >> It would give us three TCG operations for each memory operation instead >> of one. But then we might like to combine these barrier operations back >> with memory operations in each backend. If we propagate memory ordering >> semantics up to the backend, it can decide itself what instructions are >> best to generate. > > > A strongly ordered target would generally only set BEFORE bits or AFTER > bits, but not both (and I suggest we canonicalize on AFTER for all such > targets). Thus a strongly ordered target would produce only 2 opcodes per > memory op. > > I supplied both to make it easier to handle a weakly ordered target with > acquire/release bits. > > I would *not* combine the barrier operations back with memory operations in > the backend. Only armv8 and ia64 can do that, and given the optimization > level at which we generate code, I doubt it would really make much > difference above separate barriers. > On armv8, using load_acquire/store_release instructions makes a significant difference in performance when compared to plain dmb+memory instruction sequence. So I would really like to keep the option of generating acq/rel instructions(by combining barrier and memory or some other way) open. Thanks,
On 03/06/16 21:30, Pranith Kumar wrote: > On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote: >> On 06/02/2016 02:37 PM, Sergey Fedorov wrote: >>> >>> It would give us three TCG operations for each memory operation instead >>> of one. But then we might like to combine these barrier operations back >>> with memory operations in each backend. If we propagate memory ordering >>> semantics up to the backend, it can decide itself what instructions are >>> best to generate. >> >> A strongly ordered target would generally only set BEFORE bits or AFTER >> bits, but not both (and I suggest we canonicalize on AFTER for all such >> targets). Thus a strongly ordered target would produce only 2 opcodes per >> memory op. >> >> I supplied both to make it easier to handle a weakly ordered target with >> acquire/release bits. >> >> I would *not* combine the barrier operations back with memory operations in >> the backend. Only armv8 and ia64 can do that, and given the optimization >> level at which we generate code, I doubt it would really make much >> difference above separate barriers. >> > On armv8, using load_acquire/store_release instructions makes a > significant difference in performance when compared to plain > dmb+memory instruction sequence. So I would really like to keep the > option of generating acq/rel instructions(by combining barrier and > memory or some other way) open. I'm not so sure about acq/rel flags. Is there any architecture which has explicit acq/rel barriers? I suppose acq/rel memory access instructions are always load-link and store-conditional and thus rely on exclusive memory monitor to support that "conditional" behaviour. To emulate this behaviour we need something more special like "Slow-path for atomic instruction translation" [1]. [1] http://thread.gmane.org/gmane.comp.emulators.qemu/407419 Kind regards, Sergey
On 03/06/16 21:27, Pranith Kumar wrote: > On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >> > >> > Hum. That does seem helpful-ish. But I'm not certain how helpful it is to >> > complicate the helper functions even further. >> > >> > What if we have tcg_canonicalize_memop (or some such) split off the barriers >> > into separate opcodes. E.g. >> > >> > MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >> > MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >> > MO_BAR_LD_A = 128 // prevent later loads from crossing current op >> > MO_BAR_ST_A = 256 // prevent later stores from crossing current op >> > MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >> > MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >> > MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >> > >> > // Match Sparc MEMBAR as the most flexible host. >> > TCG_BAR_LD_LD = 1 // #LoadLoad barrier >> > TCG_BAR_ST_LD = 2 // #StoreLoad barrier >> > TCG_BAR_LD_ST = 4 // #LoadStore barrier >> > TCG_BAR_ST_ST = 8 // #StoreStore barrier >> > TCG_BAR_SYNC = 64 // SEQ_CST barrier > I really like this format. I would also like to add to the frontend: > > MO_BAR_ACQUIRE > MO_BAR_RELEASE > > and the following to the backend: > > TCG_BAR_ACQUIRE > TCG_BAR_RELEASE > > since these are one-way barriers and the previous barrier types do not > cover them. > Well, my last comment about acq/rel barriers should really go here [Message-Id: 5751DF2D.5040709@gmail.com] Regards, Sergey
On 3 June 2016 at 20:49, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > I'm not so sure about acq/rel flags. Is there any architecture which has > explicit acq/rel barriers? I suppose acq/rel memory access instructions > are always load-link and store-conditional and thus rely on exclusive > memory monitor to support that "conditional" behaviour. This doesn't sound right (at least not for ARM). You can have load-acquire and store-release insns which aren't exclusives, you can have exclusives which aren't acq/rel, and you can have accesses which are both exclusives and acq/rel (and just to complete the set, obviously there are accesses which are neither). The exclusive semantics require the monitor, but acq/rel is just an ordering constraint (sort of like an implicit barrier, but not quite). thanks -- PMM
On 03/06/16 23:43, Peter Maydell wrote: > On 3 June 2016 at 20:49, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >> I'm not so sure about acq/rel flags. Is there any architecture which has >> explicit acq/rel barriers? I suppose acq/rel memory access instructions >> are always load-link and store-conditional and thus rely on exclusive >> memory monitor to support that "conditional" behaviour. > This doesn't sound right (at least not for ARM). You can have > load-acquire and store-release insns which aren't exclusives, > you can have exclusives which aren't acq/rel, and you can > have accesses which are both exclusives and acq/rel (and > just to complete the set, obviously there are accesses which > are neither). The exclusive semantics require the monitor, but > acq/rel is just an ordering constraint (sort of like an implicit > barrier, but not quite). Thanks for clarifying this, Peter. Regards, Sergey
On 03/06/16 21:27, Pranith Kumar wrote: > On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >> Hum. That does seem helpful-ish. But I'm not certain how helpful it is to >> complicate the helper functions even further. >> >> What if we have tcg_canonicalize_memop (or some such) split off the barriers >> into separate opcodes. E.g. >> >> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >> >> // Match Sparc MEMBAR as the most flexible host. >> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >> TCG_BAR_LD_ST = 4 // #LoadStore barrier >> TCG_BAR_ST_ST = 8 // #StoreStore barrier >> TCG_BAR_SYNC = 64 // SEQ_CST barrier > I really like this format. I would also like to add to the frontend: > > MO_BAR_ACQUIRE > MO_BAR_RELEASE > > and the following to the backend: > > TCG_BAR_ACQUIRE > TCG_BAR_RELEASE > > since these are one-way barriers and the previous barrier types do not > cover them. Actually, the acquire barrier is a combined load-load + load-store barrier; and the release barrier is a combo of load-store + store-store barriers. Kind regards, Sergey > >> where >> >> tcg_gen_qemu_ld_i32(x, y, i, m | MO_BAR_LD_BEFORE | MO_BAR_ST_AFTER) >> >> emits >> >> mb TCG_BAR_LD_LD >> qemu_ld_i32 x, y, i, m >> mb TCG_BAR_LD_ST >> >> We can then add an optimization pass which folds barriers with no memory >> operations in between, so that duplicates are eliminated. >> > Yes, folding/eliding these barriers in an optimization pass sounds > like a good idea. > > Thanks,
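Written with the TCG_BAR_* bits from Richard's proposal, the combinations Sergey describes would be the following. These are illustrative definitions only, and, as noted in the next message, such fence combinations are two-way and therefore more conservative than the one-way acquire/release accesses:

    /* Illustrative only: acquire/release expressed as fence combinations. */
    #define TCG_BAR_ACQ_FENCE  (TCG_BAR_LD_LD | TCG_BAR_LD_ST)  /* after a load   */
    #define TCG_BAR_REL_FENCE  (TCG_BAR_LD_ST | TCG_BAR_ST_ST)  /* before a store */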
On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > On 03/06/16 21:27, Pranith Kumar wrote: >> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >>> >>> What if we have tcg_canonicalize_memop (or some such) split off the barriers >>> into separate opcodes. E.g. >>> >>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>> >>> // Match Sparc MEMBAR as the most flexible host. >>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >> I really like this format. I would also like to add to the frontend: >> > > Actually, the acquire barrier is a combined load-load + load-store > barrier; and the release barrier is a combo of load-store + store-store > barriers. > All the above are two-way barriers. Acquire/Release are one-way barriers. So we cannot combine the above to get acquire/release semantics without being conservative. Thanks,
On 06/06/16 18:47, Pranith Kumar wrote: > On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >> On 03/06/16 21:27, Pranith Kumar wrote: >>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers >>>> into separate opcodes. E.g. >>>> >>>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>>> >>>> // Match Sparc MEMBAR as the most flexible host. >>>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >>> I really like this format. I would also like to add to the frontend: >>> >> Actually, the acquire barrier is a combined load-load + load-store >> barrier; and the release barrier is a combo of load-store + store-store >> barriers. >> > All the above are two-way barriers. Acquire/Release are one-way > barriers. So we cannot combine the above to get acquire/release > semantics without being conservative. Do you mean *barriers* or *memory access* operations implying memory ordering? Regards, Sergey
On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > On 06/06/16 18:47, Pranith Kumar wrote: >> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>> On 03/06/16 21:27, Pranith Kumar wrote: >>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers >>>>> into separate opcodes. E.g. >>>>> >>>>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>>>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>>>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>>>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>>>> >>>>> // Match Sparc MEMBAR as the most flexible host. >>>>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>>>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>>>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>>>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>>>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >>>> I really like this format. I would also like to add to the frontend: >>>> >>> Actually, the acquire barrier is a combined load-load + load-store >>> barrier; and the release barrier is a combo of load-store + store-store >>> barriers. >>> >> All the above are two-way barriers. Acquire/Release are one-way >> barriers. So we cannot combine the above to get acquire/release >> semantics without being conservative. > > Do you mean *barriers* or *memory access* operations implying memory > ordering? I meant the latter. I know no arch which has acquire/release barriers. Sorry for the confusion. Thanks,
On 06/06/16 18:58, Pranith Kumar wrote: > On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >> On 06/06/16 18:47, Pranith Kumar wrote: >>> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>>> On 03/06/16 21:27, Pranith Kumar wrote: >>>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >>>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers >>>>>> into separate opcodes. E.g. >>>>>> >>>>>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>>>>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>>>>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>>>>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>>>>> >>>>>> // Match Sparc MEMBAR as the most flexible host. >>>>>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>>>>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>>>>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>>>>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>>>>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >>>>> I really like this format. I would also like to add to the frontend: >>>>> >>>> Actually, the acquire barrier is a combined load-load + load-store >>>> barrier; and the release barrier is a combo of load-store + store-store >>>> barriers. >>>> >>> All the above are two-way barriers. Acquire/Release are one-way >>> barriers. So we cannot combine the above to get acquire/release >>> semantics without being conservative. >> Do you mean *barriers* or *memory access* operations implying memory >> ordering? > I meant the latter. I know no arch which has acquire/release barriers. > Sorry for the confusion. So I am. By the way, what's the difference between sequential consistent *barrier* and a combination of all the other barriers? Kind regards, Sergey
Sergey Fedorov <serge.fdrv@gmail.com> writes: > On 03/06/16 21:30, Pranith Kumar wrote: >> On Thu, Jun 2, 2016 at 9:08 PM, Richard Henderson <rth@twiddle.net> wrote: >>> On 06/02/2016 02:37 PM, Sergey Fedorov wrote: >>>> >>>> It would give us three TCG operations for each memory operation instead >>>> of one. But then we might like to combine these barrier operations back >>>> with memory operations in each backend. If we propagate memory ordering >>>> semantics up to the backend, it can decide itself what instructions are >>>> best to generate. >>> >>> A strongly ordered target would generally only set BEFORE bits or AFTER >>> bits, but not both (and I suggest we canonicalize on AFTER for all such >>> targets). Thus a strongly ordered target would produce only 2 opcodes per >>> memory op. >>> >>> I supplied both to make it easier to handle a weakly ordered target with >>> acquire/release bits. >>> >>> I would *not* combine the barrier operations back with memory operations in >>> the backend. Only armv8 and ia64 can do that, and given the optimization >>> level at which we generate code, I doubt it would really make much >>> difference above separate barriers. >>> >> On armv8, using load_acquire/store_release instructions makes a >> significant difference in performance when compared to plain >> dmb+memory instruction sequence. So I would really like to keep the >> option of generating acq/rel instructions(by combining barrier and >> memory or some other way) open. > > I'm not so sure about acq/rel flags. Is there any architecture which has > explicit acq/rel barriers? I suppose acq/rel memory access instructions > are always load-link and store-conditional and thus rely on exclusive > memory monitor to support that "conditional" behaviour. Nope, you can have acq/rel memory operations without exclusive operations (see ARMv8 ldar and stlr). The exclusive operations also have ordered and non-ordered variants (ldxr, strx). > To emulate this > behaviour we need something more special like "Slow-path for atomic > instruction translation" [1]. > > [1] http://thread.gmane.org/gmane.comp.emulators.qemu/407419 > > Kind regards, > Sergey -- Alex Bennée
On Mon, Jun 6, 2016 at 12:14 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > On 06/06/16 18:58, Pranith Kumar wrote: >> On Mon, Jun 6, 2016 at 11:49 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>> On 06/06/16 18:47, Pranith Kumar wrote: >>>> On Mon, Jun 6, 2016 at 11:44 AM, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >>>>> On 03/06/16 21:27, Pranith Kumar wrote: >>>>>> On Thu, Jun 2, 2016 at 5:18 PM, Richard Henderson <rth@twiddle.net> wrote: >>>>>>> What if we have tcg_canonicalize_memop (or some such) split off the barriers >>>>>>> into separate opcodes. E.g. >>>>>>> >>>>>>> MO_BAR_LD_B = 32 // prevent earlier loads from crossing current op >>>>>>> MO_BAR_ST_B = 64 // prevent earlier stores from crossing current op >>>>>>> MO_BAR_LD_A = 128 // prevent later loads from crossing current op >>>>>>> MO_BAR_ST_A = 256 // prevent later stores from crossing current op >>>>>>> MO_BAR_LDST_B = MO_BAR_LD_B | MO_BAR_ST_B >>>>>>> MO_BAR_LDST_A = MO_BAR_LD_A | MO_BAR_ST_A >>>>>>> MO_BAR_MASK = MO_BAR_LDST_B | MO_BAR_LDST_A >>>>>>> >>>>>>> // Match Sparc MEMBAR as the most flexible host. >>>>>>> TCG_BAR_LD_LD = 1 // #LoadLoad barrier >>>>>>> TCG_BAR_ST_LD = 2 // #StoreLoad barrier >>>>>>> TCG_BAR_LD_ST = 4 // #LoadStore barrier >>>>>>> TCG_BAR_ST_ST = 8 // #StoreStore barrier >>>>>>> TCG_BAR_SYNC = 64 // SEQ_CST barrier >>>>>> I really like this format. I would also like to add to the frontend: >>>>>> >>>>> Actually, the acquire barrier is a combined load-load + load-store >>>>> barrier; and the release barrier is a combo of load-store + store-store >>>>> barriers. >>>>> >>>> All the above are two-way barriers. Acquire/Release are one-way >>>> barriers. So we cannot combine the above to get acquire/release >>>> semantics without being conservative. >>> Do you mean *barriers* or *memory access* operations implying memory >>> ordering? >> I meant the latter. I know no arch which has acquire/release barriers. >> Sorry for the confusion. > > So I am. By the way, what's the difference between sequential consistent > *barrier* and a combination of all the other barriers? If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the other four barriers. I am not sure if we can just construct SYNC like this or if we need to define it explicitly though.
On 06/06/2016 10:11 AM, Pranith Kumar wrote: > If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the > other four barriers. I am not sure if we can just construct SYNC like > this or if we need to define it explicitly though. AFAICS, sparc membar #sync is stronger. Compare PowerPC hwsync and lwsync. I would define lwsync as the OR of the other 4 barriers, but hwsync as TCG_BAR_SYNC. r~
On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote: > On 06/06/2016 10:11 AM, Pranith Kumar wrote: >> >> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the >> other four barriers. I am not sure if we can just construct SYNC like >> this or if we need to define it explicitly though. > > > AFAICS, sparc membar #sync is stronger. I tried looking it up but it's not clear. How is it stronger? And do we need those strong guarantees in our front-end/back-end? Thanks,
On 06/06/16 22:28, Pranith Kumar wrote: > On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote: >> On 06/06/2016 10:11 AM, Pranith Kumar wrote: >>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the >>> other four barriers. I am not sure if we can just construct SYNC like >>> this or if we need to define it explicitly though. >> >> AFAICS, sparc membar #sync is stronger. > I tried looking it up but it's not clear. How is it stronger? And do > we need those strong guarantees in our front-end/back-end? That is not clear for me either :( AFAIU, PPC's lwsync does allow stores to be reordered after loads but hwsync - not. I suspect Sparc's membar #Sync is used to ensure that some system operations are complete before proceeding with execution. I'm not sure we need to introduce this into TCG. It needs to be clear what is it and how to use it. Kind regards, Sergey
On 6 June 2016 at 21:30, Sergey Fedorov <serge.fdrv@gmail.com> wrote: > On 06/06/16 22:28, Pranith Kumar wrote: >> On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote: >>> On 06/06/2016 10:11 AM, Pranith Kumar wrote: >>>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the >>>> other four barriers. I am not sure if we can just construct SYNC like >>>> this or if we need to define it explicitly though. >>> >>> AFAICS, sparc membar #sync is stronger. >> I tried looking it up but it's not clear. How is it stronger? And do >> we need those strong guarantees in our front-end/back-end? > > That is not clear for me either :( AFAIU, PPC's lwsync does allow stores > to be reordered after loads but hwsync - not. Yes, from the PoV of the other CPU. That is, for write-then-read by CPU 0, CPU 0 will always read what it wrote, but other CPUs don't necessarily see the write before the read is satisfied. https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf describes the difference in sections 3.2 and 3.3, and has an example in section 6 of a situation which requires a full (hw)sync and an lwsync is not sufficient. > I suspect Sparc's membar > #Sync is used to ensure that some system operations are complete before > proceeding with execution. I'm not sure we need to introduce this into > TCG. It needs to be clear what is it and how to use it. My reading of the manual is that the SPARC "membar #Sync" is like ARM ISB -- it enforces that any instruction (whether a memory access or not) before it must finish before anything after it can start. It only affects the CPU that issues it (assuming you didn't also specify any of the bits requesting memory barriers!) Since TCG doesn't attempt to reorder instructions, we likely don't need to do anything except maybe end the current TB. Also if we're still trying to do TLB operations on other CPUs asynchronously we need to wait for them to finish; I forget what the conclusion was on that idea. PPC equivalent insn is isync I think. thanks -- PMM
On 07/06/16 00:00, Peter Maydell wrote: > On 6 June 2016 at 21:30, Sergey Fedorov <serge.fdrv@gmail.com> wrote: >> On 06/06/16 22:28, Pranith Kumar wrote: >>> On Mon, Jun 6, 2016 at 3:23 PM, Richard Henderson <rth@twiddle.net> wrote: >>>> On 06/06/2016 10:11 AM, Pranith Kumar wrote: >>>>> If I read it correctly TCG_BAR_SYNC is equivalent to OR of all the >>>>> other four barriers. I am not sure if we can just construct SYNC like >>>>> this or if we need to define it explicitly though. >>>> AFAICS, sparc membar #sync is stronger. >>> I tried looking it up but it's not clear. How is it stronger? And do >>> we need those strong guarantees in our front-end/back-end? >> That is not clear for me either :( AFAIU, PPC's lwsync does allow stores >> to be reordered after loads but hwsync - not. > Yes, from the PoV of the other CPU. That is, for write-then-read by > CPU 0, CPU 0 will always read what it wrote, but other CPUs don't > necessarily see the write before the read is satisfied. > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf > describes the difference in sections 3.2 and 3.3, and has an > example in section 6 of a situation which requires a full > (hw)sync and an lwsync is not sufficient. > >> I suspect Sparc's membar >> #Sync is used to ensure that some system operations are complete before >> proceeding with execution. I'm not sure we need to introduce this into >> TCG. It needs to be clear what is it and how to use it. > My reading of the manual is that the SPARC "membar #Sync" is like ARM > ISB -- it enforces that any instruction (whether a memory access or not) > before it must finish before anything after it can start. It only > affects the CPU that issues it (assuming you didn't also specify > any of the bits requesting memory barriers!) Since TCG doesn't attempt > to reorder instructions, we likely don't need to do anything except > maybe end the current TB. Also if we're still trying to do TLB > operations on other CPUs asynchronously we need to wait for them to > finish; I forget what the conclusion was on that idea. > PPC equivalent insn is isync I think. > Thanks for commenting this, Peter. AFAIU, a sequential consistency barrier is stronger than a release-aquire barrier because it provides "transitivity/commutativity" [1]. This is what general barrier guarantees in Linux [2]. I especially like this piece of description from [2]: ... if this example runs on a system where CPUs 1 and 2 share a store buffer or a level of cache, CPU 2 might have early access to CPU 1's writes. General barriers are therefore required to ensure that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. Current Linux kernel implements Sparc's smp_mb()/__smp_mb()/mb() with "membar #StoreLoad" [3]. So we'll probably be fine with just RR, RW, WR, and WW bits in the TCG memory barrier operation attribute. Kind regards, Sergey [1] https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Overall_Summary [2] http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt#L1268 [3] http://lxr.free-electrons.com/source/arch/sparc/include/asm/barrier_64.h#L36
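To recap the lwsync/hwsync distinction discussed above, an illustrative mapping onto the proposed TCG_BAR_* bits might look as follows. The helpers are hypothetical, not actual target-ppc code, and whether hwsync's extra "transitivity" property needs anything beyond these four bits is the open question above:

    /* lwsync orders load-load, load-store and store-store, but lets a store
     * be reordered after a later load; hwsync additionally forbids that. */
    static void gen_lwsync(void)
    {
        tcg_gen_mb(TCG_BAR_LD_LD | TCG_BAR_LD_ST | TCG_BAR_ST_ST);
    }

    static void gen_hwsync(void)
    {
        tcg_gen_mb(TCG_BAR_LD_LD | TCG_BAR_LD_ST | TCG_BAR_ST_LD | TCG_BAR_ST_ST);
    }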
diff --git a/tcg/README b/tcg/README
index f4a8ac1..cfe79d7 100644
--- a/tcg/README
+++ b/tcg/README
@@ -402,6 +402,23 @@ double-word product T0. The later is returned in two single-word outputs.
 
 Similar to mulu2, except the two inputs T1 and T2 are signed.
 
+********* Memory Barrier support
+
+* mb <$arg>
+
+Generate a target memory barrier instruction to ensure memory ordering as being
+enforced by a corresponding guest memory barrier instruction. The ordering
+enforced by the backend may be stricter than the ordering required by the guest.
+It cannot be weaker. This opcode takes an optional constant argument if required
+to generate the appropriate barrier instruction. The backend should take care to
+emit the target barrier instruction only when necessary i.e., for SMP guests and
+when MTTCG is enabled.
+
+The guest translators should generate this opcode for all guest instructions
+which have ordering side effects.
+
+Please see docs/atomics.txt for more information on memory barriers.
+
 ********* 64-bit guest on 32-bit host support
 
 The following opcodes are internal to TCG. Thus they are to be implemented by
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index f554b86..a6f01a7 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -143,6 +143,12 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
     tcg_emit_op(ctx, opc, pi);
 }
 
+void tcg_gen_mb(TCGArg a)
+{
+    /* ??? Enable only when MTTCG is enabled. */
+    tcg_gen_op1(&tcg_ctx, INDEX_op_mb, 0);
+}
+
 /* 32 bit ops */
 
 void tcg_gen_addi_i32(TCGv_i32 ret, TCGv_i32 arg1, int32_t arg2)
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index c446d3d..40920fb 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -261,6 +261,8 @@ static inline void tcg_gen_br(TCGLabel *l)
     tcg_gen_op1(&tcg_ctx, INDEX_op_br, label_arg(l));
 }
 
+void tcg_gen_mb(TCGArg a);
+
 /* Helper calls. */
 
 /* 32 bit ops */
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 6d0410c..c0f3e83 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -42,6 +42,8 @@ DEF(br, 0, 0, 1, TCG_OPF_BB_END)
 # define IMPL64 TCG_OPF_64BIT
 #endif
 
+DEF(mb, 0, 1, 0, 0)
+
 DEF(mov_i32, 1, 1, 0, TCG_OPF_NOT_PRESENT)
 DEF(movi_i32, 1, 0, 1, TCG_OPF_NOT_PRESENT)
 DEF(setcond_i32, 1, 2, 1, 0)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a46d17c..a1d59f7 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -385,6 +385,14 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
 #define TCG_CALL_DUMMY_TCGV MAKE_TCGV_I32(-1)
 #define TCG_CALL_DUMMY_ARG ((TCGArg)(-1))
 
+/* TCGOpmb args */
+#define TCG_MB_FULL ((TCGArg)(0))
+#define TCG_MB_READ ((TCGArg)(1))
+#define TCG_MB_WRITE ((TCGArg)(2))
+#define TCG_MB_ACQUIRE ((TCGArg)(3))
+#define TCG_MB_RELEASE ((TCGArg)(4))
+
+
 /* Conditions.  Note that these are laid out for easy manipulation by
    the functions below:
      bit 0 is used for inverting;