Message ID: 20160618040343.19517-15-bobby.prani@gmail.com (mailing list archive)
State: New, archived
On 06/17/2016 09:03 PM, Pranith Kumar wrote:
>      case 0xe8 ... 0xef: /* lfence */
> +        tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
> +        break;
>      case 0xf0 ... 0xf7: /* mfence */
>          if (!(s->cpuid_features & CPUID_SSE2)
>              || (prefixes & PREFIX_LOCK)) {
>              goto illegal_op;
>          }
> +        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);

You need to duplicate the sse2 check if you split lfence from mfence.

r~
On Sat, Jun 18, 2016 at 1:48 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 06/17/2016 09:03 PM, Pranith Kumar wrote:
>>      case 0xe8 ... 0xef: /* lfence */
>> +        tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>> +        break;
>>      case 0xf0 ... 0xf7: /* mfence */
>>          if (!(s->cpuid_features & CPUID_SSE2)
>>              || (prefixes & PREFIX_LOCK)) {
>>              goto illegal_op;
>>          }
>> +        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
>
> You need to duplicate the sse2 check if you split lfence from mfence.

Yes, I missed this one. I will fix it. Thanks for the review.
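The fix Richard is asking for can be sketched as follows. This is a minimal, self-contained mock for illustration only: the flag values, the `disas_fence` helper, and the recording stub are invented stand-ins, not QEMU's real definitions (those live in tcg/tcg.h and target-i386/translate.c). The point is simply that once lfence takes its own `break` instead of falling through into mfence, it must also carry its own copy of the CPUID_SSE2/PREFIX_LOCK check:

```c
#include <assert.h>

/* Stand-in flag values for illustration only; QEMU's real definitions
 * live in tcg/tcg.h (TCG_MO_*, TCG_BAR_*) and target-i386 headers. */
enum { TCG_MO_LD_LD = 0x01, TCG_MO_ALL = 0x0f, TCG_BAR_SC = 0x40 };
enum { CPUID_SSE2 = 1 << 26, PREFIX_LOCK = 1 << 0 };

static int last_mb = -1;                  /* records the last barrier "emitted" */
static void tcg_gen_mb(int flags) { last_mb = flags; }

/* Hypothetical helper mirroring the corrected switch: lfence no longer
 * falls through to mfence, so it needs its own SSE2/LOCK legality check. */
static int disas_fence(int modrm, int cpuid_features, int prefixes)
{
    switch (modrm) {
    case 0xe8 ... 0xef: /* lfence */
        if (!(cpuid_features & CPUID_SSE2) || (prefixes & PREFIX_LOCK)) {
            return -1; /* illegal_op */
        }
        tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
        return 0;
    case 0xf0 ... 0xf7: /* mfence */
        if (!(cpuid_features & CPUID_SSE2) || (prefixes & PREFIX_LOCK)) {
            return -1; /* illegal_op */
        }
        tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
        return 0;
    }
    return -1;
}
```

Without the duplicated check, an lfence on a CPU model lacking SSE2 (or with a LOCK prefix) would incorrectly emit a barrier instead of raising #UD, which is the bug Richard flagged.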
On 18/06/2016 06:03, Pranith Kumar wrote:
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
>  target-i386/translate.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index bf33e6b..32b0f5c 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>                  || (prefixes & PREFIX_LOCK)) {
>                  goto illegal_op;
>              }
> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>              break;
>          case 0xe8 ... 0xef: /* lfence */
> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
> +            break;

These are unnecessary.  On the other hand, _each and every load_ must be
followed by a LD_LD | LD_ST barrier, and each and every store must be
preceded by a LD_ST | ST_ST barrier.

Paolo

>          case 0xf0 ... 0xf7: /* mfence */
>              if (!(s->cpuid_features & CPUID_SSE2)
>                  || (prefixes & PREFIX_LOCK)) {
>                  goto illegal_op;
>              }
> +            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
>              break;
>
>          default:
On 06/21/2016 12:28 AM, Paolo Bonzini wrote:
> On 18/06/2016 06:03, Pranith Kumar wrote:
>> @@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>>                  || (prefixes & PREFIX_LOCK)) {
>>                  goto illegal_op;
>>              }
>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>              break;
>>          case 0xe8 ... 0xef: /* lfence */
>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>> +            break;
>
> These are unnecessary.  On the other hand, _each and every load_ must be
> followed by a LD_LD | LD_ST barrier, and each and every store must be
> preceded by a LD_ST | ST_ST barrier.

They're not unnecessary if we (1) add those barriers for normal loads
and stores and (2) omit them from the non-temporal loads and stores.

In the meantime they're good documentation.

r~
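Paolo's per-access rule can be sketched with stubbed-out emitters. Everything here is a hypothetical stand-in (the real `tcg_gen_mb()` and `tcg_gen_qemu_ld*/st*` take TCGv operands and emit TCG opcodes); the sketch only shows where the barriers would sit relative to each guest memory access when a strongly-ordered guest runs on a weakly-ordered host:

```c
#include <assert.h>

/* Stand-in flag values and stubbed emitters for illustration; QEMU's real
 * barrier flags are defined in tcg/tcg.h. */
enum { TCG_MO_LD_LD = 0x01, TCG_MO_ST_LD = 0x02, TCG_MO_LD_ST = 0x04,
       TCG_MO_ST_ST = 0x08, TCG_BAR_SC = 0x40 };

static int last_mb;
static void tcg_gen_mb(int flags)  { last_mb = flags; }
static void tcg_gen_qemu_ld(int r) { (void)r; /* pretend to emit a guest load */ }
static void tcg_gen_qemu_st(int r) { (void)r; /* pretend to emit a guest store */ }

/* Hypothetical wrappers implementing Paolo's rule: every guest load is
 * followed by a LD_LD|LD_ST barrier, every guest store is preceded by a
 * LD_ST|ST_ST barrier. */
static void gen_ordered_ld(int r)
{
    tcg_gen_qemu_ld(r);
    tcg_gen_mb(TCG_MO_LD_LD | TCG_MO_LD_ST | TCG_BAR_SC);
}

static void gen_ordered_st(int r)
{
    tcg_gen_mb(TCG_MO_LD_ST | TCG_MO_ST_ST | TCG_BAR_SC);
    tcg_gen_qemu_st(r);
}
```

Note that neither wrapper orders a store against a *later* load: store-to-load reordering remains allowed, which matches x86's own model and is why mfence (TCG_MO_ALL) still carries information even under this scheme.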
On 21/06/2016 17:57, Richard Henderson wrote:
>>>                  || (prefixes & PREFIX_LOCK)) {
>>>                  goto illegal_op;
>>>              }
>>> +            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
>>>              break;
>>>          case 0xe8 ... 0xef: /* lfence */
>>> +            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
>>> +            break;
>>
>> These are unnecessary.  On the other hand, _each and every load_ must be
>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>> preceded by a LD_ST | ST_ST barrier.
>
> They're not unnecessary if we (1) add those barriers for normal loads
> and stores and (2) omit them from the non-temporal loads and stores.

When does TCG generate non-temporal loads and stores?

Paolo
On 06/21/2016 09:12 AM, Paolo Bonzini wrote:
> On 21/06/2016 17:57, Richard Henderson wrote:
>>> These are unnecessary.  On the other hand, _each and every load_ must be
>>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>>> preceded by a LD_ST | ST_ST barrier.
>>
>> They're not unnecessary if we (1) add those barriers for normal loads
>> and stores and (2) omit them from the non-temporal loads and stores.
>
> When does TCG generate non-temporal loads and stores?

I was talking about the guest non-temporal loads and stores.

r~
On 21/06/2016 18:23, Richard Henderson wrote:
> On 06/21/2016 09:12 AM, Paolo Bonzini wrote:
>> On 21/06/2016 17:57, Richard Henderson wrote:
>>>> These are unnecessary.  On the other hand, _each and every load_ must be
>>>> followed by a LD_LD | LD_ST barrier, and each and every store must be
>>>> preceded by a LD_ST | ST_ST barrier.
>>>
>>> They're not unnecessary if we (1) add those barriers for normal loads
>>> and stores and (2) omit them from the non-temporal loads and stores.
>>
>> When does TCG generate non-temporal loads and stores?
>
> I was talking about the guest non-temporal loads and stores.

Oh, you're right---sorry, I confused target-i386 and tcg/i386.

Thanks,

Paolo
On Tue, Jun 21, 2016 at 3:28 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> These are unnecessary.  On the other hand, _each and every load_ must be
> followed by a LD_LD | LD_ST barrier, and each and every store must be
> preceded by a LD_ST | ST_ST barrier.

Regarding the second point, I did consider the situation of running x86 on
ARM, where such barriers are necessary for correctness. But I am really
apprehensive of the cost they will impose. I am not sure if there are any
alternative solutions that avoid generating barriers for each memory
operation, but it would be great if we could reduce them.

Thanks,
On 21 June 2016 at 18:28, Pranith Kumar <bobby.prani@gmail.com> wrote:
> Regarding the second point, I did consider the situation of running x86 on
> ARM, where such barriers are necessary for correctness. But I am really
> apprehensive of the cost they will impose. I am not sure if there are any
> alternative solutions that avoid generating barriers for each memory
> operation, but it would be great if we could reduce them.

I vaguely recall an idea that you could avoid needing
explicit barriers by turning all the guest load/stores into
host load-acquire/store-release, but I have no idea whether
that's (a) actually true (b) any better than piles of
explicit barriers.

thanks
-- PMM
On Tue, Jun 21, 2016 at 1:54 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> I vaguely recall an idea that you could avoid needing
> explicit barriers by turning all the guest load/stores into
> host load-acquire/store-release, but I have no idea whether
> that's (a) actually true (b) any better than piles of
> explicit barriers.

Yes, this is true for ARMv8 (not sure about ia64). The
load-acquire/store-release operations are sequentially consistent with
each other. But this does not work for ARMv7, and, as you said, I think
the cost here too is really prohibitive.
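The mapping Peter and Pranith are discussing can be illustrated with C11 atomics (illustrative only, not QEMU code; the function and variable names here are invented for the example). On ARMv8, compilers lower `memory_order_acquire` loads and `memory_order_release` stores to the LDAR/STLR instructions this scheme would rely on:

```c
#include <stdatomic.h>
#include <pthread.h>

/* Message passing with acquire/release: the acquire load of "flag"
 * synchronizes with the release store, so "payload" is guaranteed
 * visible to the consumer without any explicit fence. */
static atomic_int flag;
static int payload;            /* plain data, ordered by the flag */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                           /* ordinary store */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* "store-release" */
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&flag, memory_order_acquire)) /* "load-acquire" */
        ;
    *(int *)arg = payload;     /* ordered after the acquire load */
    return NULL;
}

int run_message_passing(void)
{
    pthread_t p, c;
    int seen = 0;
    atomic_store(&flag, 0);
    payload = 0;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

Whether translating *every* guest access this way beats explicit barriers is exactly the open performance question in this thread; the sketch only shows the semantics being proposed.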
Pranith Kumar <bobby.prani@gmail.com> writes:
> Yes, this is true for ARMv8 (not sure about ia64). The
> load-acquire/store-release operations are sequentially consistent with
> each other. But this does not work for ARMv7, and, as you said, I think
> the cost here too is really prohibitive.

If the cost ends up being too prohibitive (as in -smp N runs slower and
slower) then we may just keep -accel tcg,thread=single the default for
that combination. We need some hard numbers either way.

--
Alex Bennée
On 21/06/16 21:03, Pranith Kumar wrote:
> Yes, this is true for ARMv8 (not sure about ia64). The
> load-acquire/store-release operations are sequentially consistent with
> each other. But this does not work for ARMv7, and, as you said, I think
> the cost here too is really prohibitive.

As I understand it, there's no requirement for sequential consistency even
on systems with a pretty strong memory model such as x86. Due to the
presence of the store queue, earlier regular stores are allowed to
complete after the following regular loads. This relaxation breaks the
sequential consistency requirement, if I understand correctly, since it
allows a CPU to see its own stores in a different order with respect to
other CPUs' stores.

However, such a model can perfectly match acquire/release semantics, even
as defined by the Itanium memory model. Let's break it down:

(1) a load-acquire must not be reordered with any subsequent loads and
    stores;
(2) a store-release must not be reordered with any preceding loads and
    stores;
(3) thus, if all loads are load-acquires and all stores are
    store-releases, the only possible reordering is a store-release
    reordered after a subsequent load-acquire.
Considering this, I think that strongly-ordered memory model semantics
(such as the x86 memory model) can be translated directly into relaxed
acquire/release memory model semantics (as in the Itanium memory model, or
the somewhat stronger ARMv8). And I believe this can perform better than
inserting separate memory barriers on those architectures that provide
acquire/release semantics, since it is more relaxed and permits certain
hardware optimizations such as store-after-load reordering.

Kind regards,
Sergey
diff --git a/target-i386/translate.c b/target-i386/translate.c
index bf33e6b..32b0f5c 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -8012,13 +8012,17 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 || (prefixes & PREFIX_LOCK)) {
                 goto illegal_op;
             }
+            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
             break;
         case 0xe8 ... 0xef: /* lfence */
+            tcg_gen_mb(TCG_MO_LD_LD | TCG_BAR_SC);
+            break;
         case 0xf0 ... 0xf7: /* mfence */
             if (!(s->cpuid_features & CPUID_SSE2)
                 || (prefixes & PREFIX_LOCK)) {
                 goto illegal_op;
             }
+            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
             break;

         default:
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 target-i386/translate.c | 4 ++++
 1 file changed, 4 insertions(+)