mbox series

[0/5] Add LoongArch v1.1 instructions

Message ID 20231023153029.269211-2-c@jia.je (mailing list archive)
Headers show
Series Add LoongArch v1.1 instructions | expand

Message

Jiajie Chen Oct. 23, 2023, 3:29 p.m. UTC
Latest revision of LoongArch ISA is out at
https://www.loongson.cn/uploads/images/2023102309132647981.%E9%BE%99%E8%8A%AF%E6%9E%B6%E6%9E%84%E5%8F%82%E8%80%83%E6%89%8B%E5%86%8C%E5%8D%B7%E4%B8%80_r1p10.pdf
(Chinese only). The revision includes the following updates:

- estimated fp reciporcal instructions: frecip -> frecipe, frsqrt ->
  frsqrte
- 128-bit width store-conditional instruction: sc.q
- ll.w/d with acquire semantic: llacq.w/d, sc.w/d with release semantic:
  screl.w/d
- compare and swap instructions: amcas[_db].b/w/h/d
- byte and word-wide amswap/add instructions: am{swap/add}[_db].{b/h}
- new definition for dbar hints
- clarify 32-bit division instruction hebavior
- clarify load ordering when accessing the same address
- introduce message signaled interrupt
- introduce hardware page table walker

The new revision is implemented in the to be released Loongson 3A6000
processor.

This patch series implements the new instructions except sc.q, because I
do not know how to match a pair of ll.d to sc.q.


Jiajie Chen (5):
  include/exec/memop.h: Add MO_TESB
  target/loongarch: Add am{swap/add}[_db].{b/h}
  target/loongarch: Add amcas[_db].{b/h/w/d}
  target/loongarch: Add estimated reciprocal instructions
  target/loongarch: Add llacq/screl instructions

 include/exec/memop.h                          |  1 +
 target/loongarch/cpu.h                        |  4 ++
 target/loongarch/disas.c                      | 32 ++++++++++++
 .../loongarch/insn_trans/trans_atomic.c.inc   | 52 +++++++++++++++++++
 .../loongarch/insn_trans/trans_farith.c.inc   |  4 ++
 target/loongarch/insn_trans/trans_vec.c.inc   |  8 +++
 target/loongarch/insns.decode                 | 32 ++++++++++++
 target/loongarch/translate.h                  | 27 +++++++---
 8 files changed, 152 insertions(+), 8 deletions(-)

Comments

Richard Henderson Oct. 23, 2023, 11:26 p.m. UTC | #1
On 10/23/23 08:29, Jiajie Chen wrote:
> This patch series implements the new instructions except sc.q, because I do not know how 
> to match a pair of ll.d to sc.q.

There are a couple of examples within the tree.

See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
See target/ppc/translate.c, gen_stqcx_.


r~
Jiajie Chen Oct. 24, 2023, 6:10 a.m. UTC | #2
On 2023/10/24 07:26, Richard Henderson wrote:
> On 10/23/23 08:29, Jiajie Chen wrote:
>> This patch series implements the new instructions except sc.q, 
>> because I do not know how to match a pair of ll.d to sc.q.
>
> There are a couple of examples within the tree.
>
> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
> See target/ppc/translate.c, gen_stqcx_.


The situation here is slightly different: aarch64 and ppc64 have both 
128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll and 128-bit 
sc. I guest the intended usage of sc.q is:


ll.d lo, base, 0

ll.d hi, base, 4

# do some computation

sc.q lo, hi, base

# try again if sc failed



>
>
> r~
Jiajie Chen Oct. 25, 2023, 5:13 p.m. UTC | #3
On 2023/10/24 14:10, Jiajie Chen wrote:
>
> On 2023/10/24 07:26, Richard Henderson wrote:
>> On 10/23/23 08:29, Jiajie Chen wrote:
>>> This patch series implements the new instructions except sc.q, 
>>> because I do not know how to match a pair of ll.d to sc.q.
>>
>> There are a couple of examples within the tree.
>>
>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 
>> block.
>> See target/ppc/translate.c, gen_stqcx_.
>
>
> The situation here is slightly different: aarch64 and ppc64 have both 
> 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll and 
> 128-bit sc. I guest the intended usage of sc.q is:
>
>
> ll.d lo, base, 0
>
> ll.d hi, base, 4
>
> # do some computation
>
> sc.q lo, hi, base
>
> # try again if sc failed


Possibly use the combination of ll.d and ld.d:


ll.d lo, base, 0

ld.d hi, base, 4

# do some computation

sc.q lo, hi, base

# try again if sc failed


Then a possible implementation of gen_ll() would be: align base to 
128-bit boundary, read 128-bit from memory, save 64-bit part to rd and 
record whole 128-bit data in llval. Then, in gen_sc_q(), it uses a 
128-bit cmpxchg.


But what about the reversed instruction pattern: ll.d hi, base, 4; ld.d 
lo, base 0?


Since there are no existing code utilizing the new sc.q instruction, I 
don't know what should we consider here.


>
>
>
>>
>>
>> r~
Richard Henderson Oct. 25, 2023, 7:04 p.m. UTC | #4
On 10/25/23 10:13, Jiajie Chen wrote:
>> On 2023/10/24 07:26, Richard Henderson wrote:
>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
>>> See target/ppc/translate.c, gen_stqcx_.
>>
>> The situation here is slightly different: aarch64 and ppc64 have both 128-bit ll and sc, 
>> however LoongArch v1.1 only has 64-bit ll and 128-bit sc.

Ah, that does complicate things.

> Possibly use the combination of ll.d and ld.d:
> 
> 
> ll.d lo, base, 0
> ld.d hi, base, 4
> 
> # do some computation
> 
> sc.q lo, hi, base
> 
> # try again if sc failed
> 
> Then a possible implementation of gen_ll() would be: align base to 128-bit boundary, read 
> 128-bit from memory, save 64-bit part to rd and record whole 128-bit data in llval. Then, 
> in gen_sc_q(), it uses a 128-bit cmpxchg.
> 
> 
> But what about the reversed instruction pattern: ll.d hi, base, 4; ld.d lo, base 0?

It would be worth asking your hardware engineers about the bounds of legal behaviour. 
Ideally there would be some very explicit language, similar to

https://developer.arm.com/documentation/ddi0487/latest/
B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions

But you could do the same thing, aligning and recording the entire 128-bit quantity, then 
extract the ll.d result based on address bit 6.  This would complicate the implementation 
of sc.d as well, but would perhaps bring us "close enough" to the actual architecture.

Note that our Arm store-exclusive implementation isn't quite in spec either.  There is 
quite a large comment within translate-a64.c store_exclusive() about the ways things are 
not quite right.  But it seems to be close enough for actual usage to succeed.


r~
Jiajie Chen Oct. 26, 2023, 1:38 a.m. UTC | #5
On 2023/10/26 03:04, Richard Henderson wrote:
> On 10/25/23 10:13, Jiajie Chen wrote:
>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 
>>>> block.
>>>> See target/ppc/translate.c, gen_stqcx_.
>>>
>>> The situation here is slightly different: aarch64 and ppc64 have 
>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll 
>>> and 128-bit sc.
>
> Ah, that does complicate things.
>
>> Possibly use the combination of ll.d and ld.d:
>>
>>
>> ll.d lo, base, 0
>> ld.d hi, base, 4
>>
>> # do some computation
>>
>> sc.q lo, hi, base
>>
>> # try again if sc failed
>>
>> Then a possible implementation of gen_ll() would be: align base to 
>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd 
>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it uses 
>> a 128-bit cmpxchg.
>>
>>
>> But what about the reversed instruction pattern: ll.d hi, base, 4; 
>> ld.d lo, base 0?
>
> It would be worth asking your hardware engineers about the bounds of 
> legal behaviour. Ideally there would be some very explicit language, 
> similar to


I'm a community developer not affiliated with Loongson. Song Gao, could 
you provide some detail from Loongson Inc.?


>
> https://developer.arm.com/documentation/ddi0487/latest/
> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
>
> But you could do the same thing, aligning and recording the entire 
> 128-bit quantity, then extract the ll.d result based on address bit 
> 6.  This would complicate the implementation of sc.d as well, but 
> would perhaps bring us "close enough" to the actual architecture.
>
> Note that our Arm store-exclusive implementation isn't quite in spec 
> either.  There is quite a large comment within translate-a64.c 
> store_exclusive() about the ways things are not quite right.  But it 
> seems to be close enough for actual usage to succeed.
>
>
> r~
gaosong Oct. 26, 2023, 6:54 a.m. UTC | #6
在 2023/10/26 上午9:38, Jiajie Chen 写道:
>
> On 2023/10/26 03:04, Richard Henderson wrote:
>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 
>>>>> block.
>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>
>>>> The situation here is slightly different: aarch64 and ppc64 have 
>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll 
>>>> and 128-bit sc.
>>
>> Ah, that does complicate things.
>>
>>> Possibly use the combination of ll.d and ld.d:
>>>
>>>
>>> ll.d lo, base, 0
>>> ld.d hi, base, 4
>>>
>>> # do some computation
>>>
>>> sc.q lo, hi, base
>>>
>>> # try again if sc failed
>>>
>>> Then a possible implementation of gen_ll() would be: align base to 
>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd 
>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it uses 
>>> a 128-bit cmpxchg.
>>>
>>>
>>> But what about the reversed instruction pattern: ll.d hi, base, 4; 
>>> ld.d lo, base 0?
>>
>> It would be worth asking your hardware engineers about the bounds of 
>> legal behaviour. Ideally there would be some very explicit language, 
>> similar to
>
>
> I'm a community developer not affiliated with Loongson. Song Gao, 
> could you provide some detail from Loongson Inc.?
>
>

ll.d   r1, base, 0
dbar 0x700          ==> see 2.2.8.1
ld.d  r2, base,  8
...
sc.q r1, r2, base


For this series,
I think we need set the new config bits to the 'max cpu', and change 
linux-user/target_elf.h ''any' to 'max', so that we can use these new 
instructions on linux-user mode.

Thanks
Song Gao
>>
>> https://developer.arm.com/documentation/ddi0487/latest/
>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
>>
>> But you could do the same thing, aligning and recording the entire 
>> 128-bit quantity, then extract the ll.d result based on address bit 
>> 6.  This would complicate the implementation of sc.d as well, but 
>> would perhaps bring us "close enough" to the actual architecture.
>>
>> Note that our Arm store-exclusive implementation isn't quite in spec 
>> either.  There is quite a large comment within translate-a64.c 
>> store_exclusive() about the ways things are not quite right.  But it 
>> seems to be close enough for actual usage to succeed.
>>
>>
>> r~
Jiajie Chen Oct. 28, 2023, 1:09 p.m. UTC | #7
On 2023/10/26 14:54, gaosong wrote:
> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>
>> On 2023/10/26 03:04, Richard Henderson wrote:
>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>> TCGv_i128 block.
>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>
>>>>> The situation here is slightly different: aarch64 and ppc64 have 
>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll 
>>>>> and 128-bit sc.
>>>
>>> Ah, that does complicate things.
>>>
>>>> Possibly use the combination of ll.d and ld.d:
>>>>
>>>>
>>>> ll.d lo, base, 0
>>>> ld.d hi, base, 4
>>>>
>>>> # do some computation
>>>>
>>>> sc.q lo, hi, base
>>>>
>>>> # try again if sc failed
>>>>
>>>> Then a possible implementation of gen_ll() would be: align base to 
>>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd 
>>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it 
>>>> uses a 128-bit cmpxchg.
>>>>
>>>>
>>>> But what about the reversed instruction pattern: ll.d hi, base, 4; 
>>>> ld.d lo, base 0?
>>>
>>> It would be worth asking your hardware engineers about the bounds of 
>>> legal behaviour. Ideally there would be some very explicit language, 
>>> similar to
>>
>>
>> I'm a community developer not affiliated with Loongson. Song Gao, 
>> could you provide some detail from Loongson Inc.?
>>
>>
>
> ll.d   r1, base, 0
> dbar 0x700          ==> see 2.2.8.1
> ld.d  r2, base,  8
> ...
> sc.q r1, r2, base


Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and 
translate the sequence into one tcg_gen_qemu_ld_i128 and split the 
result into two 64-bit parts. Can do this in QEMU?


>
>
> For this series,
> I think we need set the new config bits to the 'max cpu', and change 
> linux-user/target_elf.h ''any' to 'max', so that we can use these new 
> instructions on linux-user mode.

I will work on it.


>
> Thanks
> Song Gao
>>>
>>> https://developer.arm.com/documentation/ddi0487/latest/
>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>> restrictions
>>>
>>> But you could do the same thing, aligning and recording the entire 
>>> 128-bit quantity, then extract the ll.d result based on address bit 
>>> 6.  This would complicate the implementation of sc.d as well, but 
>>> would perhaps bring us "close enough" to the actual architecture.
>>>
>>> Note that our Arm store-exclusive implementation isn't quite in spec 
>>> either.  There is quite a large comment within translate-a64.c 
>>> store_exclusive() about the ways things are not quite right.  But it 
>>> seems to be close enough for actual usage to succeed.
>>>
>>>
>>> r~
>
gaosong Oct. 30, 2023, 8:23 a.m. UTC | #8
在 2023/10/28 下午9:09, Jiajie Chen 写道:
>
> On 2023/10/26 14:54, gaosong wrote:
>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>
>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>> TCGv_i128 block.
>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>
>>>>>> The situation here is slightly different: aarch64 and ppc64 have 
>>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll 
>>>>>> and 128-bit sc.
>>>>
>>>> Ah, that does complicate things.
>>>>
>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>
>>>>>
>>>>> ll.d lo, base, 0
>>>>> ld.d hi, base, 4
>>>>>
>>>>> # do some computation
>>>>>
>>>>> sc.q lo, hi, base
>>>>>
>>>>> # try again if sc failed
>>>>>
>>>>> Then a possible implementation of gen_ll() would be: align base to 
>>>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd 
>>>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it 
>>>>> uses a 128-bit cmpxchg.
>>>>>
>>>>>
>>>>> But what about the reversed instruction pattern: ll.d hi, base, 4; 
>>>>> ld.d lo, base 0?
>>>>
>>>> It would be worth asking your hardware engineers about the bounds 
>>>> of legal behaviour. Ideally there would be some very explicit 
>>>> language, similar to
>>>
>>>
>>> I'm a community developer not affiliated with Loongson. Song Gao, 
>>> could you provide some detail from Loongson Inc.?
>>>
>>>
>>
>> ll.d   r1, base, 0
>> dbar 0x700          ==> see 2.2.8.1
>> ld.d  r2, base,  8
>> ...
>> sc.q r1, r2, base
>
>
> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and 
> translate the sequence into one tcg_gen_qemu_ld_i128 and split the 
> result into two 64-bit parts. Can do this in QEMU?
>
>
Oh, I'm not sure.

I think we just need to implement sc.q. We don't need to care about 
'll.d-dbar-ld.d'. It's just like 'll.q'.
It needs the user to ensure that .

ll.q' is
1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
2) dbar 0x700 
3) ld.d r2 base, 8 ==> load the high 64 bits to r2

sc.q needs to
1) Use 64-bit cmpxchg.
2) Write 128 bits to memory.

Thanks.
Song Gao
>>
>>
>> For this series,
>> I think we need set the new config bits to the 'max cpu', and change 
>> linux-user/target_elf.h ''any' to 'max', so that we can use these new 
>> instructions on linux-user mode.
>
> I will work on it.
>
>
>>
>> Thanks
>> Song Gao
>>>>
>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>> restrictions
>>>>
>>>> But you could do the same thing, aligning and recording the entire 
>>>> 128-bit quantity, then extract the ll.d result based on address bit 
>>>> 6.  This would complicate the implementation of sc.d as well, but 
>>>> would perhaps bring us "close enough" to the actual architecture.
>>>>
>>>> Note that our Arm store-exclusive implementation isn't quite in 
>>>> spec either.  There is quite a large comment within translate-a64.c 
>>>> store_exclusive() about the ways things are not quite right.  But 
>>>> it seems to be close enough for actual usage to succeed.
>>>>
>>>>
>>>> r~
>>
Jiajie Chen Oct. 30, 2023, 11:54 a.m. UTC | #9
On 2023/10/30 16:23, gaosong wrote:
> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>
>> On 2023/10/26 14:54, gaosong wrote:
>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>> TCGv_i128 block.
>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>
>>>>>>> The situation here is slightly different: aarch64 and ppc64 have 
>>>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit 
>>>>>>> ll and 128-bit sc.
>>>>>
>>>>> Ah, that does complicate things.
>>>>>
>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>
>>>>>>
>>>>>> ll.d lo, base, 0
>>>>>> ld.d hi, base, 4
>>>>>>
>>>>>> # do some computation
>>>>>>
>>>>>> sc.q lo, hi, base
>>>>>>
>>>>>> # try again if sc failed
>>>>>>
>>>>>> Then a possible implementation of gen_ll() would be: align base 
>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part 
>>>>>> to rd and record whole 128-bit data in llval. Then, in 
>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>
>>>>>>
>>>>>> But what about the reversed instruction pattern: ll.d hi, base, 
>>>>>> 4; ld.d lo, base 0?
>>>>>
>>>>> It would be worth asking your hardware engineers about the bounds 
>>>>> of legal behaviour. Ideally there would be some very explicit 
>>>>> language, similar to
>>>>
>>>>
>>>> I'm a community developer not affiliated with Loongson. Song Gao, 
>>>> could you provide some detail from Loongson Inc.?
>>>>
>>>>
>>>
>>> ll.d   r1, base, 0
>>> dbar 0x700          ==> see 2.2.8.1
>>> ld.d  r2, base,  8
>>> ...
>>> sc.q r1, r2, base
>>
>>
>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and 
>> translate the sequence into one tcg_gen_qemu_ld_i128 and split the 
>> result into two 64-bit parts. Can do this in QEMU?
>>
>>
> Oh, I'm not sure.
>
> I think we just need to implement sc.q. We don't need to care about 
> 'll.d-dbar-ld.d'. It's just like 'll.q'.
> It needs the user to ensure that .
>
> ll.q' is
> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
> 2) dbar 0x700 
> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>
> sc.q needs to
> 1) Use 64-bit cmpxchg.
> 2) Write 128 bits to memory.

Consider the following code:


ll.d r1, base, 0

dbar 0x700

ld.d r2, base, 8

addi.d r2, r2, 1

sc.q r1, r2, base


We translate them into native code:


ld.d r1, base, 0

mv LLbit, 1

mv LLaddr, base

mv LLval, r1

dbar 0x700

ld.d r2, base, 8

addi.d r2, r2, 1

if (LLbit == 1 && LLaddr == base) {

     cmpxchg addr=base compare=LLval new=r1

     128-bit write {r2, r1} to base if cmpxchg succeeded

}

set r1 if sc.q succeeded



If the memory content of base+8 has changed between ld.d r2 and addi.d 
r2, the atomicity is not guaranteed, i.e. only the high part has 
changed, the low part hasn't.



>
> Thanks.
> Song Gao
>>>
>>>
>>> For this series,
>>> I think we need set the new config bits to the 'max cpu', and change 
>>> linux-user/target_elf.h ''any' to 'max', so that we can use these 
>>> new instructions on linux-user mode.
>>
>> I will work on it.
>>
>>
>>>
>>> Thanks
>>> Song Gao
>>>>>
>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>> restrictions
>>>>>
>>>>> But you could do the same thing, aligning and recording the entire 
>>>>> 128-bit quantity, then extract the ll.d result based on address 
>>>>> bit 6.  This would complicate the implementation of sc.d as well, 
>>>>> but would perhaps bring us "close enough" to the actual architecture.
>>>>>
>>>>> Note that our Arm store-exclusive implementation isn't quite in 
>>>>> spec either.  There is quite a large comment within 
>>>>> translate-a64.c store_exclusive() about the ways things are not 
>>>>> quite right.  But it seems to be close enough for actual usage to 
>>>>> succeed.
>>>>>
>>>>>
>>>>> r~
>>>
>
gaosong Oct. 31, 2023, 9:11 a.m. UTC | #10
在 2023/10/30 下午7:54, Jiajie Chen 写道:
>
> On 2023/10/30 16:23, gaosong wrote:
>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>
>>> On 2023/10/26 14:54, gaosong wrote:
>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>>> TCGv_i128 block.
>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>
>>>>>>>> The situation here is slightly different: aarch64 and ppc64 
>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has 
>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>
>>>>>> Ah, that does complicate things.
>>>>>>
>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>
>>>>>>>
>>>>>>> ll.d lo, base, 0
>>>>>>> ld.d hi, base, 4
>>>>>>>
>>>>>>> # do some computation
>>>>>>>
>>>>>>> sc.q lo, hi, base
>>>>>>>
>>>>>>> # try again if sc failed
>>>>>>>
>>>>>>> Then a possible implementation of gen_ll() would be: align base 
>>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part 
>>>>>>> to rd and record whole 128-bit data in llval. Then, in 
>>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>
>>>>>>>
>>>>>>> But what about the reversed instruction pattern: ll.d hi, base, 
>>>>>>> 4; ld.d lo, base 0?
>>>>>>
>>>>>> It would be worth asking your hardware engineers about the bounds 
>>>>>> of legal behaviour. Ideally there would be some very explicit 
>>>>>> language, similar to
>>>>>
>>>>>
>>>>> I'm a community developer not affiliated with Loongson. Song Gao, 
>>>>> could you provide some detail from Loongson Inc.?
>>>>>
>>>>>
>>>>
>>>> ll.d   r1, base, 0
>>>> dbar 0x700          ==> see 2.2.8.1
>>>> ld.d  r2, base,  8
>>>> ...
>>>> sc.q r1, r2, base
>>>
>>>
>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence 
>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split 
>>> the result into two 64-bit parts. Can do this in QEMU?
>>>
>>>
>> Oh, I'm not sure.
>>
>> I think we just need to implement sc.q. We don't need to care about 
>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>> It needs the user to ensure that .
>>
>> ll.q' is
>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>> 2) dbar 0x700 
>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>
>> sc.q needs to
>> 1) Use 64-bit cmpxchg.
>> 2) Write 128 bits to memory.
>
> Consider the following code:
>
>
> ll.d r1, base, 0
>
> dbar 0x700
>
> ld.d r2, base, 8
>
> addi.d r2, r2, 1
>
> sc.q r1, r2, base
>
>
> We translate them into native code:
>
>
> ld.d r1, base, 0
>
> mv LLbit, 1
>
> mv LLaddr, base
>
> mv LLval, r1
>
> dbar 0x700
>
> ld.d r2, base, 8
>
> addi.d r2, r2, 1
>
> if (LLbit == 1 && LLaddr == base) {
>
>     cmpxchg addr=base compare=LLval new=r1
>
>     128-bit write {r2, r1} to base if cmpxchg succeeded
>
> }
>
> set r1 if sc.q succeeded
>
>
>
> If the memory content of base+8 has changed between ld.d r2 and addi.d 
> r2, the atomicity is not guaranteed, i.e. only the high part has 
> changed, the low part hasn't.
>
>
Sorry,  my mistake.  need use cmpxchg_i128.   See 
target/arm/tcg/translate-a64.c   gen_store_exclusive().

gen_scq(rd, rk, rj)
{
      ...
     TCGv_i128 t16 = tcg_temp_new_i128();
     TCGv_i128 c16 = tcg_temp_new_i128();
     TCGv_i64 low = tcg_temp_new_i64();
     TCGv_i64 high= tcg_temp_new_i64();
     TCGv_i64 temp = tcg_temp_new_i64();

     tcg_gen_concat_i64_i128(t16, cpu_gpr[rd],  cpu_gpr[rk]));

     tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx,  MO_TEUQ);
     tcg_gen_addi_tl(temp, cpu_lladdr, 8);
     tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
     tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
     tcg_gen_concat_i64_i128(c16, low,  high);

     tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, 
ctx->mem_idx, MO_128);

     ...
}

I am not sure this is right.

I think Richard can give you more suggestions. @Richard

Thanks.
Song Gao
>
>> Thanks.
>> Song Gao
>>>>
>>>>
>>>> For this series,
>>>> I think we need set the new config bits to the 'max cpu', and 
>>>> change linux-user/target_elf.h ''any' to 'max', so that we can use 
>>>> these new instructions on linux-user mode.
>>>
>>> I will work on it.
>>>
>>>
>>>>
>>>> Thanks
>>>> Song Gao
>>>>>>
>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>>> restrictions
>>>>>>
>>>>>> But you could do the same thing, aligning and recording the 
>>>>>> entire 128-bit quantity, then extract the ll.d result based on 
>>>>>> address bit 6.  This would complicate the implementation of sc.d 
>>>>>> as well, but would perhaps bring us "close enough" to the actual 
>>>>>> architecture.
>>>>>>
>>>>>> Note that our Arm store-exclusive implementation isn't quite in 
>>>>>> spec either.  There is quite a large comment within 
>>>>>> translate-a64.c store_exclusive() about the ways things are not 
>>>>>> quite right.  But it seems to be close enough for actual usage to 
>>>>>> succeed.
>>>>>>
>>>>>>
>>>>>> r~
>>>>
>>
Jiajie Chen Oct. 31, 2023, 9:13 a.m. UTC | #11
On 2023/10/31 17:11, gaosong wrote:
> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>
>> On 2023/10/30 16:23, gaosong wrote:
>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>
>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>
>>>>>>>>> The situation here is slightly different: aarch64 and ppc64 
>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has 
>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>
>>>>>>> Ah, that does complicate things.
>>>>>>>
>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>
>>>>>>>>
>>>>>>>> ll.d lo, base, 0
>>>>>>>> ld.d hi, base, 4
>>>>>>>>
>>>>>>>> # do some computation
>>>>>>>>
>>>>>>>> sc.q lo, hi, base
>>>>>>>>
>>>>>>>> # try again if sc failed
>>>>>>>>
>>>>>>>> Then a possible implementation of gen_ll() would be: align base 
>>>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part 
>>>>>>>> to rd and record whole 128-bit data in llval. Then, in 
>>>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>
>>>>>>>>
>>>>>>>> But what about the reversed instruction pattern: ll.d hi, base, 
>>>>>>>> 4; ld.d lo, base 0?
>>>>>>>
>>>>>>> It would be worth asking your hardware engineers about the 
>>>>>>> bounds of legal behaviour. Ideally there would be some very 
>>>>>>> explicit language, similar to
>>>>>>
>>>>>>
>>>>>> I'm a community developer not affiliated with Loongson. Song Gao, 
>>>>>> could you provide some detail from Loongson Inc.?
>>>>>>
>>>>>>
>>>>>
>>>>> ll.d   r1, base, 0
>>>>> dbar 0x700          ==> see 2.2.8.1
>>>>> ld.d  r2, base,  8
>>>>> ...
>>>>> sc.q r1, r2, base
>>>>
>>>>
>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence 
>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split 
>>>> the result into two 64-bit parts. Can do this in QEMU?
>>>>
>>>>
>>> Oh, I'm not sure.
>>>
>>> I think we just need to implement sc.q. We don't need to care about 
>>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>> It needs the user to ensure that .
>>>
>>> ll.q' is
>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>> 2) dbar 0x700 
>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>
>>> sc.q needs to
>>> 1) Use 64-bit cmpxchg.
>>> 2) Write 128 bits to memory.
>>
>> Consider the following code:
>>
>>
>> ll.d r1, base, 0
>>
>> dbar 0x700
>>
>> ld.d r2, base, 8
>>
>> addi.d r2, r2, 1
>>
>> sc.q r1, r2, base
>>
>>
>> We translate them into native code:
>>
>>
>> ld.d r1, base, 0
>>
>> mv LLbit, 1
>>
>> mv LLaddr, base
>>
>> mv LLval, r1
>>
>> dbar 0x700
>>
>> ld.d r2, base, 8
>>
>> addi.d r2, r2, 1
>>
>> if (LLbit == 1 && LLaddr == base) {
>>
>>     cmpxchg addr=base compare=LLval new=r1
>>
>>     128-bit write {r2, r1} to base if cmpxchg succeeded
>>
>> }
>>
>> set r1 if sc.q succeeded
>>
>>
>>
>> If the memory content of base+8 has changed between ld.d r2 and 
>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part 
>> has changed, the low part hasn't.
>>
>>
> Sorry,  my mistake.  need use cmpxchg_i128.   See 
> target/arm/tcg/translate-a64.c   gen_store_exclusive().
>
> gen_scq(rd, rk, rj)
> {
>      ...
>     TCGv_i128 t16 = tcg_temp_new_i128();
>     TCGv_i128 c16 = tcg_temp_new_i128();
>     TCGv_i64 low = tcg_temp_new_i64();
>     TCGv_i64 high= tcg_temp_new_i64();
>     TCGv_i64 temp = tcg_temp_new_i64();
>
>     tcg_gen_concat_i64_i128(t16, cpu_gpr[rd],  cpu_gpr[rk]));
>
>     tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx,  MO_TEUQ);
>     tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>     tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>     tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);


The problem is that, the high value read here might not equal to the 
previously read one in ll.d r2, base 8 instruction.


> tcg_gen_concat_i64_i128(c16, low,  high);
>
>     tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, 
> ctx->mem_idx, MO_128);
>
>     ...
> }
>
> I am not sure this is right.
>
> I think Richard can give you more suggestions. @Richard
>
> Thanks.
> Song Gao
>>
>>> Thanks.
>>> Song Gao
>>>>>
>>>>>
>>>>> For this series,
>>>>> I think we need set the new config bits to the 'max cpu', and 
>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can use 
>>>>> these new instructions on linux-user mode.
>>>>
>>>> I will work on it.
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>> Song Gao
>>>>>>>
>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>>>> restrictions
>>>>>>>
>>>>>>> But you could do the same thing, aligning and recording the 
>>>>>>> entire 128-bit quantity, then extract the ll.d result based on 
>>>>>>> address bit 6.  This would complicate the implementation of sc.d 
>>>>>>> as well, but would perhaps bring us "close enough" to the actual 
>>>>>>> architecture.
>>>>>>>
>>>>>>> Note that our Arm store-exclusive implementation isn't quite in 
>>>>>>> spec either.  There is quite a large comment within 
>>>>>>> translate-a64.c store_exclusive() about the ways things are not 
>>>>>>> quite right.  But it seems to be close enough for actual usage 
>>>>>>> to succeed.
>>>>>>>
>>>>>>>
>>>>>>> r~
>>>>>
>>>
>
gaosong Oct. 31, 2023, 11:06 a.m. UTC | #12
在 2023/10/31 下午5:13, Jiajie Chen 写道:
>
> On 2023/10/31 17:11, gaosong wrote:
>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>
>>> On 2023/10/30 16:23, gaosong wrote:
>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>
>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>
>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64 
>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has 
>>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>>
>>>>>>>> Ah, that does complicate things.
>>>>>>>>
>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ll.d lo, base, 0
>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>
>>>>>>>>> # do some computation
>>>>>>>>>
>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>
>>>>>>>>> # try again if sc failed
>>>>>>>>>
>>>>>>>>> Then a possible implementation of gen_ll() would be: align 
>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save 
>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval. 
>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But what about the reversed instruction pattern: ll.d hi, 
>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>
>>>>>>>> It would be worth asking your hardware engineers about the 
>>>>>>>> bounds of legal behaviour. Ideally there would be some very 
>>>>>>>> explicit language, similar to
>>>>>>>
>>>>>>>
>>>>>>> I'm a community developer not affiliated with Loongson. Song 
>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ll.d   r1, base, 0
>>>>>> dbar 0x700          ==> see 2.2.8.1
>>>>>> ld.d  r2, base,  8
>>>>>> ...
>>>>>> sc.q r1, r2, base
>>>>>
>>>>>
>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence 
>>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split 
>>>>> the result into two 64-bit parts. Can do this in QEMU?
>>>>>
>>>>>
>>>> Oh, I'm not sure.
>>>>
>>>> I think we just need to implement sc.q. We don't need to care about 
>>>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>> It needs the user to ensure that .
>>>>
>>>> ll.q' is
>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>> 2) dbar 0x700 
>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>
>>>> sc.q needs to
>>>> 1) Use 64-bit cmpxchg.
>>>> 2) Write 128 bits to memory.
>>>
>>> Consider the following code:
>>>
>>>
>>> ll.d r1, base, 0
>>>
>>> dbar 0x700
>>>
>>> ld.d r2, base, 8
>>>
>>> addi.d r2, r2, 1
>>>
>>> sc.q r1, r2, base
>>>
>>>
>>> We translate them into native code:
>>>
>>>
>>> ld.d r1, base, 0
>>>
>>> mv LLbit, 1
>>>
>>> mv LLaddr, base
>>>
>>> mv LLval, r1
>>>
>>> dbar 0x700
>>>
>>> ld.d r2, base, 8
>>>
>>> addi.d r2, r2, 1
>>>
>>> if (LLbit == 1 && LLaddr == base) {
>>>
>>>     cmpxchg addr=base compare=LLval new=r1
>>>
>>>     128-bit write {r2, r1} to base if cmpxchg succeeded
>>>
>>> }
>>>
>>> set r1 if sc.q succeeded
>>>
>>>
>>>
>>> If the memory content of base+8 has changed between ld.d r2 and 
>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part 
>>> has changed, the low part hasn't.
>>>
>>>
>> Sorry,  my mistake.  need use cmpxchg_i128.   See 
>> target/arm/tcg/translate-a64.c   gen_store_exclusive().
>>
>> gen_scq(rd, rk, rj)
>> {
>>      ...
>>     TCGv_i128 t16 = tcg_temp_new_i128();
>>     TCGv_i128 c16 = tcg_temp_new_i128();
>>     TCGv_i64 low = tcg_temp_new_i64();
>>     TCGv_i64 high= tcg_temp_new_i64();
>>     TCGv_i64 temp = tcg_temp_new_i64();
>>
>>     tcg_gen_concat_i64_i128(t16, cpu_gpr[rd],  cpu_gpr[rk]));
>>
>>     tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx,  MO_TEUQ);
>>     tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>>     tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>>     tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>
>
> The problem is that, the high value read here might not equal to the 
> previously read one in ll.d r2, base 8 instruction.
I think dbar 0x7000 ensures that the 2 loads in 'll.q' are a 128bit 
atomic operation.

Thanks.
Song Gao
>> tcg_gen_concat_i64_i128(c16, low,  high);
>>
>>     tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, 
>> ctx->mem_idx, MO_128);
>>
>>     ...
>> }
>>
>> I am not sure this is right.
>>
>> I think Richard can give you more suggestions. @Richard
>>
>> Thanks.
>> Song Gao
>>>
>>>> Thanks.
>>>> Song Gao
>>>>>>
>>>>>>
>>>>>> For this series,
>>>>>> I think we need set the new config bits to the 'max cpu', and 
>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can 
>>>>>> use these new instructions on linux-user mode.
>>>>>
>>>>> I will work on it.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Song Gao
>>>>>>>>
>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>>>>> restrictions
>>>>>>>>
>>>>>>>> But you could do the same thing, aligning and recording the 
>>>>>>>> entire 128-bit quantity, then extract the ll.d result based on 
>>>>>>>> address bit 6.  This would complicate the implementation of 
>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to the 
>>>>>>>> actual architecture.
>>>>>>>>
>>>>>>>> Note that our Arm store-exclusive implementation isn't quite in 
>>>>>>>> spec either.  There is quite a large comment within 
>>>>>>>> translate-a64.c store_exclusive() about the ways things are not 
>>>>>>>> quite right.  But it seems to be close enough for actual usage 
>>>>>>>> to succeed.
>>>>>>>>
>>>>>>>>
>>>>>>>> r~
>>>>>>
>>>>
>>
Jiajie Chen Oct. 31, 2023, 11:10 a.m. UTC | #13
On 2023/10/31 19:06, gaosong wrote:
> 在 2023/10/31 下午5:13, Jiajie Chen 写道:
>>
>> On 2023/10/31 17:11, gaosong wrote:
>>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/30 16:23, gaosong wrote:
>>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>>
>>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>>
>>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>>
>>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64 
>>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has 
>>>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>>>
>>>>>>>>> Ah, that does complicate things.
>>>>>>>>>
>>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ll.d lo, base, 0
>>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>>
>>>>>>>>>> # do some computation
>>>>>>>>>>
>>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>>
>>>>>>>>>> # try again if sc failed
>>>>>>>>>>
>>>>>>>>>> Then a possible implementation of gen_ll() would be: align 
>>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save 
>>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval. 
>>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But what about the reversed instruction pattern: ll.d hi, 
>>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>>
>>>>>>>>> It would be worth asking your hardware engineers about the 
>>>>>>>>> bounds of legal behaviour. Ideally there would be some very 
>>>>>>>>> explicit language, similar to
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm a community developer not affiliated with Loongson. Song 
>>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ll.d   r1, base, 0
>>>>>>> dbar 0x700          ==> see 2.2.8.1
>>>>>>> ld.d  r2, base,  8
>>>>>>> ...
>>>>>>> sc.q r1, r2, base
>>>>>>
>>>>>>
>>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence 
>>>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and 
>>>>>> split the result into two 64-bit parts. Can do this in QEMU?
>>>>>>
>>>>>>
>>>>> Oh, I'm not sure.
>>>>>
>>>>> I think we just need to implement sc.q. We don't need to care 
>>>>> about 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>>> It needs the user to ensure that .
>>>>>
>>>>> ll.q' is
>>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>>> 2) dbar 0x700 
>>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>>
>>>>> sc.q needs to
>>>>> 1) Use 64-bit cmpxchg.
>>>>> 2) Write 128 bits to memory.
>>>>
>>>> Consider the following code:
>>>>
>>>>
>>>> ll.d r1, base, 0
>>>>
>>>> dbar 0x700
>>>>
>>>> ld.d r2, base, 8
>>>>
>>>> addi.d r2, r2, 1
>>>>
>>>> sc.q r1, r2, base
>>>>
>>>>
>>>> We translate them into native code:
>>>>
>>>>
>>>> ld.d r1, base, 0
>>>>
>>>> mv LLbit, 1
>>>>
>>>> mv LLaddr, base
>>>>
>>>> mv LLval, r1
>>>>
>>>> dbar 0x700
>>>>
>>>> ld.d r2, base, 8
>>>>
>>>> addi.d r2, r2, 1
>>>>
>>>> if (LLbit == 1 && LLaddr == base) {
>>>>
>>>>     cmpxchg addr=base compare=LLval new=r1
>>>>
>>>>     128-bit write {r2, r1} to base if cmpxchg succeeded
>>>>
>>>> }
>>>>
>>>> set r1 if sc.q succeeded
>>>>
>>>>
>>>>
>>>> If the memory content of base+8 has changed between ld.d r2 and 
>>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part 
>>>> has changed, the low part hasn't.
>>>>
>>>>
>>> Sorry,  my mistake.  need use cmpxchg_i128.   See 
>>> target/arm/tcg/translate-a64.c   gen_store_exclusive().
>>>
>>> gen_scq(rd, rk, rj)
>>> {
>>>      ...
>>>     TCGv_i128 t16 = tcg_temp_new_i128();
>>>     TCGv_i128 c16 = tcg_temp_new_i128();
>>>     TCGv_i64 low = tcg_temp_new_i64();
>>>     TCGv_i64 high= tcg_temp_new_i64();
>>>     TCGv_i64 temp = tcg_temp_new_i64();
>>>
>>>     tcg_gen_concat_i64_i128(t16, cpu_gpr[rd],  cpu_gpr[rk]));
>>>
>>>     tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
>>>     tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>>>     tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>>>     tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>>
>>
>> The problem is that, the high value read here might not equal to the 
>> previously read one in ll.d r2, base 8 instruction.
> I think dbar 0x7000 ensures that the 2 loads in 'll.q' are a 128bit 
> atomic operation.


The code does work in real LoongArch machine. However, we are emulating 
LoongArch in qemu, we have to make it atomic, yet it isn't now.


>
> Thanks.
> Song Gao
>>> tcg_gen_concat_i64_i128(c16, low, high);
>>>
>>>     tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, 
>>> ctx->mem_idx, MO_128);
>>>
>>>     ...
>>> }
>>>
>>> I am not sure this is right.
>>>
>>> I think Richard can give you more suggestions. @Richard
>>>
>>> Thanks.
>>> Song Gao
>>>>
>>>>> Thanks.
>>>>> Song Gao
>>>>>>>
>>>>>>>
>>>>>>> For this series,
>>>>>>> I think we need set the new config bits to the 'max cpu', and 
>>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can 
>>>>>>> use these new instructions on linux-user mode.
>>>>>>
>>>>>> I will work on it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Song Gao
>>>>>>>>>
>>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>>>>>> restrictions
>>>>>>>>>
>>>>>>>>> But you could do the same thing, aligning and recording the 
>>>>>>>>> entire 128-bit quantity, then extract the ll.d result based on 
>>>>>>>>> address bit 6. This would complicate the implementation of 
>>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to the 
>>>>>>>>> actual architecture.
>>>>>>>>>
>>>>>>>>> Note that our Arm store-exclusive implementation isn't quite 
>>>>>>>>> in spec either.  There is quite a large comment within 
>>>>>>>>> translate-a64.c store_exclusive() about the ways things are 
>>>>>>>>> not quite right.  But it seems to be close enough for actual 
>>>>>>>>> usage to succeed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> r~
>>>>>>>
>>>>>
>>>
>
gaosong Oct. 31, 2023, 12:12 p.m. UTC | #14
在 2023/10/31 下午7:10, Jiajie Chen 写道:
>
> On 2023/10/31 19:06, gaosong wrote:
>> 在 2023/10/31 下午5:13, Jiajie Chen 写道:
>>>
>>> On 2023/10/31 17:11, gaosong wrote:
>>>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/30 16:23, gaosong wrote:
>>>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>>>
>>>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>>>
>>>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, 
>>>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>>>
>>>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64 
>>>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only 
>>>>>>>>>>>> has 64-bit ll and 128-bit sc.
>>>>>>>>>>
>>>>>>>>>> Ah, that does complicate things.
>>>>>>>>>>
>>>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ll.d lo, base, 0
>>>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>>>
>>>>>>>>>>> # do some computation
>>>>>>>>>>>
>>>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>>>
>>>>>>>>>>> # try again if sc failed
>>>>>>>>>>>
>>>>>>>>>>> Then a possible implementation of gen_ll() would be: align 
>>>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save 
>>>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval. 
>>>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But what about the reversed instruction pattern: ll.d hi, 
>>>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>>>
>>>>>>>>>> It would be worth asking your hardware engineers about the 
>>>>>>>>>> bounds of legal behaviour. Ideally there would be some very 
>>>>>>>>>> explicit language, similar to
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm a community developer not affiliated with Loongson. Song 
>>>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ll.d   r1, base, 0
>>>>>>>> dbar 0x700          ==> see 2.2.8.1
>>>>>>>> ld.d  r2, base,  8
>>>>>>>> ...
>>>>>>>> sc.q r1, r2, base
>>>>>>>
>>>>>>>
>>>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d 
>>>>>>> sequence and translate the sequence into one 
>>>>>>> tcg_gen_qemu_ld_i128 and split the result into two 64-bit parts. 
>>>>>>> Can do this in QEMU?
>>>>>>>
>>>>>>>
>>>>>> Oh, I'm not sure.
>>>>>>
>>>>>> I think we just need to implement sc.q. We don't need to care 
>>>>>> about 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>>>> It needs the user to ensure that .
>>>>>>
>>>>>> ll.q' is
>>>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>>>> 2) dbar 0x700 
>>>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>>>
>>>>>> sc.q needs to
>>>>>> 1) Use 64-bit cmpxchg.
>>>>>> 2) Write 128 bits to memory.
>>>>>
>>>>> Consider the following code:
>>>>>
>>>>>
>>>>> ll.d r1, base, 0
>>>>>
>>>>> dbar 0x700
>>>>>
>>>>> ld.d r2, base, 8
>>>>>
>>>>> addi.d r2, r2, 1
>>>>>
>>>>> sc.q r1, r2, base
>>>>>
>>>>>
>>>>> We translate them into native code:
>>>>>
>>>>>
>>>>> ld.d r1, base, 0
>>>>>
>>>>> mv LLbit, 1
>>>>>
>>>>> mv LLaddr, base
>>>>>
>>>>> mv LLval, r1
>>>>>
>>>>> dbar 0x700
>>>>>
>>>>> ld.d r2, base, 8
>>>>>
>>>>> addi.d r2, r2, 1
>>>>>
>>>>> if (LLbit == 1 && LLaddr == base) {
>>>>>
>>>>>     cmpxchg addr=base compare=LLval new=r1
>>>>>
>>>>>     128-bit write {r2, r1} to base if cmpxchg succeeded
>>>>>
>>>>> }
>>>>>
>>>>> set r1 if sc.q succeeded
>>>>>
>>>>>
>>>>>
>>>>> If the memory content of base+8 has changed between ld.d r2 and 
>>>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high 
>>>>> part has changed, the low part hasn't.
>>>>>
>>>>>
>>>> Sorry,  my mistake.  need use cmpxchg_i128.   See 
>>>> target/arm/tcg/translate-a64.c   gen_store_exclusive().
>>>>
>>>> gen_scq(rd, rk, rj)
>>>> {
>>>>      ...
>>>>     TCGv_i128 t16 = tcg_temp_new_i128();
>>>>     TCGv_i128 c16 = tcg_temp_new_i128();
>>>>     TCGv_i64 low = tcg_temp_new_i64();
>>>>     TCGv_i64 high= tcg_temp_new_i64();
>>>>     TCGv_i64 temp = tcg_temp_new_i64();
>>>>
>>>>     tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]));
>>>>
>>>>     tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
>>>>     tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>>>>     tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>>>>     tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>>>
>>>
>>> The problem is that, the high value read here might not equal to the 
>>> previously read one in ll.d r2, base 8 instruction.
>> I think dbar 0x7000 ensures that the 2 loads in 'll.q' are a 128bit 
>> atomic operation.
>
>
> The code does work in real LoongArch machine. However, we are 
> emulating LoongArch in qemu, we have to make it atomic, yet it isn't now.
>
>
yes, I know,  As i said before,  we need't care about 'll.q', it needs 
the user to ensure that.

In QEMU,   I think  the instruction dbar can make it atomic.  but I am 
not sure  this is right.

static bool trans_dbar()
{
         tcg_gen_mb(TCG_BAR_SC | TCG_MO_ALL);
         return;
}

may be this is already enough.

or

like this:
static bool trans_dbar()
{
     TCGBar bar;
     if (a->hint == 0x700)
         bar = TCG_BAR_SC |  TCG_MO_LD_LD;
     } else {
         bar = TCG_BAR_SC | TCG_MO_ALL;
     }

     tcg_gen_mb(bar);
     return true;
}

Thanks.
Song Gao
>>
>> Thanks.
>> Song Gao
>>>> tcg_gen_concat_i64_i128(c16, low, high);
>>>>
>>>>     tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16, 
>>>> ctx->mem_idx, MO_128);
>>>>
>>>>     ...
>>>> }
>>>>
>>>> I am not sure this is right.
>>>>
>>>> I think Richard can give you more suggestions. @Richard
>>>>
>>>> Thanks.
>>>> Song Gao
>>>>>
>>>>>> Thanks.
>>>>>> Song Gao
>>>>>>>>
>>>>>>>>
>>>>>>>> For this series,
>>>>>>>> I think we need set the new config bits to the 'max cpu', and 
>>>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can 
>>>>>>>> use these new instructions on linux-user mode.
>>>>>>>
>>>>>>> I will work on it.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Song Gao
>>>>>>>>>>
>>>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage 
>>>>>>>>>> restrictions
>>>>>>>>>>
>>>>>>>>>> But you could do the same thing, aligning and recording the 
>>>>>>>>>> entire 128-bit quantity, then extract the ll.d result based 
>>>>>>>>>> on address bit 6. This would complicate the implementation of 
>>>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to 
>>>>>>>>>> the actual architecture.
>>>>>>>>>>
>>>>>>>>>> Note that our Arm store-exclusive implementation isn't quite 
>>>>>>>>>> in spec either.  There is quite a large comment within 
>>>>>>>>>> translate-a64.c store_exclusive() about the ways things are 
>>>>>>>>>> not quite right.  But it seems to be close enough for actual 
>>>>>>>>>> usage to succeed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> r~
>>>>>>>>
>>>>>>
>>>>
>>