Message ID | 20230124195945.181842-9-christoph.muellner@vrull.eu
---|---
State | New, archived
Series | Add support for the T-Head vendor extensions
On 1/24/23 09:59, Christoph Muellner wrote:
> +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
> +                            int shamt)
> +{
> +    TCGv rd1 = dest_gpr(ctx, a->rd1);
> +    TCGv rd2 = dest_gpr(ctx, a->rd2);
> +    TCGv addr1 = tcg_temp_new();
> +    TCGv addr2 = tcg_temp_new();
> +
> +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
> +    if ((memop & MO_SIZE) == MO_64) {
> +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
> +    } else {
> +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
> +    }
> +
> +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
> +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
> +    gen_set_gpr(ctx, a->rd1, rd1);
> +    gen_set_gpr(ctx, a->rd2, rd2);

Since dest_gpr may return cpu_gpr[n], this may update rd1 before
recognizing the exception that the second load may generate. Is that
correct?

The manual says that rd1, rd2, and rs1 must not be the same, but you
do not check this.


r~
On 2023/1/25 4:44, Richard Henderson wrote:
> On 1/24/23 09:59, Christoph Muellner wrote:
>> +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
>> +                            int shamt)
>> +{
>> +    TCGv rd1 = dest_gpr(ctx, a->rd1);
>> +    TCGv rd2 = dest_gpr(ctx, a->rd2);
>> +    TCGv addr1 = tcg_temp_new();
>> +    TCGv addr2 = tcg_temp_new();
>> +
>> +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
>> +    if ((memop & MO_SIZE) == MO_64) {
>> +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
>> +    } else {
>> +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
>> +    }
>> +
>> +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
>> +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
>> +    gen_set_gpr(ctx, a->rd1, rd1);
>> +    gen_set_gpr(ctx, a->rd2, rd2);
>
> Since dest_gpr may return cpu_gpr[n], this may update rd1 before
> recognizing the exception that the second load may generate. Is that
> correct?

Thanks. It's a bug. We should load all memory addresses to local TCG
temps first.

Do you think we should probe all the memory addresses for the store
pair instructions? If so, can we avoid the use of a helper function?

> The manual says that rd1, rd2, and rs1 must not be the same, but you
> do not check this.

The main reason is that the assembler can do this check. Is it
necessary to check this in QEMU?

Best Regards,
Zhiwei

>
>
> r~
On 1/29/23 16:03, LIU Zhiwei wrote:
> Thanks. It's a bug. We should load all memory addresses to local TCG
> temps first.
>
> Do you think we should probe all the memory addresses for the store
> pair instructions? If so, can we avoid the use of a helper function?

Depends on what the hardware does. Even with a trap in the middle the
stores are restartable, since no register state changes.

But if you'd like no state change until both stores are verified, for
this case you can pack the pair into a larger data type: TCGv_i64 for a
pair of 32-bit values, and TCGv_i128 for a pair of 64-bit values.
Patches for TCGv_i128 [1] are just finishing review; patches to
describe the atomicity of the larger operation are also on list [2].
Anyway, the idea is that you issue one TCG memory operation, the entire
operation is validated, and then the stores happen.

> The main reason is that the assembler can do this check. Is it
> necessary to check this in QEMU?

Yes. Consider what happens when the insn is encoded with .long. Does
the hardware trap an illegal instruction? Is the behavior simply
unspecified? The manual could be improved to specify, akin to the Arm
terms: UNDEFINED, CONSTRAINED UNPREDICTABLE, IMPLEMENTATION DEFINED,
etc.


r~

[1] https://patchew.org/QEMU/20230126043824.54819-1-richard.henderson@linaro.org/
[2] https://patchew.org/QEMU/20221118094754.242910-1-richard.henderson@linaro.org/
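As a concrete illustration of the packing idea for the 32-bit store
pair, a minimal sketch follows. It is not part of the posted series:
the function name gen_storepair_i64 is invented here, and it assumes
the usual target/riscv translate.c helpers (get_gpr, get_address) plus
the guest's little-endian memory order, with rd1 supplying the low word
to match the order of the two 4-byte stores.

static bool gen_storepair_i64(DisasContext *ctx, arg_th_pair *a, int shamt)
{
    TCGv data1 = get_gpr(ctx, a->rd1, EXT_NONE);
    TCGv data2 = get_gpr(ctx, a->rd2, EXT_NONE);
    TCGv addr = get_address(ctx, a->rs, a->sh2 << shamt);
    TCGv_i64 pair = tcg_temp_new_i64();

    /* Pack both 32-bit values; rd1 becomes the low word (assumption). */
    tcg_gen_concat_tl_i64(pair, data1, data2);
    /* One 64-bit access: validated as a whole before any byte is written. */
    tcg_gen_qemu_st_i64(pair, addr, ctx->mem_idx, MO_TEUQ);
    tcg_temp_free_i64(pair);
    return true;
}

A 64-bit pair would take the same shape with TCGv_i128 once [1] lands;
the atomicity of the merged access is what [2] makes describable.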
On 2023/1/30 13:43, Richard Henderson wrote:
> On 1/29/23 16:03, LIU Zhiwei wrote:
>> Thanks. It's a bug. We should load all memory addresses to local TCG
>> temps first.
>>
>> Do you think we should probe all the memory addresses for the store
>> pair instructions? If so, can we avoid the use of a helper function?
>
> Depends on what the hardware does. Even with a trap in the middle the
> stores are restartable, since no register state changes.

I referred to the specification of LDP and STP on AArch64. The
specification allows "any access performed before the exception was
taken is repeated".

In detail:

"If, according to these rules, an instruction is executed as a sequence
of accesses, exceptions, including interrupts, can be taken during that
sequence, regardless of the memory type being accessed. If any of these
exceptions are returned from using their preferred return address, the
instruction that generated the sequence of accesses is re-executed, and
so any access performed before the exception was taken is repeated. See
also Taking an interrupt during a multi-access load or store on page
D1-4664."

However, I see that QEMU implements LDP and STP differently: LDP writes
the first register only once it is sure the second access cannot trap.

So I have two questions here.

1) One about the QEMU implementation of LDP. Can we implement LDP as
two loads directly into cpu registers instead of into local TCG temps?

2) One about the comment. Why do register state changes make the
instruction non-restartable? Do you mean that if the first register
changes, it may influence the address calculation after the trap?

"Even with a trap in the middle the stores are restartable, since no
register state changes."

> But if you'd like no state change until both stores are verified, for
> this case you can pack the pair into a larger data type: TCGv_i64 for
> a pair of 32-bit values, and TCGv_i128 for a pair of 64-bit values.
> Patches for TCGv_i128 [1] are just finishing review; patches to
> describe the atomicity of the larger operation are also on list [2].
> Anyway, the idea is that you issue one TCG memory operation, the
> entire operation is validated, and then the stores happen.
>
>> The main reason is that the assembler can do this check. Is it
>> necessary to check this in QEMU?
>
> Yes. Consider what happens when the insn is encoded with .long. Does
> the hardware trap an illegal instruction? Is the behavior simply
> unspecified? The manual could be improved to specify, akin to the Arm
> terms: UNDEFINED, CONSTRAINED UNPREDICTABLE, IMPLEMENTATION DEFINED,
> etc.

Thanks, I will fix the manual.

Best Regards,
Zhiwei

> r~
>
> [1] https://patchew.org/QEMU/20230126043824.54819-1-richard.henderson@linaro.org/
> [2] https://patchew.org/QEMU/20221118094754.242910-1-richard.henderson@linaro.org/
On 1/29/23 22:41, LIU Zhiwei wrote:
> On 2023/1/30 13:43, Richard Henderson wrote:
>> On 1/29/23 16:03, LIU Zhiwei wrote:
>>> Thanks. It's a bug. We should load all memory addresses to local TCG
>>> temps first.
>>>
>>> Do you think we should probe all the memory addresses for the store
>>> pair instructions? If so, can we avoid the use of a helper function?
>>
>> Depends on what the hardware does. Even with a trap in the middle the
>> stores are restartable, since no register state changes.
>
> I referred to the specification of LDP and STP on AArch64. The
> specification allows "any access performed before the exception was
> taken is repeated".
>
> In detail:
>
> "If, according to these rules, an instruction is executed as a
> sequence of accesses, exceptions, including interrupts, can be taken
> during that sequence, regardless of the memory type being accessed.
> If any of these exceptions are returned from using their preferred
> return address, the instruction that generated the sequence of
> accesses is re-executed, and so any access performed before the
> exception was taken is repeated. See also Taking an interrupt during
> a multi-access load or store on page D1-4664."
>
> However, I see that QEMU implements LDP and STP differently: LDP
> writes the first register only once it is sure the second access
> cannot trap.
>
> So I have two questions here.
>
> 1) One about the QEMU implementation of LDP. Can we implement LDP as
> two loads directly into cpu registers instead of into local TCG temps?

For the T-Head specification, where rd1 != rs1 (and you enforce it),
then yes, I suppose you could load directly into the cpu registers,
because on restart rs1 would be unmodified.

For AArch64, which you quote above, there is no constraint that the
destinations do not overlap the address register, so we must implement
"LDP r0, r1, [r0]" as a load into temps.

> 2) One about the comment. Why do register state changes make the
> instruction non-restartable? Do you mean that if the first register
> changes, it may influence the address calculation after the trap?

Yes, that's what I mean about non-restartable -- if any of the input
registers are changed before the trap is recognized.

>> Yes. Consider what happens when the insn is encoded with .long. Does
>> the hardware trap an illegal instruction? Is the behavior simply
>> unspecified? The manual could be improved to specify, akin to the
>> Arm terms: UNDEFINED, CONSTRAINED UNPREDICTABLE, IMPLEMENTATION
>> DEFINED, etc.
>
> Thanks, I will fix the manual.

Excellent, thanks.


r~
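To make the temp-based pattern concrete, here is a minimal sketch
against the quoted patch (illustrative only, not the actual v4 code):
the register writeback is delayed until both accesses have completed,
so a fault on the second load leaves rd1, rd2 and rs1 unchanged and the
instruction can simply be re-executed after the trap is handled.

static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
                            int shamt)
{
    int sz = ((memop & MO_SIZE) == MO_64) ? 8 : 4;
    TCGv addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
    TCGv addr2 = get_address(ctx, a->rs, sz + (a->sh2 << shamt));
    TCGv t1 = tcg_temp_new();
    TCGv t2 = tcg_temp_new();

    /* Both loads complete (or trap) before any guest register is written. */
    tcg_gen_qemu_ld_tl(t1, addr1, ctx->mem_idx, memop);
    tcg_gen_qemu_ld_tl(t2, addr2, ctx->mem_idx, memop);
    gen_set_gpr(ctx, a->rd1, t1);
    gen_set_gpr(ctx, a->rd2, t2);

    tcg_temp_free(t1);
    tcg_temp_free(t2);
    return true;
}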
On 2023/1/31 3:03, Richard Henderson wrote:
> On 1/29/23 22:41, LIU Zhiwei wrote:
>> On 2023/1/30 13:43, Richard Henderson wrote:
>>> On 1/29/23 16:03, LIU Zhiwei wrote:
>>>> Thanks. It's a bug. We should load all memory addresses to local
>>>> TCG temps first.
>>>>
>>>> Do you think we should probe all the memory addresses for the store
>>>> pair instructions? If so, can we avoid the use of a helper function?
>>>
>>> Depends on what the hardware does. Even with a trap in the middle
>>> the stores are restartable, since no register state changes.
>>
>> I referred to the specification of LDP and STP on AArch64. The
>> specification allows "any access performed before the exception was
>> taken is repeated".
>>
>> In detail:
>>
>> "If, according to these rules, an instruction is executed as a
>> sequence of accesses, exceptions, including interrupts, can be taken
>> during that sequence, regardless of the memory type being accessed.
>> If any of these exceptions are returned from using their preferred
>> return address, the instruction that generated the sequence of
>> accesses is re-executed, and so any access performed before the
>> exception was taken is repeated. See also Taking an interrupt during
>> a multi-access load or store on page D1-4664."
>>
>> However, I see that QEMU implements LDP and STP differently: LDP
>> writes the first register only once it is sure the second access
>> cannot trap.
>>
>> So I have two questions here.
>>
>> 1) One about the QEMU implementation of LDP. Can we implement LDP as
>> two loads directly into cpu registers instead of into local TCG temps?
>
> For the T-Head specification, where rd1 != rs1 (and you enforce it),
> then yes, I suppose you could load directly into the cpu registers,
> because on restart rs1 would be unmodified.
>
> For AArch64, which you quote above, there is no constraint that the
> destinations do not overlap the address register, so we must implement
> "LDP r0, r1, [r0]" as a load into temps.

Got it. Thanks.

>> 2) One about the comment. Why do register state changes make the
>> instruction non-restartable? Do you mean that if the first register
>> changes, it may influence the address calculation after the trap?
>
> Yes, that's what I mean about non-restartable -- if any of the input
> registers are changed before the trap is recognized.

Thanks for the clarification. I had thought the reason for
non-restartability was the side effects of repeated execution, which
could cause a watchpoint to match twice or an MMIO device to be
accessed twice.

>>> Yes. Consider what happens when the insn is encoded with .long.
>>> Does the hardware trap an illegal instruction? Is the behavior
>>> simply unspecified? The manual could be improved to specify, akin
>>> to the Arm terms: UNDEFINED, CONSTRAINED UNPREDICTABLE,
>>> IMPLEMENTATION DEFINED, etc.
>>
>> Thanks, I will fix the manual.

The manual has been fixed by Christoph. Thanks.

Best Regards,
Zhiwei

> Excellent, thanks.
>
>
> r~
On Tue, Jan 24, 2023 at 9:44 PM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 1/24/23 09:59, Christoph Muellner wrote:
> > +static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
> > +                            int shamt)
> > +{
> > +    TCGv rd1 = dest_gpr(ctx, a->rd1);
> > +    TCGv rd2 = dest_gpr(ctx, a->rd2);
> > +    TCGv addr1 = tcg_temp_new();
> > +    TCGv addr2 = tcg_temp_new();
> > +
> > +    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
> > +    if ((memop & MO_SIZE) == MO_64) {
> > +        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
> > +    } else {
> > +        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
> > +    }
> > +
> > +    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
> > +    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
> > +    gen_set_gpr(ctx, a->rd1, rd1);
> > +    gen_set_gpr(ctx, a->rd2, rd2);
>
> Since dest_gpr may return cpu_gpr[n], this may update rd1 before
> recognizing the exception that the second load may generate. Is that
> correct?

Solved in v4 by using temporaries.

> The manual says that rd1, rd2, and rs1 must not be the same, but you
> do not check this.

Fixed in v4.
Thank you!

>
>
> r~
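For reference, the operand check discussed above can live entirely in
the decoder; a hedged sketch follows (the helper name
th_pair_check_regs is invented here and is not taken from v4):

/* The manual forbids rd1, rd2 and rs1 from being the same register. */
static bool th_pair_check_regs(arg_th_pair *a)
{
    return a->rd1 != a->rd2 && a->rd1 != a->rs && a->rd2 != a->rs;
}

static bool trans_th_lwd(DisasContext *ctx, arg_th_pair *a)
{
    REQUIRE_XTHEADMEMPAIR(ctx);
    /* Returning false sends the encoding down the illegal-insn path. */
    if (!th_pair_check_regs(a)) {
        return false;
    }
    return gen_loadpair_tl(ctx, a, MO_TESL, 3);
}

The restartability argument only requires this for the load pairs; the
store pairs modify no registers, so overlapping source operands are
harmless.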
diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 2ce8eb6a6f..e3a10f782c 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -115,6 +115,7 @@ static const struct isa_ext_data isa_edata_arr[] = {
     ISA_EXT_DATA_ENTRY(xtheadcmo, true, PRIV_VERSION_1_11_0, ext_xtheadcmo),
     ISA_EXT_DATA_ENTRY(xtheadcondmov, true, PRIV_VERSION_1_11_0, ext_xtheadcondmov),
     ISA_EXT_DATA_ENTRY(xtheadmac, true, PRIV_VERSION_1_11_0, ext_xtheadmac),
+    ISA_EXT_DATA_ENTRY(xtheadmempair, true, PRIV_VERSION_1_11_0, ext_xtheadmempair),
     ISA_EXT_DATA_ENTRY(xtheadsync, true, PRIV_VERSION_1_11_0, ext_xtheadsync),
     ISA_EXT_DATA_ENTRY(xventanacondops, true, PRIV_VERSION_1_12_0, ext_XVentanaCondOps),
 };
@@ -1084,6 +1085,7 @@ static Property riscv_cpu_extensions[] = {
     DEFINE_PROP_BOOL("xtheadcmo", RISCVCPU, cfg.ext_xtheadcmo, false),
     DEFINE_PROP_BOOL("xtheadcondmov", RISCVCPU, cfg.ext_xtheadcondmov, false),
     DEFINE_PROP_BOOL("xtheadmac", RISCVCPU, cfg.ext_xtheadmac, false),
+    DEFINE_PROP_BOOL("xtheadmempair", RISCVCPU, cfg.ext_xtheadmempair, false),
     DEFINE_PROP_BOOL("xtheadsync", RISCVCPU, cfg.ext_xtheadsync, false),
     DEFINE_PROP_BOOL("xventanacondops", RISCVCPU, cfg.ext_XVentanaCondOps, false),
diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index 55aea777a0..4f5f3b2c20 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -479,6 +479,7 @@ struct RISCVCPUConfig {
     bool ext_xtheadcmo;
     bool ext_xtheadcondmov;
     bool ext_xtheadmac;
+    bool ext_xtheadmempair;
     bool ext_xtheadsync;
     bool ext_XVentanaCondOps;
diff --git a/target/riscv/insn_trans/trans_xthead.c.inc b/target/riscv/insn_trans/trans_xthead.c.inc
index 1c583ea8ec..7ab2a7a48e 100644
--- a/target/riscv/insn_trans/trans_xthead.c.inc
+++ b/target/riscv/insn_trans/trans_xthead.c.inc
@@ -52,6 +52,12 @@
     }                                            \
 } while (0)
 
+#define REQUIRE_XTHEADMEMPAIR(ctx) do {          \
+    if (!ctx->cfg_ptr->ext_xtheadmempair) {      \
+        return false;                            \
+    }                                            \
+} while (0)
+
 #define REQUIRE_XTHEADSYNC(ctx) do {             \
     if (!ctx->cfg_ptr->ext_xtheadsync) {         \
         return false;                            \
     }                                            \
 } while (0)
@@ -382,6 +388,88 @@ static bool trans_th_mulsw(DisasContext *ctx, arg_th_mulsw *a)
     return gen_th_mac(ctx, a, tcg_gen_sub_tl, NULL);
 }
 
+/* XTheadMemPair */
+
+static bool gen_loadpair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
+                            int shamt)
+{
+    TCGv rd1 = dest_gpr(ctx, a->rd1);
+    TCGv rd2 = dest_gpr(ctx, a->rd2);
+    TCGv addr1 = tcg_temp_new();
+    TCGv addr2 = tcg_temp_new();
+
+    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
+    if ((memop & MO_SIZE) == MO_64) {
+        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
+    } else {
+        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
+    }
+
+    tcg_gen_qemu_ld_tl(rd1, addr1, ctx->mem_idx, memop);
+    tcg_gen_qemu_ld_tl(rd2, addr2, ctx->mem_idx, memop);
+    gen_set_gpr(ctx, a->rd1, rd1);
+    gen_set_gpr(ctx, a->rd2, rd2);
+
+    tcg_temp_free(addr1);
+    tcg_temp_free(addr2);
+    return true;
+}
+
+static bool trans_th_ldd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    REQUIRE_64BIT(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TESQ, 4);
+}
+
+static bool trans_th_lwd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TESL, 3);
+}
+
+static bool trans_th_lwud(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_loadpair_tl(ctx, a, MO_TEUL, 3);
+}
+
+static bool gen_storepair_tl(DisasContext *ctx, arg_th_pair *a, MemOp memop,
+                             int shamt)
+{
+    TCGv data1 = get_gpr(ctx, a->rd1, EXT_NONE);
+    TCGv data2 = get_gpr(ctx, a->rd2, EXT_NONE);
+    TCGv addr1 = tcg_temp_new();
+    TCGv addr2 = tcg_temp_new();
+
+    addr1 = get_address(ctx, a->rs, a->sh2 << shamt);
+    if ((memop & MO_SIZE) == MO_64) {
+        addr2 = get_address(ctx, a->rs, 8 + (a->sh2 << shamt));
+    } else {
+        addr2 = get_address(ctx, a->rs, 4 + (a->sh2 << shamt));
+    }
+
+    tcg_gen_qemu_st_tl(data1, addr1, ctx->mem_idx, memop);
+    tcg_gen_qemu_st_tl(data2, addr2, ctx->mem_idx, memop);
+
+    tcg_temp_free(addr1);
+    tcg_temp_free(addr2);
+    return true;
+}
+
+static bool trans_th_sdd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    REQUIRE_64BIT(ctx);
+    return gen_storepair_tl(ctx, a, MO_TESQ, 4);
+}
+
+static bool trans_th_swd(DisasContext *ctx, arg_th_pair *a)
+{
+    REQUIRE_XTHEADMEMPAIR(ctx);
+    return gen_storepair_tl(ctx, a, MO_TESL, 3);
+}
+
 /* XTheadSync */
 
 static bool trans_th_sfence_vmas(DisasContext *ctx, arg_th_sfence_vmas *a)
diff --git a/target/riscv/translate.c b/target/riscv/translate.c
index 5be1c9da69..27bab07994 100644
--- a/target/riscv/translate.c
+++ b/target/riscv/translate.c
@@ -133,7 +133,7 @@ static bool has_xthead_p(DisasContext *ctx __attribute__((__unused__)))
     return ctx->cfg_ptr->ext_xtheadba || ctx->cfg_ptr->ext_xtheadbb ||
            ctx->cfg_ptr->ext_xtheadbs || ctx->cfg_ptr->ext_xtheadcmo ||
            ctx->cfg_ptr->ext_xtheadcondmov || ctx->cfg_ptr->ext_xtheadmac ||
-           ctx->cfg_ptr->ext_xtheadsync;
+           ctx->cfg_ptr->ext_xtheadmempair || ctx->cfg_ptr->ext_xtheadsync;
 }
 
 #define MATERIALISE_EXT_PREDICATE(ext) \
diff --git a/target/riscv/xthead.decode b/target/riscv/xthead.decode
index 696de6cecf..ff2a83b56d 100644
--- a/target/riscv/xthead.decode
+++ b/target/riscv/xthead.decode
@@ -11,16 +11,21 @@
 
 # Fields:
 %rd        7:5
+%rd1       7:5
+%rs        15:5
 %rs1       15:5
+%rd2       20:5
 %rs2       20:5
 %sh5       20:5
 %sh6       20:6
+%sh2       25:2
 
 # Argument sets
 &r         rd rs1 rs2                            !extern
 &r2        rd rs1                                !extern
 &shift     shamt rs1 rd                          !extern
 &th_bfext  msb lsb rs1 rd
+&th_pair   rd1 rs rd2 sh2
 
 # Formats
 @sfence_vm  ....... ..... .....   ... ..... ....... %rs1
@@ -30,6 +35,7 @@
 @th_bfext   msb:6 lsb:6 .....  ... ..... .......  &th_bfext %rs1 %rd
 @sh5        ....... ..... .....  ... ..... ....... &shift shamt=%sh5 %rs1 %rd
 @sh6        ...... ...... .....  ... ..... ....... &shift shamt=%sh6 %rs1 %rd
+@th_pair    ..... .. ..... ..... ... ..... ....... &th_pair %rd1 %rs %rd2 %sh2
 
 # XTheadBa
 # Instead of defining a new encoding, we simply use the decoder to
@@ -96,6 +102,13 @@ th_muls          00100 01 ..... ..... 001 ..... 0001011 @r
 th_mulsh         00101 01 ..... ..... 001 ..... 0001011 @r
 th_mulsw         00100 11 ..... ..... 001 ..... 0001011 @r
 
+# XTheadMemPair
+th_ldd           11111 .. ..... ..... 100 ..... 0001011 @th_pair
+th_lwd           11100 .. ..... ..... 100 ..... 0001011 @th_pair
+th_lwud          11110 .. ..... ..... 100 ..... 0001011 @th_pair
+th_sdd           11111 .. ..... ..... 101 ..... 0001011 @th_pair
+th_swd           11100 .. ..... ..... 101 ..... 0001011 @th_pair
+
 # XTheadSync
 th_sfence_vmas   0000010 ..... ..... 000 00000 0001011 @rs2_s
 th_sync          0000000 11000 00000 000 00000 0001011