Message ID | 20241211125113.583902-3-craig.blackmore@embecosm.com (mailing list archive) |
---|---|
State | New |
Series | target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores |
On 2024/12/11 8:51 PM, Craig Blackmore wrote:
> Calling `vext_continuous_ldst_tlb` for load/stores smaller than 12 bytes
> significantly improves performance.
>
> Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
> Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>
> Co-authored-by: Craig Blackmore <craig.blackmore@embecosm.com>
>
> Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> Signed-off-by: Craig Blackmore <craig.blackmore@embecosm.com>
> ---
>  target/riscv/vector_helper.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 0f57e48cc5..529b4b261e 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -393,6 +393,22 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>          return;
>      }
>
> +#if defined(CONFIG_USER_ONLY)
> +    /*
> +     * For data sizes < 12 bits we get better performance by simply calling

I think that `bits` should be replaced with `bytes` here.

It seems that this patch aims to strike a balance between the overhead of
checking and probing for the fast path and the benefit of executing the
load/store operations directly through the slow path. Could you please share
any experiment results (and the steps/commands) that illustrate the
performance impact when the load/store size exceeds 12 bytes?

(BTW, this version can pass the vstart test case that I created for the
unexpected vstart value issue in the previous version:
https://github.com/rnax/rise-rvv-tcg-qemu-tooling/commit/bccc43a7be9f636cfdaea3c2bfb020558777c7dc)

> +     * vext_continuous_ldst_tlb
> +     */
> +    if (nf == 1 && (evl << log2_esz) < 12) {
> +        addr = base + (env->vstart << log2_esz);
> +        vext_continuous_ldst_tlb(env, ldst_tlb, vd, evl, addr, env->vstart, ra,
> +                                 esz, is_load);
> +
> +        env->vstart = 0;
> +        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
> +        return;
> +    }
> +#endif
> +
>      /* Calculate the page range of first page */
>      addr = base + ((env->vstart * nf) << log2_esz);
>      page_split = -(addr | TARGET_PAGE_MASK);
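For reference, the kind of measurement being asked about can be taken with a
small user-mode microbenchmark run under qemu-riscv64 before and after the
patch. The sketch below is illustrative only and is not from this series: it
assumes a toolchain with the standard RVV C intrinsics (for example
riscv64-linux-gnu-gcc -O2 -march=rv64gcv), and the buffer sizes, iteration
count and chosen element count are placeholders. Raising the argument of
__riscv_vsetvl_e8m1 above 12 moves the transfer size past the threshold in
the patch.

/*
 * Illustrative microbenchmark (not part of the patch): times repeated
 * 8-byte unit-stride vector loads and stores in user mode.  Buffer sizes,
 * iteration count and element count are arbitrary placeholders.
 */
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

int main(void)
{
    static uint8_t src[64], dst[64];
    /* 8 bytes per access: below the 12-byte threshold added by the patch */
    size_t vl = __riscv_vsetvl_e8m1(8);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++) {
        vuint8m1_t v = __riscv_vle8_v_u8m1(src, vl);
        __riscv_vse8_v_u8m1(dst, v, vl);
        asm volatile("" ::: "memory");   /* keep the loop from being elided */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("vl=%zu bytes: %.2f s (dst[0]=%u)\n", vl,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           dst[0]);
    return 0;
}

Running the resulting binary with qemu-riscv64 against a pre-patch and a
post-patch build, and comparing wall-clock times on either side of the
cut-off, would show the effect being discussed.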
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 0f57e48cc5..529b4b261e 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -393,6 +393,22 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         return;
     }
 
+#if defined(CONFIG_USER_ONLY)
+    /*
+     * For data sizes < 12 bits we get better performance by simply calling
+     * vext_continuous_ldst_tlb
+     */
+    if (nf == 1 && (evl << log2_esz) < 12) {
+        addr = base + (env->vstart << log2_esz);
+        vext_continuous_ldst_tlb(env, ldst_tlb, vd, evl, addr, env->vstart, ra,
+                                 esz, is_load);
+
+        env->vstart = 0;
+        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
+        return;
+    }
+#endif
+
     /* Calculate the page range of first page */
     addr = base + ((env->vstart * nf) << log2_esz);
     page_split = -(addr | TARGET_PAGE_MASK);
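To make the trade-off discussed in the review concrete, the sketch below
shows the general shape of an element-wise unit-stride loop: one guest
access per element, with none of the page-split or host-page-probing work
done by the main path. The names elem_ldst_fn and elementwise_ldst_sketch
are hypothetical and the loop is only an assumption about the structure of
vext_continuous_ldst_tlb, not a copy of the QEMU code.

#include <stdint.h>

/* Hypothetical stand-in for a per-element guest load/store callback. */
typedef void elem_ldst_fn(void *env, uint64_t addr, uint32_t idx, void *vd);

/*
 * Element-wise loop, sketched: touch one element at a time from vstart up
 * to evl, advancing the guest address by the element size.  For transfers
 * under the 12-byte cut-off this is cheaper than setting up the page-split
 * and host-page fast path.
 */
static void elementwise_ldst_sketch(void *env, elem_ldst_fn *ldst, void *vd,
                                    uint32_t evl, uint64_t addr,
                                    uint32_t vstart, uint32_t esz)
{
    for (uint32_t i = vstart; i < evl; i++, addr += esz) {
        ldst(env, addr, i, vd);
    }
}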