[RFC,0/1,v3] target/riscv: use tcg ops generation to emulate whole reg rvv loads/stores.

Message ID 20250122164905.13615-1-paolo.savini@embecosm.com (mailing list archive)

Paolo Savini Jan. 22, 2025, 4:49 p.m. UTC
Previous versions:

- RFC v1: https://lore.kernel.org/all/20241218170840.1090473-1-paolo.savini@embecosm.com/
- RFC v2: https://lore.kernel.org/all/20241220153834.16302-1-paolo.savini@embecosm.com/

Thanks Max for the feedback here: https://lore.kernel.org/all/258795e9-4e97-4cd7-949f-24e88d24f25e@sifive.com/

The previous version had the issue that calls to tcg_gen_qemu_[ld/st]_i128 and
tcg_gen_[ld/st]_i128 did not generate 128-bit loads and stores but pairs of
64-bit loads/stores. This meant that, on a trap in the second load/store,
we were unable to increment vstart in ldst_whole_trans by the number of elements
processed by the first 64-bit load/store.
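
To make the problem concrete, here is a small standalone C model. It is not
QEMU code: run_until_fault, struct trap_state and all parameters are
hypothetical names used only for illustration. It assumes vstart can only be
updated after each operation the translator requested, while the host may
perform that operation in smaller parts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model state, not a QEMU structure. */
struct trap_state {
    size_t vstart;      /* elements the translator has accounted for */
    size_t transferred; /* elements actually moved before the fault  */
};

/*
 * op_bytes:     size of each operation the translator requests; vstart is
 *               updated only after a whole operation completes.
 * part_bytes:   size of the accesses the host really performs.
 * fault_offset: byte offset whose access traps.
 */
static struct trap_state run_until_fault(size_t total_bytes, size_t op_bytes,
                                         size_t part_bytes, size_t elem_bytes,
                                         size_t fault_offset)
{
    struct trap_state s = { 0, 0 };

    assert(part_bytes > 0 && elem_bytes > 0);
    assert(op_bytes % part_bytes == 0 && part_bytes % elem_bytes == 0);

    for (size_t op = 0; op < total_bytes; op += op_bytes) {
        for (size_t part = op; part < op + op_bytes; part += part_bytes) {
            if (fault_offset >= part && fault_offset < part + part_bytes) {
                return s; /* trap: vstart keeps its last recorded value */
            }
            s.transferred += part_bytes / elem_bytes;
        }
        s.vstart += op_bytes / elem_bytes;
    }
    return s;
}
```

With 4-byte elements and a fault at byte 8, run_until_fault(16, 16, 8, 4, 8)
models a 128-bit request split into 64-bit halves: 2 elements are transferred
but vstart stays at 0. Requesting 64-bit operations directly,
run_until_fault(16, 8, 8, 4, 8), leaves vstart at 2.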

I propose here the following fixes:

- Split the emulation of whole register loads/stores into smaller sizes:
  since we generate pairs of 64-bit loads/stores at best anyway, we directly
  request 64-bit load/store operations and update vstart accordingly, instead
  of requesting 128-bit loads and stores that will be split under the hood.
- Emulate whole register loads/stores in 32-bit blocks on hosts with 32-bit
  registers: again, this avoids the requested load or store being split
  without us being able to set vstart correctly if a trap happens.
- Don't generate 32-bit loads/stores but fall back to the helper function
  if the host has 32-bit registers and we are loading/storing 64-bit vector
  elements. This prevents a trap from stopping execution mid-element.
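
The three rules above can be summarised as a small decision function. This is
only a sketch under stated assumptions, not the code in the patch:
pick_access_bytes and its parameters are hypothetical names, and the
host-specific performance conditions mentioned below are deliberately left out.

```c
#include <assert.h>

/*
 * Hypothetical helper: returns the width in bytes of each generated
 * load/store, or 0 to mean "fall back to the helper function".
 *
 * host_reg_bits: width of the host registers (32 or 64).
 * elem_bytes:    size of the vector elements being loaded/stored.
 */
static unsigned pick_access_bytes(unsigned host_reg_bits, unsigned elem_bytes)
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);

    if (host_reg_bits >= 64) {
        /* 64-bit host: request 64-bit accesses directly. */
        return 8;
    }
    if (elem_bytes > 4) {
        /* 32-bit host, 64-bit elements: a 32-bit access could trap
         * mid-element, so use the helper instead. */
        return 0;
    }
    /* 32-bit host, elements of at most 32 bits: 32-bit blocks. */
    return 4;
}
```

In this model every non-zero result is a multiple of the element size, so
vstart can always be advanced by a whole number of elements after each access.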

The patch also adds a host architecture specific set of conditions for choosing
between tcg nodes and the helper function.
We observed a performance gain in the emulation of the whole register loads
and stores for all combinations of vector length, element size and number of
fields, apart from a few cases where the helper function performs better.
We add a set of conditions that cover those cases; further conditions can be
added if other architectures require them.

The commit message has been changed to better reflect the new behaviour of the patch.

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bmeng.cn@gmail.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Cc: Helene Chelin <helene.chelin@embecosm.com>
Cc: Nathan Egge <negge@google.com>
Cc: Max Chou <max.chou@sifive.com>
Cc: Jeremy Bennett <jeremy.bennett@embecosm.com>
Cc: Craig Blackmore <craig.blackmore@embecosm.com>


Paolo Savini (1):
  target/riscv: use tcg ops generation to emulate whole reg rvv
    loads/stores.

 target/riscv/insn_trans/trans_rvv.c.inc | 164 +++++++++++++++++-------
 1 file changed, 119 insertions(+), 45 deletions(-)