
[RFC,4/6] accel/tcg: Inline cpu_mmu_lookup function

Message ID 20240215192823.729209-5-max.chou@sifive.com (mailing list archive)
State New, archived
Series Improve the performance of RISC-V vector unit-stride ld/st instructions

Commit Message

Max Chou Feb. 15, 2024, 7:28 p.m. UTC
Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Richard Henderson Feb. 15, 2024, 8:10 p.m. UTC | #1
On 2/15/24 09:28, Max Chou wrote:
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   accel/tcg/user-exec.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
> index 68b252cb8e8..c5453810eee 100644
> --- a/accel/tcg/user-exec.c
> +++ b/accel/tcg/user-exec.c
> @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
>   
>   /* The system-mode versions of these helpers are in cputlb.c.  */
>   
> -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
> -                            MemOp mop, uintptr_t ra, MMUAccessType type)
> +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
> +                                                      vaddr addr,
> +                                                      MemOp mop,
> +                                                      uintptr_t ra,
> +                                                      MMUAccessType type)
>   {
>       int a_bits = get_alignment_bits(mop);
>       void *ret;

This is a large function.  Why does it need to be inlined?
For this and the next two patches I require evidence, because I don't believe you are 
attacking the problem correctly.


r~
Max Chou Feb. 17, 2024, 5:27 p.m. UTC | #2
Hi Richard,

After trying to reduce the overhead of the unnecessary segment flow and
memory plugin callbacks, I observed in the perf report that several
functions consume most of the runtime.
So these three inline patches and the previous patch were created to
reduce that overhead without modifying the functions themselves.
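
For reference, the patch relies on the existing QEMU_ALWAYS_INLINE
macro; as I read include/qemu/compiler.h, it expands roughly to the
following (paraphrased, so treat the exact guard as an assumption):

/* Only apply the attribute on optimized builds; debug builds keep
 * the normal out-of-line calls. */
#ifdef __OPTIMIZE__
#define QEMU_ALWAYS_INLINE  __attribute__((always_inline))
#else
#define QEMU_ALWAYS_INLINE
#endif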

The following are the experiment results.
The benchmark target is the bench-memcpy executable generated from the 
glibc repository (release 2.38 with RVV support patch [1]).

- Execution command

`qemu-riscv64 -E TIMEOUTFACTOR=10 -R 1G -L {glibc build folder}/rootfs -cpu rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true bench-memcpy`
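
(The perf reports below were collected with Linux perf, using something
like `perf record -e cycles:u -g -- qemu-riscv64 ... bench-memcpy`
followed by `perf report --children`; the exact flags aren't recorded
here, so treat that invocation as approximate.)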


- Total runtime

0. Original riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~383 sec.
1. Cherry pick PATCH 4 onto riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~375 sec. (~2% faster)
2. Cherry pick PATCH 4+5+6 onto riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~342 sec. (~11% faster)


- Perf report (cycles)

0. Original riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 385K of event 'cycles:u'
# Event count (approx.): 1411574690539
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  ............................................
#
    47.54%    31.35%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    25.60%     0.03%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    18.87%    13.29%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
    17.40%     0.04%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    17.39%    15.71%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
    17.17%    17.17%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    12.25%     0.00%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu (inlined)
     8.18%     4.08%  qemu-riscv64  qemu-riscv64             [.] lde_b
     8.17%     8.17%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
     7.45%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     7.32%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu (inlined)
     6.79%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
     5.91%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     5.70%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     5.58%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     5.23%     4.25%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
     4.93%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_memop (inlined)
     4.11%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
     4.11%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
     2.88%     2.88%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_data_ra
     2.88%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmuidx_ra (inlined)
     2.75%     0.00%  qemu-riscv64  qemu-riscv64             [.] stb_p (inlined)
     2.66%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     1.79%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000e40203bf
     1.68%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     1.60%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000e403733f
     1.13%     0.00%  qemu-riscv64  qemu-riscv64             [.] ldub_p (inlined)
     0.73%     0.73%  qemu-riscv64  qemu-riscv64             [.] ste_b
     0.53%     0.21%  qemu-riscv64  qemu-riscv64             [.] helper_lookup_tb_ptr

1. Cherry pick PATCH 4 to riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 378K of event 'cycles:u'
# Event count (approx.): 1381912775966
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  .......................................
#
    63.30%    29.62%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    30.77%     0.04%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    28.59%     0.02%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    22.78%    22.78%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    21.40%     5.26%  qemu-riscv64  qemu-riscv64             [.] lde_b
    20.69%    10.40%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
    20.06%     3.91%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
    19.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
    19.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
    12.65%     9.36%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
     8.60%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     6.73%     6.73%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu.constprop.23
     6.73%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     6.31%     0.00%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu (inlined)
     6.20%     6.20%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu
     6.20%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     5.49%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     3.82%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     3.01%     0.00%  qemu-riscv64  qemu-riscv64             [.] ldub_p (inlined)
     2.94%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     2.94%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     2.91%     0.00%  qemu-riscv64  qemu-riscv64             [.] clear_helper_retaddr (inlined)
     2.91%     2.91%  qemu-riscv64  qemu-riscv64             [.] ste_b
     2.90%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     2.90%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     0.59%     0.24%  qemu-riscv64  qemu-riscv64             [.] helper_lookup_tb_ptr

2. Cherry pick PATCH 4+5+6 onto riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 343K of event 'cycles:u'
# Event count (approx.): 1259748868940
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  .......................................
#
    64.16%    35.81%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    30.96%     0.02%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    27.44%     0.05%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    18.37%    18.37%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    17.33%     7.45%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
    14.90%     5.02%  qemu-riscv64  qemu-riscv64             [.] lde_b
    14.83%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
    14.83%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
    14.15%    10.33%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
    11.25%     9.62%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
     7.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     6.99%     6.99%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu
     6.99%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu (inlined)
     5.96%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     5.95%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     5.30%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     4.18%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     3.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     3.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     3.22%     3.22%  qemu-riscv64  qemu-riscv64             [.] ste_b
     3.19%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     3.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] clear_helper_retaddr (inlined)
     2.76%     0.00%  qemu-riscv64  qemu-riscv64             [.] g2h (inlined)
     2.76%     0.00%  qemu-riscv64  qemu-riscv64             [.] g2h_untagged (inlined)
     2.30%     0.00%  qemu-riscv64  qemu-riscv64             [.] env_cpu (inlined)
     2.21%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     0.99%     0.00%  qemu-riscv64  qemu-riscv64             [.] env_cpu (inlined)


I agree that these are large functions to inline, and I haven't tested
these patches on all combinations (different guest architectures +
different host architectures).
So I think we can drop these three patches until we can confirm that
they benefit all combinations without side effects.
For the next version I'll focus on avoiding over-use of the full
out-of-line load/store routines.
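
As a rough sketch of that direction (under my assumptions; this is not
the actual next version), a unit-stride helper could translate the
guest address once via the existing probe_access() API and then operate
on host memory directly, rather than issuing an out-of-line
cpu_ldb_mmu()/cpu_stb_mmu() call per element:

/* Sketch only: assumes the whole access fits within one guest page and
 * that the usual QEMU-internal headers (exec/exec-all.h etc.) are
 * available.  probe_access() validates the access and returns a host
 * pointer for normal RAM, so the fast path is a single translation
 * plus a plain memcpy; otherwise we fall back to the per-byte helpers. */
static void vext_ldst_us_load_sketch(CPURISCVState *env, uint8_t *vd,
                                     target_ulong addr, uint32_t evl,
                                     uintptr_t ra)
{
    void *host = probe_access(env, addr, evl, MMU_DATA_LOAD,
                              cpu_mmu_index(env, false), ra);

    if (host) {
        memcpy(vd, host, evl);           /* fast path: direct host access */
    } else {
        for (uint32_t i = 0; i < evl; i++) {
            vd[i] = cpu_ldub_data_ra(env, addr + i, ra);   /* slow path */
        }
    }
}

Note that the fast path above would also skip the plugin callbacks,
which is where many of the cycles in the reports above are going, so a
real version would need to keep plugin notification correct.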


Thanks for the suggestion and question,

Max

[1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com

On 2024/2/16 4:10 AM, Richard Henderson wrote:
> On 2/15/24 09:28, Max Chou wrote:
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>>   accel/tcg/user-exec.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
>> index 68b252cb8e8..c5453810eee 100644
>> --- a/accel/tcg/user-exec.c
>> +++ b/accel/tcg/user-exec.c
>> @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
>>   
>>   /* The system-mode versions of these helpers are in cputlb.c.  */
>>   
>> -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
>> -                            MemOp mop, uintptr_t ra, MMUAccessType type)
>> +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
>> +                                                      vaddr addr,
>> +                                                      MemOp mop,
>> +                                                      uintptr_t ra,
>> +                                                      MMUAccessType type)
>>   {
>>       int a_bits = get_alignment_bits(mop);
>>       void *ret;
>
> This is a large function.  Why does it need to be inlined?
> For this and the next two patches I require evidence, because I don't 
> believe you are attacking the problem correctly.
>
>
> r~

Patch

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 68b252cb8e8..c5453810eee 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -942,8 +942,11 @@  void page_reset_target_data(target_ulong start, target_ulong last) { }
 
 /* The system-mode versions of these helpers are in cputlb.c.  */
 
-static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
-                            MemOp mop, uintptr_t ra, MMUAccessType type)
+static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
+                                                      vaddr addr,
+                                                      MemOp mop,
+                                                      uintptr_t ra,
+                                                      MMUAccessType type)
 {
     int a_bits = get_alignment_bits(mop);
     void *ret;