[v6,0/4] riscv: tlb flush improvements

Message ID: 20231030133027.19542-1-alexghiti@rivosinc.com

Message

Alexandre Ghiti Oct. 30, 2023, 1:30 p.m. UTC
This series optimizes the tlb flushes on riscv, which used to simply
flush the whole tlb regardless of the size of the range to flush or the
size of the stride.

Patch 3 introduces a threshold that is microarchitecture specific and
will very likely be modified by vendors; I'm not sure yet which mechanism
we'll use to do that (dt? alternatives? vendor initialization code?).
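
To give an idea of what patch 3 does, here is a very simplified sketch
(names follow the series, but this is not the actual diff; the real code
lives in arch/riscv/mm/tlbflush.c and also handles the no-ASID and
kernel-mapping cases):

/* Entries above this threshold trigger a full flush instead of a loop */
static unsigned long tlb_flush_all_threshold __read_mostly = 64;

static void local_flush_tlb_range_threshold_asid(unsigned long start,
						 unsigned long size,
						 unsigned long stride,
						 unsigned long asid)
{
	unsigned long nr_ptes_in_range = DIV_ROUND_UP(size, stride);
	int i;

	/* Too many entries: one full flush is cheaper than a flush per page */
	if (nr_ptes_in_range > tlb_flush_all_threshold) {
		local_flush_tlb_all_asid(asid);
		return;
	}

	/* Otherwise issue one sfence.vma per stride-sized entry */
	for (i = 0; i < nr_ptes_in_range; i++) {
		local_flush_tlb_page_asid(start, asid);
		start += stride;
	}
}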

Next steps would be to implement:
- svinval extension as Mayuresh did here [1]
- BATCHED_UNMAP_TLB_FLUSH (I'll wait for arm64 patchset to land)
- MMU_GATHER_RCU_TABLE_FREE
- MMU_GATHER_MERGE_VMAS

Any other ideas are welcome.

[1] https://lore.kernel.org/linux-riscv/20230623123849.1425805-1-mchitale@ventanamicro.com/

Changes in v6:
- Remove ifdef SVNAPOT, as suggested by Samuel
- Fix usage of u16 which could overflow, as noted by Samuel
- Use cpu_online_mask, as suggested by Samuel
- Move static_branch_unlikely(&use_asid_allocator) test, as suggested by Samuel
- Add TB/RB from Prabhakar and Samuel, thanks guys!

Changes in v5:
- Fix commit message s/flush_tlb/tlb_flush, thanks to Samuel
- Simplify NAPOT mapping stride size handling, as suggested by Samuel
- Add TB from Prabhakar
- Add RB from Samuel
- Remove TB/RB from patch 2 as it changed enough

Changes in v4:
- Correctly handle the stride size for a NAPOT hugepage, thanks to Aaron Durbin!
- Fix flush_tlb_kernel_range() which passed a wrong argument to __flush_tlb_range()
- Factorize code to handle asid/no asid flushes
- Fix kernel flush bug where I used to pass 0 instead of x0, big thanks to Samuel for finding that!

Changes in v3:
- Add RB from Andrew, thanks!
- Unwrap a few lines, as suggested by Andrew
- Introduce defines for -1 constants used in tlbflush.c, as suggested by Andrew and Conor
- Use huge_page_size() directly instead of using the shift, as suggested by Andrew
- Remove misleading comments as suggested by Conor

Changes in v2:
- Make tlb_flush_all_threshold static; we'll figure out later how to
  override this value on a vendor basis, as suggested by Conor and Palmer
- Fix nommu build, as reported by Conor

Alexandre Ghiti (4):
  riscv: Improve tlb_flush()
  riscv: Improve flush_tlb_range() for hugetlb pages
  riscv: Make __flush_tlb_range() loop over pte instead of flushing the
    whole tlb
  riscv: Improve flush_tlb_kernel_range()

 arch/riscv/include/asm/sbi.h      |   3 -
 arch/riscv/include/asm/tlb.h      |   8 +-
 arch/riscv/include/asm/tlbflush.h |  15 ++-
 arch/riscv/kernel/sbi.c           |  32 ++----
 arch/riscv/mm/tlbflush.c          | 181 +++++++++++++++++++-----------
 5 files changed, 144 insertions(+), 95 deletions(-)

Comments

Nadav Amit Oct. 30, 2023, 2:01 p.m. UTC | #1
> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> 
> +			on_each_cpu_mask(cmask,
> +					 __ipi_flush_tlb_range_asid,
> +					 &ftd, 1);
> 

Unrelated, but having fed on the stack might cause it to be unaligned to
the cacheline, which in x86 we have seen introduces some overhead.

Actually, it is best not to put it on the stack, if possible to reduce
cache traffic.
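
For reference, the pattern I'm referring to looks roughly like this (a
condensed sketch: the struct and the IPI handler follow the patch, the
caller name below is just for illustration):

struct flush_tlb_range_data {
	unsigned long asid;
	unsigned long start;
	unsigned long size;
	unsigned long stride;	/* four unsigned longs, i.e. 32 bytes on rv64 */
};

static void __ipi_flush_tlb_range_asid(void *info)
{
	struct flush_tlb_range_data *d = info;

	local_flush_tlb_range_asid(d->start, d->size, d->stride, d->asid);
}

static void broadcast_flush(const struct cpumask *cmask, unsigned long asid,
			    unsigned long start, unsigned long size,
			    unsigned long stride)
{
	/* ftd lives in this function's stack frame... */
	struct flush_tlb_range_data ftd = {
		.asid	= asid,
		.start	= start,
		.size	= size,
		.stride	= stride,
	};

	/* ...and every CPU in cmask dereferences &ftd from IPI context */
	on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, &ftd, 1);
}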
patchwork-bot+linux-riscv@kernel.org Nov. 7, 2023, 6:50 a.m. UTC | #2
Hello:

This series was applied to riscv/linux.git (for-next)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Mon, 30 Oct 2023 14:30:24 +0100 you wrote:
> This series optimizes the tlb flushes on riscv, which used to simply
> flush the whole tlb regardless of the size of the range to flush or the
> size of the stride.
> 
> Patch 3 introduces a threshold that is microarchitecture specific and
> will very likely be modified by vendors; I'm not sure yet which mechanism
> we'll use to do that (dt? alternatives? vendor initialization code?).
> 
> [...]

Here is the summary with links:
  - [v6,1/4] riscv: Improve tlb_flush()
    https://git.kernel.org/riscv/c/c5e9b2c2ae82
  - [v6,2/4] riscv: Improve flush_tlb_range() for hugetlb pages
    https://git.kernel.org/riscv/c/c962a6e74639
  - [v6,3/4] riscv: Make __flush_tlb_range() loop over pte instead of flushing the whole tlb
    https://git.kernel.org/riscv/c/9d4e8d5fa7db
  - [v6,4/4] riscv: Improve flush_tlb_kernel_range()
    https://git.kernel.org/riscv/c/5e22bfd520ea

You are awesome, thank you!
Palmer Dabbelt Nov. 7, 2023, 7 a.m. UTC | #3
On Mon, 30 Oct 2023 07:01:48 PDT (-0700), nadav.amit@gmail.com wrote:
>
>> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>> 
>> +			on_each_cpu_mask(cmask,
>> +					 __ipi_flush_tlb_range_asid,
>> +					 &ftd, 1);
>> 
>
> Unrelated, but having fed

Do you mean `ftd`?

If so I'm not all that convinced that's a problem: sure it's 4x`long`, 
so we pass it on the stack instead of registers, but otherwise we'd need 
another `on_each_cpu_mask()` callback to shim stuff through via 
registers.

> on the stack might cause it to be unaligned to
> the cacheline, which in x86 we have seen introduces some overhead.

We have 128-bit stack alignment on RISC-V, so the elements are at least 
aligned.  Since they're just being loaded up as scalars for the next 
function call I'm not sure the alignment is all that exciting here.

> Actually, it is best not to put it on the stack, if possible to reduce
> cache traffic.

Sorry if I'm just missing something, but I'm not convinced this is a 
measurable performance problem.
Nadav Amit Nov. 7, 2023, 8:38 a.m. UTC | #4
> On Nov 7, 2023, at 9:00 AM, Palmer Dabbelt <palmer@dabbelt.com> wrote:
> 
> On Mon, 30 Oct 2023 07:01:48 PDT (-0700), nadav.amit@gmail.com wrote:
>> 
>>> On Oct 30, 2023, at 3:30 PM, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>>> +			on_each_cpu_mask(cmask,
>>> +					 __ipi_flush_tlb_range_asid,
>>> +					 &ftd, 1);
>> 
>> Unrelated, but having fed
> 
> Do you mean `ftd`?
> 
> If so I'm not all that convinced that's a problem: sure it's 4x`long`, so we pass it on the stack instead of registers, but otherwise we'd need another `on_each_cpu_mask()` callback to shim stuff through via registers.

I have no idea why you need to move stuff through the registers.

>> Actually, it is best not to put it on the stack, if possible to reduce
>> cache traffic.
> 
> Sorry if I'm just missing something, but I'm not convinced this is a measurable performance problem.

I am not going to try to convince you (I ran the numbers on x86 a long
time ago).

There is a cost to bouncing cache lines (because multiple cores access
the stack) and to TLB misses on remote cores (which are mostly avoidable
if ftd is global).

Having said that, the optimizations you added now and intend to add in
the next steps are definitely more important for performance.
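
To illustrate what I mean, here is one possible shape (purely a sketch,
not a proposal: the per-CPU variable and the wrapper below are
hypothetical): keep the descriptor the remote cores read in dedicated,
cacheline-aligned storage instead of a live stack frame.

/* Hypothetical alternative: keep the descriptor off the caller's stack */
static DEFINE_PER_CPU_ALIGNED(struct flush_tlb_range_data, flush_tlb_data);

static void broadcast_flush_nostack(const struct cpumask *cmask,
				    unsigned long asid, unsigned long start,
				    unsigned long size, unsigned long stride)
{
	/* get_cpu_ptr() disables preemption, so this CPU keeps its slot */
	struct flush_tlb_range_data *ftd = get_cpu_ptr(&flush_tlb_data);

	ftd->asid = asid;
	ftd->start = start;
	ftd->size = size;
	ftd->stride = stride;

	/* wait == 1, so the slot is not reused before every target is done */
	on_each_cpu_mask(cmask, __ipi_flush_tlb_range_asid, ftd, 1);

	put_cpu_ptr(&flush_tlb_data);
}

This ignores flushes issued from interrupt context reusing the same
slot; it is only meant to show the data placement.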