[00/10] cputlb: track dirty tlbs and general cleanup

Message ID 20181023070253.6407-1-richard.henderson@linaro.org (mailing list archive)
Series cputlb: track dirty tlbs and general cleanup

Message

Richard Henderson Oct. 23, 2018, 7:02 a.m. UTC
The motivation here is reducing the total tlb flush overhead.

Before a few patches went into target-arm.next, I measured total
tlb flush overhead for aarch64 at 25%.  This appears to reduce the
total overhead to about 5% (I do need to re-run the control tests,
not just watch perf top as I'm doing now).

The final patch is somewhat of an RFC.  I'd like to know what
benchmark was used when putting in pending_tlb_flushes, and I
have not done any archaeology to find out.  I suspect that it
does not make any measurable difference beyond tlb_c.dirty, and I
think the code is a bit cleaner without it.
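
(For context, a rough sketch of the tlb_c.dirty filtering idea that
patches 8-9 introduce; the struct and function names below are
simplified stand-ins, not the actual QEMU code.)

#include <stddef.h>
#include <stdint.h>

/* One bit per mmu_idx whose TLB has had an entry installed since its
 * last flush.  A flush request that only covers clean TLBs can be
 * elided entirely; only the dirty subset needs the reset work.
 */
typedef struct TLBCommonSketch {
    uint16_t dirty;            /* mmu_idx bits with live entries */
    size_t full_flush_count;   /* flushes actually performed */
    size_t elide_flush_count;  /* flushes skipped because already clean */
} TLBCommonSketch;

/* Called whenever a TLB entry is installed for mmu_idx. */
static inline void tlb_set_dirty_sketch(TLBCommonSketch *c, int mmu_idx)
{
    c->dirty |= 1u << mmu_idx;
}

/* Flush only the requested TLBs that are actually dirty. */
static void tlb_flush_by_mmuidx_sketch(TLBCommonSketch *c, uint16_t asked)
{
    uint16_t to_clean = asked & c->dirty;

    if (to_clean == 0) {
        c->elide_flush_count++;   /* nothing to do, all requested TLBs clean */
        return;
    }
    c->dirty &= ~to_clean;
    c->full_flush_count++;
    /* ... reset the per-mmu_idx tables for each bit set in to_clean ... */
}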


r~


Richard Henderson (10):
  cputlb: Move tlb_lock to CPUTLBCommon
  cputlb: Remove tcg_enabled hack from tlb_flush_nocheck
  cputlb: Move cpu->pending_tlb_flush to env->tlb_c.pending_flush
  cputlb: Split large page tracking per mmu_idx
  cputlb: Move env->vtlb_index to env->tlb_d.vindex
  cputlb: Merge tlb_flush_nocheck into tlb_flush_by_mmuidx_async_work
  cputlb: Merge tlb_flush_page into tlb_flush_page_by_mmuidx
  cputlb: Count "partial" and "elided" tlb flushes
  cputlb: Filter flushes on already clean tlbs
  cputlb: Remove tlb_c.pending_flushes

 include/exec/cpu-defs.h   |  51 +++++-
 include/exec/cputlb.h     |   2 +-
 include/qom/cpu.h         |   6 -
 accel/tcg/cputlb.c        | 354 +++++++++++++++-----------------------
 accel/tcg/translate-all.c |   8 +-
 5 files changed, 184 insertions(+), 237 deletions(-)

Comments

Emilio Cota Oct. 23, 2018, 5:11 p.m. UTC | #1
On Tue, Oct 23, 2018 at 08:02:42 +0100, Richard Henderson wrote:
> The motivation here is reducing the total overhead.
> 
> Before a few patches went into target-arm.next, I measured total
> tlb flush overhead for aarch64 at 25%.  This appears to reduce the
> total overhead to about 5% (I do need to re-run the control tests,
> not just watch perf top as I'm doing now).

I'd like to see those absolute perf numbers; I ran a few Ubuntu aarch64
boots and the noise is just too high to draw any conclusions (I'm
using your tlb-dirty branch on github).

When booting the much smaller Debian image, though, these patches
are performance-neutral. So,
  Reviewed-by: Emilio G. Cota <cota@braap.org>
for the series.

(On a pedantic note: consider s/miniscule/minuscule/ in patches 6-7)

> The final patch is somewhat of an RFC.  I'd like to know what
> benchmark was used when putting in pending_tlb_flushes, and I
> have not done any archaeology to find out.  I suspect that it
> does not make any measurable difference beyond tlb_c.dirty, and I
> think the code is a bit cleaner without it.

I suspect that pending_tlb_flushes was a premature optimization.
Avoiding an async job sounds like a good idea, since it is very
expensive for the remote vCPU.
However, in most cases we take the lock (or, in the original code,
a full barrier) yet still cannot avoid the async job, because a race
when flushing other vCPUs is unlikely; we therefore waste cycles on
the lock (formerly the barrier).
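
(To make that concrete, a simplified sketch of the check-before-queue
pattern being discussed; the names are invented for this illustration
and pthreads stands in for QEMU's own locking, so this is not the
actual cputlb code.)

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Flush work already queued for a remote vCPU, one bit per mmu_idx. */
typedef struct RemoteFlushSketch {
    pthread_mutex_t lock;      /* stands in for the tlb lock */
    uint16_t pending_flush;    /* flush requests already in flight */
} RemoteFlushSketch;

/* Returns true if a new async flush job must be queued for idxmap.
 * The lock (a full barrier in the pre-lock code) is paid on every
 * call, but the early-out only helps when an identical flush is
 * already pending, which is a rare race; in the common case the job
 * is queued anyway and the check is pure overhead.
 */
static bool queue_flush_if_needed(RemoteFlushSketch *r, uint16_t idxmap)
{
    bool need_async;

    pthread_mutex_lock(&r->lock);
    need_async = (r->pending_flush & idxmap) != idxmap;
    r->pending_flush |= idxmap;
    pthread_mutex_unlock(&r->lock);

    return need_async;
}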

Thanks,

		Emilio