diff mbox series

KVM: arm64: Limit stage2_apply_range() batch size to smallest block

Message ID ebf0fac84cb1d19bdc6e73576e3cc40a9cab0635.1711649501.git.kjlx@templeofstupid.com (mailing list archive)
State New, archived
Headers show
Series KVM: arm64: Limit stage2_apply_range() batch size to smallest block | expand

Commit Message

Krister Johansen March 28, 2024, 7:05 p.m. UTC
stage2_apply_range() for unmap operations can interfere with the
performance of IO if the device's interrupts share the CPU where the
unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
stage2_apply_range() batch size to largest block") improved this.  Prior
to that commit, workloads that were unfortunate enough to have their IO
interrupts pinned to the same CPU as the unmap operation would observe a
complete stall.  With the switch to using the largest block size, it is
possible for IO to make progress, albeit at a reduced speed.

This author tested network and storage where the interrupts were pinned
to the same CPU where an unmap was occurring and found that throughput
was reduced about 4.75-5.8x for networking, and 65.5x-500x for storage.

The use-case where this has been especially painful is with hardware
virtualized containers.  Many containers have a short lifetime and may
be run on systems where the host is intentionally oversubscribed. This
limits the options for pinning and prefaulting.  Although NIC interrupts
allow their CPU affinity to be altered, some NVMe devices do not permit
it.  Some cloud-block storage devices have only a few queues, which
means unlucky placement can have high performance impact.

Further reducing the stage2_apply_range() batch size has substantial
performance improvements for IO that share a CPU performing an unmap
operation.  By switching to a 2mb chunk, IO performance regressions were
no longer observed in this author's tests.  E.g. it was possible to
obtain the advertised device throughput despite an unmap operation
occurring on the CPU where the interrupt was running.  There is a
tradeoff, however.  No changes were observed in per-operation timings
when running the kvm_pagetable_test without an interrupt load.  However,
with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
by about 15% and unmap times increased by about 58%.  In essence, this
trades slower map/unmap times for improved IO throughput.

This introduces KVM_PGTABLE_MAX_BLOCK_LEVEL, and then uses it to limit
the size of stage2_apply_range() chunks to the smallest size that's
addressable via a block mapping -- 2mb on a 4k granule size.

Cc: <stable@vger.kernel.org> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
Cc: <stable@vger.kernel.org> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
Cc: <stable@vger.kernel.org> # 5.15.x
Suggested-by: Ali Saidi <alisaidi@amazon.com>
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
 arch/arm64/kvm/mmu.c                 | 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

Comments

Oliver Upton March 29, 2024, 1:48 p.m. UTC | #1
Hi Krister,

On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> stage2_apply_range() for unmap operations can interfere with the
> performance of IO if the device's interrupts share the CPU where the
> unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> stage2_apply_range() batch size to largest block") improved this.  Prior
> to that commit, workloads that were unfortunate enough to have their IO
> interrupts pinned to the same CPU as the unmap operation would observe a
> complete stall.  With the switch to using the largest block size, it is
> possible for IO to make progress, albeit at a reduced speed.

Can you describe the workload a bit more? I'm having a hard time
understanding how you're unmapping that much memory on the fly in
your workload. Is guest memory getting swapped? Are VMs being torn
down?

Also, it seems a bit odd to steer interrupts *into* the workload you
care about...

> Further reducing the stage2_apply_range() batch size has substantial
> performance improvements for IO that share a CPU performing an unmap
> operation.  By switching to a 2mb chunk, IO performance regressions were
> no longer observed in this author's tests.  E.g. it was possible to
> obtain the advertised device throughput despite an unmap operation
> occurring on the CPU where the interrupt was running.  There is a
> tradeoff, however.  No changes were observed in per-operation timings
> when running the kvm_pagetable_test without an interrupt load.  However,
> with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> by about 15% and unmap times increased by about 58%.  In essence, this
> trades slower map/unmap times for improved IO throughput.

There are other users of the range-based operations, like
write-protection. Live migration is especially sensitive to the latency
of page table updates as it can affect the VMM's ability to converge
with the guest.

> Cc: <stable@vger.kernel.org> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
> Cc: <stable@vger.kernel.org> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
> Cc: <stable@vger.kernel.org> # 5.15.x

This is a performance improvement, *not* a correctness fix. Please don't
cc stable for it.

> Suggested-by: Ali Saidi <alisaidi@amazon.com>
> Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
>  arch/arm64/kvm/mmu.c                 | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 19278dfe7978..b0c4651a4d9a 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -19,11 +19,15 @@
>   *  - 4K (level 1):	1GB
>   *  - 16K (level 2):	32MB
>   *  - 64K (level 2):	512MB
> + *
> + *  The max block level is the _smallest_ supported block size for KVM.

This feels like a non sequitur given the old comment is left in place...

>   */
>  #ifdef CONFIG_ARM64_4K_PAGES
>  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
> +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	2
>  #else
>  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
> +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	KVM_PGTABLE_MIN_BLOCK_LEVEL
>  #endif
>  
>  #define kvm_lpa2_is_enabled()		system_supports_lpa2()
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..1e927b306aee 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -41,7 +41,7 @@ static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
>  
>  static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
>  {
> -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
>  
>  	return __stage2_range_addr_end(addr, end, size);
>  }

This doesn't feel right to me. A property that we had before is that
leaf entries are visited at most once, since every mapping size was
evenly divisible into KVM_PGTABLE_MIN_BLOCK_LEVEL.

Seems like we could wind up visiting a PUD mapping 512 times, at least
for 4K pages.
Krister Johansen March 29, 2024, 7:15 p.m. UTC | #2
Hi Oliver,
Thanks for the response.

On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > stage2_apply_range() for unmap operations can interfere with the
> > performance of IO if the device's interrupts share the CPU where the
> > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > stage2_apply_range() batch size to largest block") improved this.  Prior
> > to that commit, workloads that were unfortunate enough to have their IO
> > interrupts pinned to the same CPU as the unmap operation would observe a
> > complete stall.  With the switch to using the largest block size, it is
> > possible for IO to make progress, albeit at a reduced speed.
> 
> Can you describe the workload a bit more? I'm having a hard time
> understanding how you're unmapping that much memory on the fly in
> your workload. Is guest memory getting swapped? Are VMs being torn
> down?

Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
causing the problems.  The container VMs don't have long lifetimes, but
some may be up to 256Gb in size, depending on the user.  The workloads
running the VMs aren't especially performance sensitive, but their users
do notice when network connections time-out.  IOW, if the performance is
bad enough to temporarily prevent new TCP connections from being
established or requests / responses being recieved in a timely fashion,
we'll hear about it.  Users deploy their services a lot, so there's a
lot of container vm churn.  (Really it's automation redeploying the
services on behalf of the users in response to new commits to their
repos...)

> Also, it seems a bit odd to steer interrupts *into* the workload you
> care about...

Ah, that was only intentionally done for the purposes of measuring the
impact.  That's not done on purpose in production.

Nevertheless, the example we tend to run into is that a box may have 2
NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
What happens in practice is that a VM will get torn down, and that has a
1-in-64 chance of impacting the performance of the subset of the flows
that are mapped via RSS to the interrupt that happens to be assigned to
the CPU where the VM is being torn down.

Of course, the obvious next question is why not just bind the VMs flows
to the CPUs the VM is running on?  We don't have a 1:1 mapping of
network device to VM, or VM to CPU right now, which frustrates this
approach.

> > Further reducing the stage2_apply_range() batch size has substantial
> > performance improvements for IO that share a CPU performing an unmap
> > operation.  By switching to a 2mb chunk, IO performance regressions were
> > no longer observed in this author's tests.  E.g. it was possible to
> > obtain the advertised device throughput despite an unmap operation
> > occurring on the CPU where the interrupt was running.  There is a
> > tradeoff, however.  No changes were observed in per-operation timings
> > when running the kvm_pagetable_test without an interrupt load.  However,
> > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > by about 15% and unmap times increased by about 58%.  In essence, this
> > trades slower map/unmap times for improved IO throughput.
> 
> There are other users of the range-based operations, like
> write-protection. Live migration is especially sensitive to the latency
> of page table updates as it can affect the VMM's ability to converge
> with the guest.

To be clear, the reduction in performance was observed when I
concurrently executed both the kvm_pagetable_test and a networking
benchmark where the NIC's interrupts were assigned to the same CPU where
the pagetable test was executing.  I didn't see a slowdown just running
the pagetable test.

> > Cc: <stable@vger.kernel.org> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
> > Cc: <stable@vger.kernel.org> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
> > Cc: <stable@vger.kernel.org> # 5.15.x
> 
> This is a performance improvement, *not* a correctness fix. Please don't
> cc stable for it.

Apologies.  I consulted the Stable Rules[1] before applying these tags and
the guidance it gave was just that "It must either fix a real bug that
bothers people."

In our case, the teardown causes TCP throughput to drop from 9.5Gbps to
about 2Gbps during a teardown, which is something that does bother our
users.

> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
> >  arch/arm64/kvm/mmu.c                 | 2 +-
> >  2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 19278dfe7978..b0c4651a4d9a 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -19,11 +19,15 @@
> >   *  - 4K (level 1):	1GB
> >   *  - 16K (level 2):	32MB
> >   *  - 64K (level 2):	512MB
> > + *
> > + *  The max block level is the _smallest_ supported block size for KVM.
> 
> This feels like a non sequitur given the old comment is left in place...

I'll fix if we keep this approach.  Is the objection to the name
KVM_PGTABLE_MAX_BLOCK_LEVEL or just the comment?

> >   */
> >  #ifdef CONFIG_ARM64_4K_PAGES
> >  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
> > +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	2
> >  #else
> >  #define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
> > +#define KVM_PGTABLE_MAX_BLOCK_LEVEL	KVM_PGTABLE_MIN_BLOCK_LEVEL
> >  #endif
> >  
> >  #define kvm_lpa2_is_enabled()		system_supports_lpa2()
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index dc04bc767865..1e927b306aee 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -41,7 +41,7 @@ static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> >  
> >  static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> >  {
> > -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
> >  
> >  	return __stage2_range_addr_end(addr, end, size);
> >  }
> 
> This doesn't feel right to me. A property that we had before is that
> leaf entries are visited at most once, since every mapping size was
> evenly divisible into KVM_PGTABLE_MIN_BLOCK_LEVEL.
> 
> Seems like we could wind up visiting a PUD mapping 512 times, at least
> for 4K pages.

I have an idea, but it seems to go against the current design of the
pagtable walkers.  My sense was that they've been written to
discourage passing mutable state to the function that calls
kvm_pgtable_walk().  If we were willing to permit this, it seems like we
could leverage __kvm_pgtable_visit()'s knowledge about the size of the
mapping it walked to determine whether range_addr_end should be
incremented by our BLOCK_LEVEL constant, or advanced to the end of the
mapping that was already successfully walked.  (If I'm reading right,
anyway)  Does that seem like a reasonable approach?

If we do modify the walk to allow state to be passed back, I have a
second patch I'd like to send you.  Ali found that there was a
performance regression on the kvm_pagetable_test on the map creation
step when a large number of threads operated on a comparatively small
memory range.  (E.g. 64 cpus and 8g of RAM).  We debugged this a bit and
found that there's an unmap in the map creation step if the test ends up
instantiating a readable zero page that needs to be copied and made
writable.  With the deferred TLBI logic, the tlb invalidation happens at
the end of the unmap operation whether a PTE is cleared or not.  With so
many threads, this doesn't always suceeed. The prior approach of just
doing the invalidation in stage2_unmap_put_pte() outperforms the
deferred invalidation, because stage2_unmap_put_pte() only calls
__kvm_tlb_flush_vmid_ipa() if it clears a valid PTE.  If we modify the
walk to keep state on whether any PTEs are successfully cleared, and
condition the deferred invalidation on that state, we obtain performance
that is equivalent to the pre range based deferred invalidation
approach.

Thanks,

-K
Marc Zyngier March 30, 2024, 10:17 a.m. UTC | #3
On Fri, 29 Mar 2024 19:15:37 +0000,
Krister Johansen <kjlx@templeofstupid.com> wrote:
> 
> Hi Oliver,
> Thanks for the response.
> 
> On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > stage2_apply_range() for unmap operations can interfere with the
> > > performance of IO if the device's interrupts share the CPU where the
> > > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > > stage2_apply_range() batch size to largest block") improved this.  Prior
> > > to that commit, workloads that were unfortunate enough to have their IO
> > > interrupts pinned to the same CPU as the unmap operation would observe a
> > > complete stall.  With the switch to using the largest block size, it is
> > > possible for IO to make progress, albeit at a reduced speed.
> > 
> > Can you describe the workload a bit more? I'm having a hard time
> > understanding how you're unmapping that much memory on the fly in
> > your workload. Is guest memory getting swapped? Are VMs being torn
> > down?
> 
> Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
> causing the problems.  The container VMs don't have long lifetimes, but
> some may be up to 256Gb in size, depending on the user.  The workloads
> running the VMs aren't especially performance sensitive, but their users
> do notice when network connections time-out.  IOW, if the performance is
> bad enough to temporarily prevent new TCP connections from being
> established or requests / responses being recieved in a timely fashion,
> we'll hear about it.  Users deploy their services a lot, so there's a
> lot of container vm churn.  (Really it's automation redeploying the
> services on behalf of the users in response to new commits to their
> repos...)

I think this advocates for a teardown-specific code path rather than
just relying on the usual S2 unmapping which is really designed for
eviction. There are two things to consider here:

- TLB invalidation: this should only take a single VMALLS12E1, rather
  than iterating over the PTs

- Cache maintenance: this could be elided with FWB, or *optionally*
  elided if userspace buys in a "I don't need to see the memory of the
  guest after teardown" type of behaviour

> > Also, it seems a bit odd to steer interrupts *into* the workload you
> > care about...
> 
> Ah, that was only intentionally done for the purposes of measuring the
> impact.  That's not done on purpose in production.
> 
> Nevertheless, the example we tend to run into is that a box may have 2
> NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
> interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
> What happens in practice is that a VM will get torn down, and that has a
> 1-in-64 chance of impacting the performance of the subset of the flows
> that are mapped via RSS to the interrupt that happens to be assigned to
> the CPU where the VM is being torn down.
> 
> Of course, the obvious next question is why not just bind the VMs flows
> to the CPUs the VM is running on?  We don't have a 1:1 mapping of
> network device to VM, or VM to CPU right now, which frustrates this
> approach.
> 
> > > Further reducing the stage2_apply_range() batch size has substantial
> > > performance improvements for IO that share a CPU performing an unmap
> > > operation.  By switching to a 2mb chunk, IO performance regressions were
> > > no longer observed in this author's tests.  E.g. it was possible to
> > > obtain the advertised device throughput despite an unmap operation
> > > occurring on the CPU where the interrupt was running.  There is a
> > > tradeoff, however.  No changes were observed in per-operation timings
> > > when running the kvm_pagetable_test without an interrupt load.  However,
> > > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > > by about 15% and unmap times increased by about 58%.  In essence, this
> > > trades slower map/unmap times for improved IO throughput.
> > 
> > There are other users of the range-based operations, like
> > write-protection. Live migration is especially sensitive to the latency
> > of page table updates as it can affect the VMM's ability to converge
> > with the guest.
> 
> To be clear, the reduction in performance was observed when I
> concurrently executed both the kvm_pagetable_test and a networking
> benchmark where the NIC's interrupts were assigned to the same CPU where
> the pagetable test was executing.  I didn't see a slowdown just running
> the pagetable test.

Any chance you could share more details about your HW configuration
(what CPU is that?)  and the type of traffic? This is the sort of
things I'd like to be able to reproduce in order to experiment various
strategies.

Thanks,

	M.
Krister Johansen April 2, 2024, 5 p.m. UTC | #4
Hi Marc,

On Sat, Mar 30, 2024 at 10:17:43AM +0000, Marc Zyngier wrote:
> On Fri, 29 Mar 2024 19:15:37 +0000,
> Krister Johansen <kjlx@templeofstupid.com> wrote:
> > On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > > stage2_apply_range() for unmap operations can interfere with the
> > > > performance of IO if the device's interrupts share the CPU where the
> > > > unmap operation is occurring.  commit 5994bc9e05c2 ("KVM: arm64: Limit
> > > > stage2_apply_range() batch size to largest block") improved this.  Prior
> > > > to that commit, workloads that were unfortunate enough to have their IO
> > > > interrupts pinned to the same CPU as the unmap operation would observe a
> > > > complete stall.  With the switch to using the largest block size, it is
> > > > possible for IO to make progress, albeit at a reduced speed.
> > > 
> > > Can you describe the workload a bit more? I'm having a hard time
> > > understanding how you're unmapping that much memory on the fly in
> > > your workload. Is guest memory getting swapped? Are VMs being torn
> > > down?
> > 
> > Sorry I wasn't clear here.  Yes, it's the VMs getting torn down that's
> > causing the problems.  The container VMs don't have long lifetimes, but
> > some may be up to 256Gb in size, depending on the user.  The workloads
> > running the VMs aren't especially performance sensitive, but their users
> > do notice when network connections time-out.  IOW, if the performance is
> > bad enough to temporarily prevent new TCP connections from being
> > established or requests / responses being recieved in a timely fashion,
> > we'll hear about it.  Users deploy their services a lot, so there's a
> > lot of container vm churn.  (Really it's automation redeploying the
> > services on behalf of the users in response to new commits to their
> > repos...)
> 
> I think this advocates for a teardown-specific code path rather than
> just relying on the usual S2 unmapping which is really designed for
> eviction. There are two things to consider here:
> 
> - TLB invalidation: this should only take a single VMALLS12E1, rather
>   than iterating over the PTs
> 
> - Cache maintenance: this could be elided with FWB, or *optionally*
>   elided if userspace buys in a "I don't need to see the memory of the
>   guest after teardown" type of behaviour

This approach would work for this workload, I think.  The hardware
supports FWB and AFAIK isn't looking at the guest memory after teardown.
This is also desirable because in the future we'd like to support
hotplug of VFIO devices. A separate path for unmap the memory used by
the device vs unmap all of the guest seems smart.

> > > Also, it seems a bit odd to steer interrupts *into* the workload you
> > > care about...
> > 
> > Ah, that was only intentionally done for the purposes of measuring the
> > impact.  That's not done on purpose in production.
> > 
> > Nevertheless, the example we tend to run into is that a box may have 2
> > NICs and each NIC has 32 Tx-Rx queues.  This means we've got 64 NIC
> > interrupts, each assigned to a different CPU.  Our systems have 64 CPUs.
> > What happens in practice is that a VM will get torn down, and that has a
> > 1-in-64 chance of impacting the performance of the subset of the flows
> > that are mapped via RSS to the interrupt that happens to be assigned to
> > the CPU where the VM is being torn down.
> > 
> > Of course, the obvious next question is why not just bind the VMs flows
> > to the CPUs the VM is running on?  We don't have a 1:1 mapping of
> > network device to VM, or VM to CPU right now, which frustrates this
> > approach.
> > 
> > > > Further reducing the stage2_apply_range() batch size has substantial
> > > > performance improvements for IO that share a CPU performing an unmap
> > > > operation.  By switching to a 2mb chunk, IO performance regressions were
> > > > no longer observed in this author's tests.  E.g. it was possible to
> > > > obtain the advertised device throughput despite an unmap operation
> > > > occurring on the CPU where the interrupt was running.  There is a
> > > > tradeoff, however.  No changes were observed in per-operation timings
> > > > when running the kvm_pagetable_test without an interrupt load.  However,
> > > > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > > > by about 15% and unmap times increased by about 58%.  In essence, this
> > > > trades slower map/unmap times for improved IO throughput.
> > > 
> > > There are other users of the range-based operations, like
> > > write-protection. Live migration is especially sensitive to the latency
> > > of page table updates as it can affect the VMM's ability to converge
> > > with the guest.
> > 
> > To be clear, the reduction in performance was observed when I
> > concurrently executed both the kvm_pagetable_test and a networking
> > benchmark where the NIC's interrupts were assigned to the same CPU where
> > the pagetable test was executing.  I didn't see a slowdown just running
> > the pagetable test.
> 
> Any chance you could share more details about your HW configuration
> (what CPU is that?)  and the type of traffic? This is the sort of
> things I'd like to be able to reproduce in order to experiment various
> strategies.

Sure, I only have access to documentation that is publicly available.

The hardware where we ran into this inititally was Graviton 3, which is
a Neoverse-V1 based core.  It does not support FEAT_TLBIRANGE.  I've
also tested on Graviton 4, which is Neoverse-V2 based.  It _does_
support FEAT_TLBIRANGE.  The deferred range based invalidation
support, was enough to allow us to teardown a large VM based on 4k pages
and not incur a visible performance penalty.  I haven't had a chance to
test to see if and how Will's patches change this, though.

The tests themselves were not especially fancy. The networking hardware
was a ENA device on an EC2 box with 30Gbps limit (5/10 Gbps per flow,
depending on the config).  The storage tested was a gp3 EBS device
configured to max IOPS/throughput (16,000 IOPS / 1000Mb/s).

Networking tests were iperf3 with a 9001 byte packet size.  The storage
tests were fio's randwrite workload in directio mode using the libaio
backend.  The "IOPS" test used a 4k blocksize and a queue depth of 128.
The "throughput" test used a blocksize of 64k and an iodepth of 32.  For
the fio tests, it was a 10gb file and 2 workers, mostly because the EBS
devices have two hardware queues for data.

I ran the kvm_page_table_test with a few different sizes, but settled on
64G with 1 vcpu for most tests.

Let me know if there's anything else I can share here.

-K
Krister Johansen April 4, 2024, 4:40 a.m. UTC | #5
On Tue, Apr 02, 2024 at 10:00:53AM -0700, Krister Johansen wrote:
> On Sat, Mar 30, 2024 at 10:17:43AM +0000, Marc Zyngier wrote:
> > On Fri, 29 Mar 2024 19:15:37 +0000,
> > Krister Johansen <kjlx@templeofstupid.com> wrote:
> > > On Fri, Mar 29, 2024 at 06:48:38AM -0700, Oliver Upton wrote:
> > > > On Thu, Mar 28, 2024 at 12:05:08PM -0700, Krister Johansen wrote:
> > > > > Further reducing the stage2_apply_range() batch size has substantial
> > > > > performance improvements for IO that share a CPU performing an unmap
> > > > > operation.  By switching to a 2mb chunk, IO performance regressions were
> > > > > no longer observed in this author's tests.  E.g. it was possible to
> > > > > obtain the advertised device throughput despite an unmap operation
> > > > > occurring on the CPU where the interrupt was running.  There is a
> > > > > tradeoff, however.  No changes were observed in per-operation timings
> > > > > when running the kvm_pagetable_test without an interrupt load.  However,
> > > > > with a 64gb VM, 1 vcpu, and 4k pages and a IO load, map times increased
> > > > > by about 15% and unmap times increased by about 58%.  In essence, this
> > > > > trades slower map/unmap times for improved IO throughput.
> > > > 
> > > > There are other users of the range-based operations, like
> > > > write-protection. Live migration is especially sensitive to the latency
> > > > of page table updates as it can affect the VMM's ability to converge
> > > > with the guest.
> > > 
> > > To be clear, the reduction in performance was observed when I
> > > concurrently executed both the kvm_pagetable_test and a networking
> > > benchmark where the NIC's interrupts were assigned to the same CPU where
> > > the pagetable test was executing.  I didn't see a slowdown just running
> > > the pagetable test.
> > 
> > Any chance you could share more details about your HW configuration
> > (what CPU is that?)  and the type of traffic? This is the sort of
> > things I'd like to be able to reproduce in order to experiment various
> > strategies.
> 
> Sure, I only have access to documentation that is publicly available.
> 
> The hardware where we ran into this inititally was Graviton 3, which is
> a Neoverse-V1 based core.  It does not support FEAT_TLBIRANGE.  I've
> also tested on Graviton 4, which is Neoverse-V2 based.  It _does_
> support FEAT_TLBIRANGE.  The deferred range based invalidation
> support, was enough to allow us to teardown a large VM based on 4k pages
> and not incur a visible performance penalty.  I haven't had a chance to
> test to see if and how Will's patches change this, though.

Just a quick followup that I did test Will's patches and didn't find
that it changed the performance of the workload that I'd been testing.
IOW, I wasn't able to discern a network performance difference between
the baseline and those changes.

Thanks,

-K
Ali Saidi April 4, 2024, 9:27 p.m. UTC | #6
I measured the time it takes to unmap a VM by changing the kvm_page_table_test
to report it. It's and on Graviton3 it's about 300ms per 1GB of flushing
with 4KB pages. Unmapping 128GB takes around 39.5s and with a single call to
__kvm_tlb_flush_vmid() instead of the 32M calls to __kvm_tlb_flush_vmid_ipa()
reduces this to around 5.9s (~7x). This means each iteration of the
stage2_apply_range() is reduced to 46ms. So we're certainly calling
cond_resched() a whole lot more. 


> Just a quick followup that I did test Will's patches and didn't find
> that it changed the performance of the workload that I'd been testing.
> IOW, I wasn't able to discern a network performance difference between
> the baseline and those changes.

That is a bit unexpected that the performance wasn't worse with the patch Will
sent because it should have disabled the range invalidates since they these 
invalidates will be getting rid of blocks?  Which Graviton were you testing
this on? 

Ali
Krister Johansen April 4, 2024, 9:41 p.m. UTC | #7
On Thu, Apr 04, 2024 at 09:27:42PM +0000, Ali Saidi wrote:
> > Just a quick followup that I did test Will's patches and didn't find
> > that it changed the performance of the workload that I'd been testing.
> > IOW, I wasn't able to discern a network performance difference between
> > the baseline and those changes.
> 
> That is a bit unexpected that the performance wasn't worse with the patch Will
> sent because it should have disabled the range invalidates since they these 
> invalidates will be getting rid of blocks?  Which Graviton were you testing
> this on? 

Sorry I didn't mention it earlier.  This was on a Graviton 4 with
FEAT_TLBIRANGE.  Due to the placement of the test machine and the client
the max single flow rate was 5Gbps, and with both the baseline and
Will's patches I wasn't able to discern any slowdown there, at least in
terms of impact to the adjacent networking traffic.  I saw an approx 1%
slowdown with multiple flows at 10Gbps, but didn't have a baseline for
that test case so was hesitant to offer it as conclusive.  I should be
clear that I _wasn't_ measuring the the teardown times in this test
-- just how the teardown itself impacted the networking performance.

-K
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe7978..b0c4651a4d9a 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -19,11 +19,15 @@ 
  *  - 4K (level 1):	1GB
  *  - 16K (level 2):	32MB
  *  - 64K (level 2):	512MB
+ *
+ *  The max block level is the _smallest_ supported block size for KVM.
  */
 #ifdef CONFIG_ARM64_4K_PAGES
 #define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
+#define KVM_PGTABLE_MAX_BLOCK_LEVEL	2
 #else
 #define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
+#define KVM_PGTABLE_MAX_BLOCK_LEVEL	KVM_PGTABLE_MIN_BLOCK_LEVEL
 #endif
 
 #define kvm_lpa2_is_enabled()		system_supports_lpa2()
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..1e927b306aee 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -41,7 +41,7 @@  static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
 
 static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
 {
-	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
 
 	return __stage2_range_addr_end(addr, end, size);
 }