arm/arm64: KVM: Perform local TLB invalidation when multiplexing vcpus on a single CPU

Message ID 1477323088-18768-1-git-send-email-marc.zyngier@arm.com (mailing list archive)
State New, archived

Commit Message

Marc Zyngier Oct. 24, 2016, 3:31 p.m. UTC
Architecturally, TLBs are private to the (physical) CPU they're
associated with. But when multiple vcpus from the same VM are
being multiplexed on the same CPU, the TLBs are not private
to the vcpus (and are actually shared across the VMID).

Let's consider the following scenario:

- vcpu-0 maps PA to VA
- vcpu-1 maps PA' to VA

If run on the same physical CPU, vcpu-1 can hit TLB entries generated
by vcpu-0 accesses, and access the wrong physical page.

The solution to this is to keep a per-VM map of which vcpu ran last
on each given physical CPU, and invalidate local TLBs when switching
to a different vcpu from the same VM.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_host.h   |  5 +++++
 arch/arm/include/asm/kvm_hyp.h    |  1 +
 arch/arm/kvm/arm.c                | 35 ++++++++++++++++++++++++++++++++++-
 arch/arm/kvm/hyp/switch.c         |  9 +++++++++
 arch/arm64/include/asm/kvm_host.h |  6 +++++-
 arch/arm64/kvm/hyp/switch.c       |  8 ++++++++
 6 files changed, 62 insertions(+), 2 deletions(-)

Comments

Mark Rutland Oct. 24, 2016, 4:16 p.m. UTC | #1
Hi Marc,

On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
> Architecturally, TLBs are private to the (physical) CPU they're
> associated with. But when multiple vcpus from the same VM are
> being multiplexed on the same CPU, the TLBs are not private
> to the vcpus (and are actually shared across the VMID).
> 
> Let's consider the following scenario:
> 
> - vcpu-0 maps PA to VA
> - vcpu-1 maps PA' to VA
> 
> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
> by vcpu-0 accesses, and access the wrong physical page.

It might be worth noting that this could also lead to TLB conflicts and
other such fun usually associated with missing TLB maintenance,
depending on the two mappings (or the intermediate stages thereof).

> The solution to this is to keep a per-VM map of which vcpu ran last
> on each given physical CPU, and invalidate local TLBs when switching
> to a different vcpu from the same VM.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

Modulo my comment on preemptibility of kvm_arch_sched_in, everything
below is a nit. Assuming that's not preemptible, this looks right to me.

FWIW, with or without the other comments considered:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

> ---
>  arch/arm/include/asm/kvm_host.h   |  5 +++++
>  arch/arm/include/asm/kvm_hyp.h    |  1 +
>  arch/arm/kvm/arm.c                | 35 ++++++++++++++++++++++++++++++++++-
>  arch/arm/kvm/hyp/switch.c         |  9 +++++++++
>  arch/arm64/include/asm/kvm_host.h |  6 +++++-
>  arch/arm64/kvm/hyp/switch.c       |  8 ++++++++
>  6 files changed, 62 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 2d19e02..035e744 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -57,6 +57,8 @@ struct kvm_arch {
>  	/* VTTBR value associated with below pgd and vmid */
>  	u64    vttbr;
>  
> +	int __percpu *last_vcpu_ran;
> +
>  	/* Timer */
>  	struct arch_timer_kvm	timer;
>  
> @@ -174,6 +176,9 @@ struct kvm_vcpu_arch {
>  	/* vcpu power-off state */
>  	bool power_off;
>  
> +	/* TLBI required */
> +	bool requires_tlbi;

A bit of a nit, but it's not clear which class of TLBI is required, or
why. It's probably worth a comment (and perhaps a bikeshed), like:

	/*
	 * Local TLBs potentially contain conflicting entries from
	 * another vCPU within this VMID. All entries for this VMID must
	 * be invalidated from (local) TLBs before we run this vCPU.
	 */
	bool tlb_vmid_stale;

[...]

> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	int *last_ran;
> +
> +	last_ran = per_cpu_ptr(vcpu->kvm->arch.last_vcpu_ran, cpu);
> +
> +	/*
> +	 * If we're very unlucky and get preempted before having ran
> +	 * this vcpu for real, we'll end-up in a situation where any
> +	 * vcpu that gets scheduled will perform an invalidation (this
> +	 * vcpu explicitely requires it, and all the others will have
> +	 * a different vcpu_id).
> +	 */

Nit: s/explicitely/explicitly/

To bikeshed a little further, perhaps:

	/*
	 * We might get preempted before the vCPU actually runs, but
	 * this is fine. Our TLBI stays pending until we actually make
	 * it to __activate_vm, so we won't miss a TLBI. If another vCPU
	 * gets scheduled, it will see our vcpu_id in last_ran, and pend
	 * a TLBI for itself.
	 */

> +	if (*last_ran != vcpu->vcpu_id) {
> +		if (*last_ran != -1)
> +			vcpu->arch.requires_tlbi = true;
> +
> +		*last_ran = vcpu->vcpu_id;
> +	}
> +}

I take it this function is run in some non-preemptible context; if so,
this looks correct to me.

If this is preemptible, then this looks racy.

Thanks,
Mark.
Marc Zyngier Oct. 25, 2016, 10:20 a.m. UTC | #2
Hi Mark,

On 24/10/16 17:16, Mark Rutland wrote:
> Hi Marc,
> 
> On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
>> Architecturally, TLBs are private to the (physical) CPU they're
>> associated with. But when multiple vcpus from the same VM are
>> being multiplexed on the same CPU, the TLBs are not private
>> to the vcpus (and are actually shared across the VMID).
>>
>> Let's consider the following scenario:
>>
>> - vcpu-0 maps PA to VA
>> - vcpu-1 maps PA' to VA
>>
>> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
>> by vcpu-0 accesses, and access the wrong physical page.
> 
> It might be worth noting that this could also lead to TLB conflicts and
> other such fun usually associated with missing TLB maintenance,
> depending on the two mappings (or the intermediate stages thereof).
> 
>> The solution to this is to keep a per-VM map of which vcpu ran last
>> on each given physical CPU, and invalidate local TLBs when switching
>> to a different vcpu from the same VM.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> 
> Modulo my comment on preemptibility of kvm_arch_sched_in, everything
> below is a nit. Assuming that's not preemptible, this looks right to me.
> 
> FWIW, with or without the other comments considered:
> 
> Reviewed-by: Mark Rutland <mark.rutland@arm.com>
> 
>> ---
>>  arch/arm/include/asm/kvm_host.h   |  5 +++++
>>  arch/arm/include/asm/kvm_hyp.h    |  1 +
>>  arch/arm/kvm/arm.c                | 35 ++++++++++++++++++++++++++++++++++-
>>  arch/arm/kvm/hyp/switch.c         |  9 +++++++++
>>  arch/arm64/include/asm/kvm_host.h |  6 +++++-
>>  arch/arm64/kvm/hyp/switch.c       |  8 ++++++++
>>  6 files changed, 62 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index 2d19e02..035e744 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -57,6 +57,8 @@ struct kvm_arch {
>>  	/* VTTBR value associated with below pgd and vmid */
>>  	u64    vttbr;
>>  
>> +	int __percpu *last_vcpu_ran;
>> +
>>  	/* Timer */
>>  	struct arch_timer_kvm	timer;
>>  
>> @@ -174,6 +176,9 @@ struct kvm_vcpu_arch {
>>  	/* vcpu power-off state */
>>  	bool power_off;
>>  
>> +	/* TLBI required */
>> +	bool requires_tlbi;
> 
> A bit of a nit, but it's not clear which class of TLBI is required, or
> why. It's probably worth a comment (and perhaps a bikeshed), like:
> 
> 	/*
> 	 * Local TLBs potentially contain conflicting entries from
> 	 * another vCPU within this VMID. All entries for this VMID must
> 	 * be invalidated from (local) TLBs before we run this vCPU.
> 	 */
> 	bool tlb_vmid_stale;

Yup, looks good.

> 
> [...]
> 
>> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
>> +{
>> +	int *last_ran;
>> +
>> +	last_ran = per_cpu_ptr(vcpu->kvm->arch.last_vcpu_ran, cpu);
>> +
>> +	/*
>> +	 * If we're very unlucky and get preempted before having ran
>> +	 * this vcpu for real, we'll end-up in a situation where any
>> +	 * vcpu that gets scheduled will perform an invalidation (this
>> +	 * vcpu explicitely requires it, and all the others will have
>> +	 * a different vcpu_id).
>> +	 */
> 
> Nit: s/explicitely/explicitly/
> 
> To bikeshed a little further, perhaps:
> 
> 	/*
> 	 * We might get preempted before the vCPU actually runs, but
> 	 * this is fine. Our TLBI stays pending until we actually make
> 	 * it to __activate_vm, so we won't miss a TLBI. If another vCPU
> 	 * gets scheduled, it will see our vcpu_id in last_ran, and pend
> 	 * a TLBI for itself.
> 	 */

Looks good too. I'll incorporate this into v2.

> 
>> +	if (*last_ran != vcpu->vcpu_id) {
>> +		if (*last_ran != -1)
>> +			vcpu->arch.requires_tlbi = true;
>> +
>> +		*last_ran = vcpu->vcpu_id;
>> +	}
>> +}
> 
> I take it this function is run in some non-preemptible context; if so,
> this looks correct to me.
> 
> If this is preemptible, then this looks racy.

This function is called from a preempt notifier, which itself isn't
preemptible.
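
For reference, the notifier callback in virt/kvm/kvm_main.c is along
these lines (a simplified sketch from memory; the exact body may differ
between kernel versions), and sched_in notifiers are fired from the
scheduler with preemption disabled:

	static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
	{
		struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

		/*
		 * Preemption is disabled here, so the update of
		 * last_vcpu_ran in kvm_arch_sched_in() cannot race with
		 * another task being scheduled on this physical CPU.
		 */
		kvm_arch_sched_in(vcpu, cpu);
		kvm_arch_vcpu_load(vcpu, cpu);
	}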

Thanks,

	M.
Christoffer Dall Oct. 27, 2016, 9:19 a.m. UTC | #3
On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
> Architecturally, TLBs are private to the (physical) CPU they're
> associated with. But when multiple vcpus from the same VM are
> being multiplexed on the same CPU, the TLBs are not private
> to the vcpus (and are actually shared across the VMID).
> 
> Let's consider the following scenario:
> 
> - vcpu-0 maps PA to VA
> - vcpu-1 maps PA' to VA
> 
> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
> by vcpu-0 accesses, and access the wrong physical page.
> 
> The solution to this is to keep a per-VM map of which vcpu ran last
> on each given physical CPU, and invalidate local TLBs when switching
> to a different vcpu from the same VM.

Just making sure I understand this:  The reason you cannot rely on the
guest doing the necessary distinction with ASIDs or invalidating the TLB
is that a guest (which assumes it's running on hardware) can validly
defer any necessary invalidation until it starts running on other
physical CPUs, but we do this transparently in KVM?

Thanks,
-Christoffer

> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_host.h   |  5 +++++
>  arch/arm/include/asm/kvm_hyp.h    |  1 +
>  arch/arm/kvm/arm.c                | 35 ++++++++++++++++++++++++++++++++++-
>  arch/arm/kvm/hyp/switch.c         |  9 +++++++++
>  arch/arm64/include/asm/kvm_host.h |  6 +++++-
>  arch/arm64/kvm/hyp/switch.c       |  8 ++++++++
>  6 files changed, 62 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 2d19e02..035e744 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -57,6 +57,8 @@ struct kvm_arch {
>  	/* VTTBR value associated with below pgd and vmid */
>  	u64    vttbr;
>  
> +	int __percpu *last_vcpu_ran;
> +
>  	/* Timer */
>  	struct arch_timer_kvm	timer;
>  
> @@ -174,6 +176,9 @@ struct kvm_vcpu_arch {
>  	/* vcpu power-off state */
>  	bool power_off;
>  
> +	/* TLBI required */
> +	bool requires_tlbi;
> +
>  	 /* Don't run the guest (internal implementation need) */
>  	bool pause;
>  
> diff --git a/arch/arm/include/asm/kvm_hyp.h b/arch/arm/include/asm/kvm_hyp.h
> index 343135e..5850890 100644
> --- a/arch/arm/include/asm/kvm_hyp.h
> +++ b/arch/arm/include/asm/kvm_hyp.h
> @@ -71,6 +71,7 @@
>  #define ICIALLUIS	__ACCESS_CP15(c7, 0, c1, 0)
>  #define ATS1CPR		__ACCESS_CP15(c7, 0, c8, 0)
>  #define TLBIALLIS	__ACCESS_CP15(c8, 0, c3, 0)
> +#define TLBIALL		__ACCESS_CP15(c8, 0, c7, 0)
>  #define TLBIALLNSNHIS	__ACCESS_CP15(c8, 4, c3, 4)
>  #define PRRR		__ACCESS_CP15(c10, 0, c2, 0)
>  #define NMRR		__ACCESS_CP15(c10, 0, c2, 1)
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 03e9273..09942f0 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -114,11 +114,18 @@ void kvm_arch_check_processor_compat(void *rtn)
>   */
>  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  {
> -	int ret = 0;
> +	int ret, cpu;
>  
>  	if (type)
>  		return -EINVAL;
>  
> +	kvm->arch.last_vcpu_ran = alloc_percpu(typeof(*kvm->arch.last_vcpu_ran));
> +	if (!kvm->arch.last_vcpu_ran)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu)
> +		*per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
> +
>  	ret = kvm_alloc_stage2_pgd(kvm);
>  	if (ret)
>  		goto out_fail_alloc;
> @@ -141,6 +148,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  out_free_stage2_pgd:
>  	kvm_free_stage2_pgd(kvm);
>  out_fail_alloc:
> +	free_percpu(kvm->arch.last_vcpu_ran);
> +	kvm->arch.last_vcpu_ran = NULL;
>  	return ret;
>  }
>  
> @@ -168,6 +177,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  {
>  	int i;
>  
> +	free_percpu(kvm->arch.last_vcpu_ran);
> +	kvm->arch.last_vcpu_ran = NULL;
> +
>  	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
>  		if (kvm->vcpus[i]) {
>  			kvm_arch_vcpu_free(kvm->vcpus[i]);
> @@ -310,6 +322,27 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	int *last_ran;
> +
> +	last_ran = per_cpu_ptr(vcpu->kvm->arch.last_vcpu_ran, cpu);
> +
> +	/*
> +	 * If we're very unlucky and get preempted before having ran
> +	 * this vcpu for real, we'll end-up in a situation where any
> +	 * vcpu that gets scheduled will perform an invalidation (this
> +	 * vcpu explicitely requires it, and all the others will have
> +	 * a different vcpu_id).
> +	 */
> +	if (*last_ran != vcpu->vcpu_id) {
> +		if (*last_ran != -1)
> +			vcpu->arch.requires_tlbi = true;
> +
> +		*last_ran = vcpu->vcpu_id;
> +	}
> +}
> +
>  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	vcpu->cpu = cpu;
> diff --git a/arch/arm/kvm/hyp/switch.c b/arch/arm/kvm/hyp/switch.c
> index 92678b7..ab8ee3b 100644
> --- a/arch/arm/kvm/hyp/switch.c
> +++ b/arch/arm/kvm/hyp/switch.c
> @@ -75,6 +75,15 @@ static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
>  	write_sysreg(kvm->arch.vttbr, VTTBR);
> +	if (vcpu->arch.requires_tlbi) {
> +		/* Force vttbr to be written */
> +		isb();
> +		/* Local invalidate only for this VMID */
> +		write_sysreg(0, TLBIALL);
> +		dsb(nsh);
> +		vcpu->arch.requires_tlbi = false;
> +	}
> +
>  	write_sysreg(vcpu->arch.midr, VPIDR);
>  }
>  
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bd94e67..5b42010 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -62,6 +62,8 @@ struct kvm_arch {
>  	/* VTTBR value associated with above pgd and vmid */
>  	u64    vttbr;
>  
> +	int __percpu *last_vcpu_ran;
> +
>  	/* The maximum number of vCPUs depends on the used GIC model */
>  	int max_vcpus;
>  
> @@ -252,6 +254,9 @@ struct kvm_vcpu_arch {
>  	/* vcpu power-off state */
>  	bool power_off;
>  
> +	/* TLBI required */
> +	bool requires_tlbi;
> +
>  	/* Don't run the guest (internal implementation need) */
>  	bool pause;
>  
> @@ -368,7 +373,6 @@ static inline void __cpu_reset_hyp_mode(unsigned long vector_ptr,
>  static inline void kvm_arch_hardware_unsetup(void) {}
>  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
>  static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
> -static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
>  static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>  
>  void kvm_arm_init_debug(void);
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 83037cd..8d9c3eb 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -131,6 +131,14 @@ static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
>  	write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +	if (vcpu->arch.requires_tlbi) {
> +		/* Force vttbr_el2 to be written */
> +		isb();
> +		/* Local invalidate only for this VMID */
> +		asm volatile("tlbi vmalle1" : : );
> +		dsb(nsh);
> +		vcpu->arch.requires_tlbi = false;
> +	}
>  }
>  
>  static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> -- 
> 2.1.4
>
Marc Zyngier Oct. 27, 2016, 9:49 a.m. UTC | #4
Hi Christoffer,

On 27/10/16 10:19, Christoffer Dall wrote:
> On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
>> Architecturally, TLBs are private to the (physical) CPU they're
>> associated with. But when multiple vcpus from the same VM are
>> being multiplexed on the same CPU, the TLBs are not private
>> to the vcpus (and are actually shared across the VMID).
>>
>> Let's consider the following scenario:
>>
>> - vcpu-0 maps PA to VA
>> - vcpu-1 maps PA' to VA
>>
>> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
>> by vcpu-0 accesses, and access the wrong physical page.
>>
>> The solution to this is to keep a per-VM map of which vcpu ran last
>> on each given physical CPU, and invalidate local TLBs when switching
>> to a different vcpu from the same VM.
> 
> Just making sure I understand this:  The reason you cannot rely on the
> guest doing the necessary distinction with ASIDs or invalidating the TLB
> is that a guest (which assumes it's running on hardware) can validly
> defer any necessary invalidation until it starts running on other
> physical CPUs, but we do this transparently in KVM?

The guest wouldn't have to do any invalidation at all on real HW,
because the TLBs are strictly private to a physical CPU (only the
invalidation can be broadcast to the Inner Shareable domain). But when
we multiplex two vcpus on the same physical CPU, we break the private
semantics, and a vcpu could hit in the TLB entries populated by
another one.

As we cannot segregate the TLB entries per vcpu (but only per VMID), the
workaround is to nuke all the TLBs for this VMID (locally only - no
broadcast) each time we find that two vcpus are sharing the same
physical CPU.

Is that clearer?

Thanks,

	M.
Christoffer Dall Oct. 27, 2016, 10:04 a.m. UTC | #5
On Thu, Oct 27, 2016 at 10:49:00AM +0100, Marc Zyngier wrote:
> Hi Christoffer,
> 
> On 27/10/16 10:19, Christoffer Dall wrote:
> > On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
> >> Architecturally, TLBs are private to the (physical) CPU they're
> >> associated with. But when multiple vcpus from the same VM are
> >> being multiplexed on the same CPU, the TLBs are not private
> >> to the vcpus (and are actually shared across the VMID).
> >>
> >> Let's consider the following scenario:
> >>
> >> - vcpu-0 maps PA to VA
> >> - vcpu-1 maps PA' to VA
> >>
> >> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
> >> by vcpu-0 accesses, and access the wrong physical page.
> >>
> >> The solution to this is to keep a per-VM map of which vcpu ran last
> >> on each given physical CPU, and invalidate local TLBs when switching
> >> to a different vcpu from the same VM.
> > 
> > Just making sure I understand this:  The reason you cannot rely on the
> > guest doing the necessary distinction with ASIDs or invalidating the TLB
> > is that a guest (which assumes it's running on hardware) can validly
> > defer any necessary invalidation until it starts running on other
> > physical CPUs, but we do this transparently in KVM?
> 
> The guest wouldn't have to do any invalidation at all on real HW,
> because the TLBs are strictly private to a physical CPU (only the
> invalidation can be broadcast to the Inner Shareable domain). But when
> we multiplex two vcpus on the same physical CPU, we break the private
> semantics, and a vcpu could hit in the TLB entries populated by
> another one.

Such a guest would be using a mapping of the same VA with the same ASID
on two separate CPUs, each pointing to a separate PA.  If it ever were
to, say, migrate a task, it would have to do invalidations then.  Right?

Does Linux or other guests actually do this?

I would suspect Linux has to eventually invalidate those mappings if it
wants the scheduler to be allowed to freely move things around.

> 
> As we cannot segregate the TLB entries per vcpu (but only per VMID), the
> workaround is to nuke all the TLBs for this VMID (locally only - no
> broadcast) each time we find that two vcpus are sharing the same
> physical CPU.
> 
> Is that clearer?

Yes, the fix is clear, just want to make sure I understand that it's a
valid circumstance where this actually happens.  But in either case, we
probably have to fix this to emulate the hardware correctly.

Another fix would be to allocate a VMID per VCPU I suppose, just to
introduce a terrible TLB hit ratio :)

Thanks,
-Christoffer
Marc Zyngier Oct. 27, 2016, 10:40 a.m. UTC | #6
On 27/10/16 11:04, Christoffer Dall wrote:
> On Thu, Oct 27, 2016 at 10:49:00AM +0100, Marc Zyngier wrote:
>> Hi Christoffer,
>>
>> On 27/10/16 10:19, Christoffer Dall wrote:
>>> On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
>>>> Architecturally, TLBs are private to the (physical) CPU they're
>>>> associated with. But when multiple vcpus from the same VM are
>>>> being multiplexed on the same CPU, the TLBs are not private
>>>> to the vcpus (and are actually shared across the VMID).
>>>>
>>>> Let's consider the following scenario:
>>>>
>>>> - vcpu-0 maps PA to VA
>>>> - vcpu-1 maps PA' to VA
>>>>
>>>> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
>>>> by vcpu-0 accesses, and access the wrong physical page.
>>>>
>>>> The solution to this is to keep a per-VM map of which vcpu ran last
>>>> on each given physical CPU, and invalidate local TLBs when switching
>>>> to a different vcpu from the same VM.
>>>
>>> Just making sure I understand this:  The reason you cannot rely on the
>>> guest doing the necessary distinction with ASIDs or invalidating the TLB
>>> is that a guest (which assumes it's running on hardware) can validly
>>> defer any necessary invalidation until it starts running on other
>>> physical CPUs, but we do this transparently in KVM?
>>
>> The guest wouldn't have to do any invalidation at all on real HW,
>> because the TLBs are strictly private to a physical CPU (only the
>> invalidation can be broadcast to the Inner Shareable domain). But when
>> we multiplex two vcpus on the same physical CPU, we break the private
>> semantics, and a vcpu could hit in the TLB entries populated by
>> another one.
> 
> Such a guest would be using a mapping of the same VA with the same ASID
> on two separate CPUs, each pointing to a separate PA.  If it ever were
> to, say, migrate a task, it would have to do invalidations then.  Right?

This doesn't have to be ASID tagged. Actually, it is more likely to
affect global mappings. Imagine for example that the kernel (which uses
global mappings for its own page tables) decides to create per-cpu
variable using this trick (all the CPUs have the same VA, but use
different PAs). No invalidation at all, everything looks perfectly fine,
until you start virtualizing it.
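
To illustrate, here is a purely hypothetical guest-side sketch (the
install_global_mapping() helper and the VA value are made up for the
example; they don't come from any real kernel):

	/* Same VA on every CPU, backed by a different PA per CPU. */
	#define PCPU_DATA_VA	0xffffffc000200000UL

	void setup_pcpu_page(int cpu, phys_addr_t pa)
	{
		/* Create a global (non-ASID-tagged) stage-1 mapping. */
		install_global_mapping(PCPU_DATA_VA, pa, PAGE_SIZE);

		/*
		 * No TLBI: on bare metal, CPU 'cpu' can only ever hit its
		 * own TLB entry for PCPU_DATA_VA. Once two such vcpus are
		 * multiplexed on one physical CPU under the same VMID,
		 * vcpu-1 can hit the entry allocated by vcpu-0 and access
		 * vcpu-0's page.
		 */
	}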

> Does Linux or other guests actually do this?

Linux may hit it with CPU hotplug, which uses global mappings (which a
vcpu using an ASID tagged mapping could then hit if the VAs overlap).

> 
> I would suspect Linux has to eventually invalidate those mappings if it
> wants the scheduler to be allowed to freely move things around.
> 
>>
>> As we cannot segregate the TLB entries per vcpu (but only per VMID), the
>> workaround is to nuke all the TLBs for this VMID (locally only - no
>> broadcast) each time we find that two vcpus are sharing the same
>> physical CPU.
>>
>> Is that clearer?
> 
> Yes, the fix is clear, just want to make sure I understand that it's a
> valid circumstance where this actually happens.  But in either case, we
> probably have to fix this to emulate the hardware correctly.
> 
> Another fix would be to allocate a VMID per VCPU I suppose, just to
> introduce a terrible TLB hit ratio :)

But that would break TLB invalidations that are broadcast in the Inner
Shareable domain. To do so, you'd have to trap every TLBI, and issue
corresponding invalidations for all the vcpus. I'm not sure I want to
see the performance number of that solution... ;-)

Thanks,

	M.
Mark Rutland Oct. 27, 2016, 10:51 a.m. UTC | #7
Hi Christoffer,

On Thu, Oct 27, 2016 at 12:04:28PM +0200, Christoffer Dall wrote:
> On Thu, Oct 27, 2016 at 10:49:00AM +0100, Marc Zyngier wrote:
> > The guest wouldn't have to do any invalidation at all on real HW,
> > because the TLBs are strictly private to a physical CPU (only the
> > invalidation can be broadcast to the Inner Shareable domain). But when
> > we multiplex two vcpus on the same physical CPU, we break the private
> > semantics, and a vcpu could hit in the TLB entries populated by
> > another one.
> 
> Such a guest would be using a mapping of the same VA with the same ASID
> on two separate CPUs, each pointing to a separate PA.  If it ever were
> to, say, migrate a task, it would have to do invalidations then.  Right?

An OS (not Linux) could use a different ASID space per-cpu.

e.g. with two single-threaded tasks A and B, you could have ASIDS:

	cpu0	cpu1
A	0	1
B	1	0

... and this would not be a problem, so long as when mappings changed
maintenance were performed appropriately (e.g. perhaps it uses IPIs to
trigger the relevant local TLB invalidation, rather than using broadcast
ops).

> Does Linux or other guests actually do this?

Linux currently doesn't use ASIDs that way, but does use global mappings
in a potentially-conflicting way in the cold-return paths (hotplug-on
and return from idle). With two vCPUs, you could have a sequence like:

	cpu0				cpu1
	Task with ASID x started
					hotplug on
					install global TTBR0 mapping
					global entry allocated into TLB
	Task hits cpu1's global entry

... which cannot happen bare-metal, and there's no point at which the
guest can perform suitable maintenance.

> Another fix would be to allocate a VMID per VCPU I suppose, just to
> introduce a terrible TLB hit ratio :)

That would break broadcast invalidation within the guest, no?

... unless you also trapped all TLB maintenance, and did the IPI-based
broadcast in SW.

Thanks,
Mark.
Christoffer Dall Oct. 27, 2016, 12:27 p.m. UTC | #8
On Thu, Oct 27, 2016 at 11:40:00AM +0100, Marc Zyngier wrote:
> On 27/10/16 11:04, Christoffer Dall wrote:
> > On Thu, Oct 27, 2016 at 10:49:00AM +0100, Marc Zyngier wrote:
> >> Hi Christoffer,
> >>
> >> On 27/10/16 10:19, Christoffer Dall wrote:
> >>> On Mon, Oct 24, 2016 at 04:31:28PM +0100, Marc Zyngier wrote:
> >>>> Architecturally, TLBs are private to the (physical) CPU they're
> >>>> associated with. But when multiple vcpus from the same VM are
> >>>> being multiplexed on the same CPU, the TLBs are not private
> >>>> to the vcpus (and are actually shared across the VMID).
> >>>>
> >>>> Let's consider the following scenario:
> >>>>
> >>>> - vcpu-0 maps PA to VA
> >>>> - vcpu-1 maps PA' to VA
> >>>>
> >>>> If run on the same physical CPU, vcpu-1 can hit TLB entries generated
> >>>> by vcpu-0 accesses, and access the wrong physical page.
> >>>>
> >>>> The solution to this is to keep a per-VM map of which vcpu ran last
> >>>> on each given physical CPU, and invalidate local TLBs when switching
> >>>> to a different vcpu from the same VM.
> >>>
> >>> Just making sure I understand this:  The reason you cannot rely on the
> >>> guest doing the necessary distinction with ASIDs or invalidating the TLB
> >>> is that a guest (which assumes it's running on hardware) can validly
> >>> defer any necessary invalidation until it starts running on other
> >>> physical CPUs, but we do this transparently in KVM?
> >>
> >> The guest wouldn't have to do any invalidation at all on real HW,
> >> because the TLBs are strictly private to a physical CPU (only the
> >> invalidation can be broadcast to the Inner Shareable domain). But when
> >> we multiplex two vcpus on the same physical CPU, we break the private
> >> semantics, and a vcpu could hit in the TLB entries populated by
> >> another one.
> > 
> > Such a guest would be using a mapping of the same VA with the same ASID
> > on two separate CPUs, each pointing to a separate PA.  If it ever were
> > to, say, migrate a task, it would have to do invalidations then.  Right?
> 
> This doesn't have to be ASID tagged. Actually, it is more likely to
> affect global mappings. Imagine for example that the kernel (which uses
> global mappings for its own page tables) decides to create per-cpu
> variables using this trick (all the CPUs have the same VA, but use
> different PAs). No invalidation at all, everything looks perfectly fine,
> until you start virtualizing it.
> 
> > Does Linux or other guests actually do this?
> 
> Linux may hit it with CPU hotplug, which uses global mappings (which a
> vcpu using an ASID tagged mapping could then hit if the VAs overlap).
> 

Right, ok, it's more threatening than I first thought.  Thanks for the
explanation.

> > 
> > I would suspect Linux has to eventually invalidate those mappings if it
> > wants the scheduler to be allowed to freely move things around.
> > 
> >>
> >> As we cannot segregate the TLB entries per vcpu (but only per VMID), the
> >> workaround is to nuke all the TLBs for this VMID (locally only - no
> >> broadcast) each time we find that two vcpus are sharing the same
> >> physical CPU.
> >>
> >> Is that clearer?
> > 
> > Yes, the fix is clear, just want to make sure I understand that it's a
> > valid circumstance where this actually happens.  But in either case, we
> > probably have to fix this to emulate the hardware correctly.
> > 
> > Another fix would be to allocate a VMID per VCPU I suppose, just to
> > introduce a terrible TLB hit ratio :)
> 
> But that would break TLB invalidations that are broadcast in the Inner
> Shareable domain. To do so, you'd have to trap every TLBI, and issue
> corresponding invalidations for all the vcpus. I'm not sure I want to
> see the performance number of that solution... ;-)
> 
Ah, yeah, that's ridiculous.  Forget what I said.

Thanks,
-Christoffer
Christoffer Dall Oct. 27, 2016, 12:28 p.m. UTC | #9
On Thu, Oct 27, 2016 at 11:51:26AM +0100, Mark Rutland wrote:
> Hi Christoffer,
> 
> On Thu, Oct 27, 2016 at 12:04:28PM +0200, Christoffer Dall wrote:
> > On Thu, Oct 27, 2016 at 10:49:00AM +0100, Marc Zyngier wrote:
> > > The guest wouldn't have to do any invalidation at all on real HW,
> > > because the TLBs are strictly private to a physical CPU (only the
> > > invalidation can be broadcast to the Inner Shareable domain). But when
> > > we multiplex two vcpus on the same physical CPU, we break the private
> > > semantics, and a vcpu could hit in the TLB entries populated by
> > > another one.
> > 
> > Such a guest would be using a mapping of the same VA with the same ASID
> > on two separate CPUs, each pointing to a separate PA.  If it ever were
> > to, say, migrate a task, it would have to do invalidations then.  Right?
> 
> An OS (not Linux) could use a different ASID space per-cpu.
> 
> e.g. with two single-threaded tasks A and B, you could have ASIDS:
> 
> 	cpu0	cpu1
> A	0	1
> B	1	0
> 
> ... and this would not be a problem, so long as when mappings changed
> maintenance were performed appropriately (e.g. perhaps it uses IPIs to
> trigger the relevant local TLB invalidation, rather than using broadcast
> ops).
> 
> > Does Linux or other guests actually do this?
> 
> Linux currently doesn't use ASIDs that way, but does use global mappings
> in a potentially-conflicting way in the cold-return paths (hotplug-on
> and return from idle). With two vCPUs, you could have a sequence like:
> 
> 	cpu0				cpu1
> 	Task with ASID x started
> 					hotplug on
> 					install global TTBR0 mapping
> 					global entry allocated into TLB
> 	Task hits cpu1's global entry
> 
> ... which cannot happen bare-metal, and there's no point at which the
> guest can perform suitable maintenance.
> 
> > Another fix would be to allocate a VMID per VCPU I suppose, just to
> > introduce a terrible TLB hit ratio :)
> 
> That would break broadcast invalidation within the guest, no?
> 
> ... unless you also trapped all TLB maintenance, and did the IPI-based
> broadcast in SW.
> 

Thanks for explanations, I'm getting the full picture now.

-Christoffer
Patch

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 2d19e02..035e744 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -57,6 +57,8 @@  struct kvm_arch {
 	/* VTTBR value associated with below pgd and vmid */
 	u64    vttbr;
 
+	int __percpu *last_vcpu_ran;
+
 	/* Timer */
 	struct arch_timer_kvm	timer;
 
@@ -174,6 +176,9 @@  struct kvm_vcpu_arch {
 	/* vcpu power-off state */
 	bool power_off;
 
+	/* TLBI required */
+	bool requires_tlbi;
+
 	 /* Don't run the guest (internal implementation need) */
 	bool pause;
 
diff --git a/arch/arm/include/asm/kvm_hyp.h b/arch/arm/include/asm/kvm_hyp.h
index 343135e..5850890 100644
--- a/arch/arm/include/asm/kvm_hyp.h
+++ b/arch/arm/include/asm/kvm_hyp.h
@@ -71,6 +71,7 @@ 
 #define ICIALLUIS	__ACCESS_CP15(c7, 0, c1, 0)
 #define ATS1CPR		__ACCESS_CP15(c7, 0, c8, 0)
 #define TLBIALLIS	__ACCESS_CP15(c8, 0, c3, 0)
+#define TLBIALL		__ACCESS_CP15(c8, 0, c7, 0)
 #define TLBIALLNSNHIS	__ACCESS_CP15(c8, 4, c3, 4)
 #define PRRR		__ACCESS_CP15(c10, 0, c2, 0)
 #define NMRR		__ACCESS_CP15(c10, 0, c2, 1)
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 03e9273..09942f0 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -114,11 +114,18 @@  void kvm_arch_check_processor_compat(void *rtn)
  */
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
-	int ret = 0;
+	int ret, cpu;
 
 	if (type)
 		return -EINVAL;
 
+	kvm->arch.last_vcpu_ran = alloc_percpu(typeof(*kvm->arch.last_vcpu_ran));
+	if (!kvm->arch.last_vcpu_ran)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(kvm->arch.last_vcpu_ran, cpu) = -1;
+
 	ret = kvm_alloc_stage2_pgd(kvm);
 	if (ret)
 		goto out_fail_alloc;
@@ -141,6 +148,8 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 out_free_stage2_pgd:
 	kvm_free_stage2_pgd(kvm);
 out_fail_alloc:
+	free_percpu(kvm->arch.last_vcpu_ran);
+	kvm->arch.last_vcpu_ran = NULL;
 	return ret;
 }
 
@@ -168,6 +177,9 @@  void kvm_arch_destroy_vm(struct kvm *kvm)
 {
 	int i;
 
+	free_percpu(kvm->arch.last_vcpu_ran);
+	kvm->arch.last_vcpu_ran = NULL;
+
 	for (i = 0; i < KVM_MAX_VCPUS; ++i) {
 		if (kvm->vcpus[i]) {
 			kvm_arch_vcpu_free(kvm->vcpus[i]);
@@ -310,6 +322,27 @@  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	int *last_ran;
+
+	last_ran = per_cpu_ptr(vcpu->kvm->arch.last_vcpu_ran, cpu);
+
+	/*
+	 * If we're very unlucky and get preempted before having ran
+	 * this vcpu for real, we'll end-up in a situation where any
+	 * vcpu that gets scheduled will perform an invalidation (this
+	 * vcpu explicitely requires it, and all the others will have
+	 * a different vcpu_id).
+	 */
+	if (*last_ran != vcpu->vcpu_id) {
+		if (*last_ran != -1)
+			vcpu->arch.requires_tlbi = true;
+
+		*last_ran = vcpu->vcpu_id;
+	}
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->cpu = cpu;
diff --git a/arch/arm/kvm/hyp/switch.c b/arch/arm/kvm/hyp/switch.c
index 92678b7..ab8ee3b 100644
--- a/arch/arm/kvm/hyp/switch.c
+++ b/arch/arm/kvm/hyp/switch.c
@@ -75,6 +75,15 @@  static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
 	write_sysreg(kvm->arch.vttbr, VTTBR);
+	if (vcpu->arch.requires_tlbi) {
+		/* Force vttbr to be written */
+		isb();
+		/* Local invalidate only for this VMID */
+		write_sysreg(0, TLBIALL);
+		dsb(nsh);
+		vcpu->arch.requires_tlbi = false;
+	}
+
 	write_sysreg(vcpu->arch.midr, VPIDR);
 }
 
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bd94e67..5b42010 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -62,6 +62,8 @@  struct kvm_arch {
 	/* VTTBR value associated with above pgd and vmid */
 	u64    vttbr;
 
+	int __percpu *last_vcpu_ran;
+
 	/* The maximum number of vCPUs depends on the used GIC model */
 	int max_vcpus;
 
@@ -252,6 +254,9 @@  struct kvm_vcpu_arch {
 	/* vcpu power-off state */
 	bool power_off;
 
+	/* TLBI required */
+	bool requires_tlbi;
+
 	/* Don't run the guest (internal implementation need) */
 	bool pause;
 
@@ -368,7 +373,6 @@  static inline void __cpu_reset_hyp_mode(unsigned long vector_ptr,
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
 void kvm_arm_init_debug(void);
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 83037cd..8d9c3eb 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -131,6 +131,14 @@  static void __hyp_text __activate_vm(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = kern_hyp_va(vcpu->kvm);
 	write_sysreg(kvm->arch.vttbr, vttbr_el2);
+	if (vcpu->arch.requires_tlbi) {
+		/* Force vttbr_el2 to be written */
+		isb();
+		/* Local invalidate only for this VMID */
+		asm volatile("tlbi vmalle1" : : );
+		dsb(nsh);
+		vcpu->arch.requires_tlbi = false;
+	}
 }
 
 static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)