
[PATCHv2] KVM: optimize apic interrupt delivery

Message ID 20120911130225.GN20907@redhat.com (mailing list archive)
State New, archived

Commit Message

Gleb Natapov Sept. 11, 2012, 1:02 p.m. UTC
Most interrupts are delivered to only one vcpu. Use pre-built tables to
find the interrupt destination instead of looping through all vcpus. In
logical mode, loop only through the vcpus in the logical cluster the irq
is sent to.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 Changelog:

  - v1->v2
   * fix race Avi noticed
   * rcu_read_lock() out of the block as per Avi
   * fix rcu issues pointed out by MST. All but one. Still use
     call_rcu(). I do not think this is a serious issue; if it is, it
     should be solved by the RCU subsystem.
   * fix phys_map overflow pointed out by MST
   * recalculate_apic_map() does not return an error any more.
   * add an optimization for lowest-priority logical mode with a single
     cpu as the destination (it happens)

I did not rewrite kvm_irq_delivery_to_apic_fast() MST's way since my way
looks cleaner to me.

--
			Gleb.

Comments

Avi Kivity Sept. 11, 2012, 1:26 p.m. UTC | #1
On 09/11/2012 04:02 PM, Gleb Natapov wrote:
> Most interrupts are delivered to only one vcpu. Use pre-built tables to
> find the interrupt destination instead of looping through all vcpus. In
> logical mode, loop only through the vcpus in the logical cluster the irq
> is sent to.
> 
>    * fix rcu issues pointed out by MST. All but one. Still use
>      call_rcu(). I do not think this is a serious issue; if it is, it
>      should be solved by the RCU subsystem.

Agree.

Patch looks good but some minor comments follow.

>  struct kvm_arch {
>  	unsigned int n_used_mmu_pages;
>  	unsigned int n_requested_mmu_pages;
> @@ -528,6 +536,8 @@ struct kvm_arch {
>  	struct kvm_ioapic *vioapic;
>  	struct kvm_pit *vpit;
>  	int vapics_in_nmi_mode;
> +	struct kvm_apic_map *apic_map;
> +	struct mutex apic_map_lock;

Reversing the order will make it clearer what the lock protects.
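
Something like this (sketch only), with a comment tying the two together:

	/* apic_map_lock protects updates of apic_map; readers use RCU */
	struct mutex apic_map_lock;
	struct kvm_apic_map *apic_map;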

>  
> +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> +		u16 *cid, u16 *lid)
> +{
> +	if (ldr_bits == 32) {
> +		*cid = ldr >> 16;
> +		*lid = ldr & 0xffff;
> +	} else {
> +		ldr = GET_APIC_LOGICAL_ID(ldr);
> +
> +		if (flat) {
> +			*cid = 0;
> +			*lid = ldr;
> +		} else {
> +			*cid = ldr >> 4;
> +			*lid = ldr & 0xf;
> +		}
> +	}
> +}

You could precalculate lid_shift/lid_mask/cid_shift/cid_mask and have
just one version here.  In fact you could drop the function.
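
An untested sketch of what I mean -- the new fields are invented here:

	struct kvm_apic_map {
		struct rcu_head rcu;
		u8 ldr_bits;
		/* set once by recalculate_apic_map():
		 *   x2apic:        cid_shift 16, cid_mask 0xffff, lid_shift 0,  lid_mask 0xffff
		 *   xapic cluster: cid_shift 28, cid_mask 0xf,    lid_shift 24, lid_mask 0xf
		 *   xapic flat:    cid_shift 28, cid_mask 0,      lid_shift 24, lid_mask 0xff
		 */
		u32 cid_shift, cid_mask, lid_shift, lid_mask;
		struct kvm_lapic *phys_map[256];
		struct kvm_lapic *logical_map[16][16];
	};

	/* both recalculate_apic_map() and the delivery path then reduce to */
	cid = (ldr >> map->cid_shift) & map->cid_mask;
	lid = (ldr >> map->lid_shift) & map->lid_mask;

The mda <<= 32 - map->ldr_bits adjustment in the delivery path stays as is.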

> +
> +static inline void recalculate_apic_map(struct kvm *kvm)
> +{
> +	struct kvm_apic_map *new, *old = NULL;
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> +
> +	mutex_lock(&kvm->arch.apic_map_lock);
> +
> +	if (!new)
> +		goto out;
> +
> +	new->ldr_bits = 8;
> +	new->flat = true;
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		u16 cid, lid;
> +		struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +		if (!kvm_apic_present(vcpu))
> +			continue;
> +
> +		if (apic_x2apic_mode(apic)) {
> +			new->ldr_bits = 32;
> +			new->flat = false;
> +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> +			new->flat = false;

While a vcpu is being hotplugged in it will be in flat mode.  The code
correctly gives precedence to x2apic and cluster modes over flat mode,
so it is correct in that respect, but the comment describing this is too
short.
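
For example (the wording is mine, just to show the kind of comment I mean):

	/*
	 * All APICs have to be configured in the same mode by the OS.
	 * A vcpu that is being hot-plugged still has its apic in the
	 * reset (xapic/flat) state, so once x2apic or cluster mode has
	 * been seen on any vcpu we assume that is the mode the OS wants,
	 * and a freshly added, not yet configured vcpu never switches
	 * the map back to flat.
	 */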

> +
> +		new->phys_map[kvm_apic_id(apic)] = apic;
> +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> +				new->flat, new->ldr_bits, &cid, &lid);
> +
> +		if (lid)
> +			new->logical_map[cid][ffs(lid) - 1] = apic;
> +	}
> +out:
> +	old = kvm->arch.apic_map;

rcu_dereference(), just for kicks.

> +	rcu_assign_pointer(kvm->arch.apic_map, new);
> +	mutex_unlock(&kvm->arch.apic_map_lock);
> +
> +	if (old)
> +		kfree_rcu(old, rcu);

Nice, removes the need for rcu_barrier().

> +}
> +
>  
> +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> +		struct kvm_lapic_irq *irq, int *r)
> +{
> +	struct kvm_apic_map *map;
> +	unsigned long bitmap = 1;
> +	struct kvm_lapic **dst;
> +	int i;
> +
> +	*r = -1;
> +
> +	if (irq->shorthand == APIC_DEST_SELF) {
> +		*r = kvm_apic_set_irq(src->vcpu, irq);
> +		return true;
> +	}
> +
> +	if (irq->shorthand)
> +		return false;
> +
> +	rcu_read_lock();
> +	map = rcu_dereference(kvm->arch.apic_map);
> +
> +	if (!map) {
> +		rcu_read_unlock();
> +		return false;
> +	}
> +
> +	if (irq->dest_mode == 0) { /* physical mode */
> +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> +				irq->dest_id == 0xff) {
> +			rcu_read_unlock();
> +			return false;
> +		}

Two error paths with rcu_read_unlock().  Cleaner to have a bool ret =
false; in the beginning and 'goto out_unlock' here, IMO.
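
I.e. roughly (sketch, not compiled):

	bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
			struct kvm_lapic_irq *irq, int *r)
	{
		struct kvm_apic_map *map;
		struct kvm_lapic **dst;
		unsigned long bitmap = 1;
		bool ret = false;
		int i;

		/* ... shorthand handling as before ... */

		rcu_read_lock();
		map = rcu_dereference(kvm->arch.apic_map);
		if (!map)
			goto out_unlock;

		if (irq->dest_mode == 0) { /* physical mode */
			if (irq->delivery_mode == APIC_DM_LOWEST ||
					irq->dest_id == 0xff)
				goto out_unlock;
			dst = &map->phys_map[irq->dest_id & 0xff];
		} else {
			/* ... logical mode as before ... */
		}

		/* ... deliver to everything left in bitmap ... */
		ret = true;
	out_unlock:
		rcu_read_unlock();
		return ret;
	}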


> +		dst = &map->phys_map[irq->dest_id & 0xff];
> +	} else {
> +		u16 cid, lid;
> +		u32 mda = irq->dest_id;
> +
> +		if (map->ldr_bits == 8)
> +			mda <<= 24;

mda <<= 32 - map->ldr_bits;

> +
> +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> +				&cid, &lid);
> +		dst = map->logical_map[cid];
> +
> +		bitmap = lid;
> +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> +				hweight_long(bitmap) > 1) {
> +			int l = -1;
> +			for_each_set_bit(i, &bitmap, 16) {
> +				if (!dst[i])
> +					continue;
> +				if (l < 0)
> +					l = i;
> +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> +					l = i;
> +			}
> +
> +			bitmap = (l >= 0) ? 1 << l : 0;
> +		}
> +	}
> +
> +	for_each_set_bit(i, &bitmap, 16) {
> +		if (!dst[i])
> +			continue;
> +		if (*r < 0)
> +			*r = 0;
> +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> +	}
> +
> +	rcu_read_unlock();
> +	return true;
> +}
> +
>  /*
> @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
>  		put_page(kvm->arch.ept_identity_pagetable);
> +	kfree(kvm->arch.apic_map);

rcu_dereference(), even though it cannot be needed here, to shut down
static code checkers.
Michael S. Tsirkin Sept. 11, 2012, 2:02 p.m. UTC | #2
On Tue, Sep 11, 2012 at 04:26:17PM +0300, Avi Kivity wrote:
> On 09/11/2012 04:02 PM, Gleb Natapov wrote:
> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > find the interrupt destination instead of looping through all vcpus. In
> > logical mode, loop only through the vcpus in the logical cluster the irq
> > is sent to.
> > 
> >    * fix rcu issues pointed out by MST. All but one. Still use
> >      call_rcu(). I do not think this is a serious issue; if it is, it
> >      should be solved by the RCU subsystem.
> 
> Agree.
> 
> Patch looks good but some minor comments follow.
> 
> >  struct kvm_arch {
> >  	unsigned int n_used_mmu_pages;
> >  	unsigned int n_requested_mmu_pages;
> > @@ -528,6 +536,8 @@ struct kvm_arch {
> >  	struct kvm_ioapic *vioapic;
> >  	struct kvm_pit *vpit;
> >  	int vapics_in_nmi_mode;
> > +	struct kvm_apic_map *apic_map;
> > +	struct mutex apic_map_lock;
> 
> Reversing the order will make it clearer what the lock protects.
> 
> >  
> > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > +		u16 *cid, u16 *lid)
> > +{
> > +	if (ldr_bits == 32) {
> > +		*cid = ldr >> 16;
> > +		*lid = ldr & 0xffff;
> > +	} else {
> > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > +
> > +		if (flat) {
> > +			*cid = 0;
> > +			*lid = ldr;
> > +		} else {
> > +			*cid = ldr >> 4;
> > +			*lid = ldr & 0xf;
> > +		}
> > +	}
> > +}
> 
> You could precalculate lid_shift/lid_mask/cid_shift/cid_mask and have
> just one version here.  In fact you could drop the function.
> 
> > +
> > +static inline void recalculate_apic_map(struct kvm *kvm)
> > +{
> > +	struct kvm_apic_map *new, *old = NULL;
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > +
> > +	mutex_lock(&kvm->arch.apic_map_lock);
> > +
> > +	if (!new)
> > +		goto out;
> > +
> > +	new->ldr_bits = 8;
> > +	new->flat = true;
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		u16 cid, lid;
> > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +		if (!kvm_apic_present(vcpu))
> > +			continue;
> > +
> > +		if (apic_x2apic_mode(apic)) {
> > +			new->ldr_bits = 32;
> > +			new->flat = false;
> > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > +			new->flat = false;
> 
> While a vcpu is being hotplugged in it will be in flat mode.  The code
> correctly gives precedence to x2apic and cluster modes over flat mode,
> so it is correct in that respect, but the comment describing this is too
> short.
> 
> > +
> > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > +				new->flat, new->ldr_bits, &cid, &lid);
> > +
> > +		if (lid)
> > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > +	}
> > +out:
> > +	old = kvm->arch.apic_map;
> 
> rcu_dereference(), just for kicks.

rcu_dereference will give warnings with lockdep.
Should be rcu_dereference_protected.
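
I.e. something like:

	old = rcu_dereference_protected(kvm->arch.apic_map,
			lockdep_is_held(&kvm->arch.apic_map_lock));

which also documents exactly which lock makes the plain read safe.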

> > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > +
> > +	if (old)
> > +		kfree_rcu(old, rcu);
> 
> Nice, removes the need for rcu_barrier().
> 
> > +}
> > +
> >  
> > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > +		struct kvm_lapic_irq *irq, int *r)
> > +{
> > +	struct kvm_apic_map *map;
> > +	unsigned long bitmap = 1;
> > +	struct kvm_lapic **dst;
> > +	int i;
> > +
> > +	*r = -1;
> > +
> > +	if (irq->shorthand == APIC_DEST_SELF) {
> > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > +		return true;
> > +	}
> > +
> > +	if (irq->shorthand)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +	if (!map) {
> > +		rcu_read_unlock();
> > +		return false;
> > +	}
> > +
> > +	if (irq->dest_mode == 0) { /* physical mode */
> > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > +				irq->dest_id == 0xff) {
> > +			rcu_read_unlock();
> > +			return false;
> > +		}
> 
> Two error paths with rcu_read_unlock().  Cleaner to have a bool ret =
> false; in the beginning and 'goto out_unlock' here, IMO.

Nod.

> 
> > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > +	} else {
> > +		u16 cid, lid;
> > +		u32 mda = irq->dest_id;
> > +
> > +		if (map->ldr_bits == 8)
> > +			mda <<= 24;
> 
> mda <<= 32 - map->ldr_bits;
> 
> > +
> > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > +				&cid, &lid);
> > +		dst = map->logical_map[cid];
> > +
> > +		bitmap = lid;
> > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > +				hweight_long(bitmap) > 1) {
> > +			int l = -1;
> > +			for_each_set_bit(i, &bitmap, 16) {
> > +				if (!dst[i])
> > +					continue;
> > +				if (l < 0)
> > +					l = i;
> > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > +					l = i;
> > +			}
> > +
> > +			bitmap = (l >= 0) ? 1 << l : 0;
> > +		}
> > +	}
> > +
> > +	for_each_set_bit(i, &bitmap, 16) {
> > +		if (!dst[i])
> > +			continue;
> > +		if (*r < 0)
> > +			*r = 0;
> > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return true;
> > +}
> > +
> >  /*
> > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> >  		put_page(kvm->arch.apic_access_page);
> >  	if (kvm->arch.ept_identity_pagetable)
> >  		put_page(kvm->arch.ept_identity_pagetable);
> > +	kfree(kvm->arch.apic_map);
> 
> rcu_dereference(), even though it cannot be needed here, to shut down
> static code checkers.

Here too, rcu_dereference will give warnings with lockdep.
Use rcu_dereference_check(...., 1) instead.
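
I.e.:

	kfree(rcu_dereference_check(kvm->arch.apic_map, 1));

The unconditional 1 documents that no readers can exist at this point,
which is true during VM destruction.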


> 
> -- 
> error compiling committee.c: too many arguments to function
Michael S. Tsirkin Sept. 11, 2012, 2:10 p.m. UTC | #3
On Tue, Sep 11, 2012 at 04:02:25PM +0300, Gleb Natapov wrote:
> Most interrupts are delivered to only one vcpu. Use pre-built tables to
> find the interrupt destination instead of looping through all vcpus. In
> logical mode, loop only through the vcpus in the logical cluster the irq
> is sent to.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>

Added Paul just to make sure we are using RCU correctly.
Paul could you pls answer the question below?

> ---
>  Changelog:
> 
>   - v1->v2
>    * fix race Avi noticed
>    * rcu_read_lock() out of the block as per Avi
>    * fix rcu issues pointed out by MST. All but one. Still use
>      call_rcu(). I do not think this is a serious issue; if it is, it
>      should be solved by the RCU subsystem.
>    * fix phys_map overflow pointed out by MST
>    * recalculate_apic_map() does not return an error any more.
>    * add an optimization for lowest-priority logical mode with a single
>      cpu as the destination (it happens)
> 
> I did not rewrite kvm_irq_delivery_to_apic_fast() MST's way since my way
> looks cleaner to me.
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 64adb61..3ba8951 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
>  	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
>  };
>  
> +struct kvm_apic_map {
> +	struct rcu_head rcu;
> +	bool flat;
> +	u8 ldr_bits;
> +	struct kvm_lapic *phys_map[256];
> +	struct kvm_lapic *logical_map[16][16];
> +};
> +
>  struct kvm_arch {
>  	unsigned int n_used_mmu_pages;
>  	unsigned int n_requested_mmu_pages;
> @@ -528,6 +536,8 @@ struct kvm_arch {
>  	struct kvm_ioapic *vioapic;
>  	struct kvm_pit *vpit;
>  	int vapics_in_nmi_mode;
> +	struct kvm_apic_map *apic_map;
> +	struct mutex apic_map_lock;
>  
>  	unsigned int tss_addr;
>  	struct page *apic_access_page;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 07ad628..06672e8 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -139,11 +139,92 @@ static inline int apic_enabled(struct kvm_lapic *apic)
>  	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
>  	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
>  
> +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> +{
> +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> +}
> +
>  static inline int kvm_apic_id(struct kvm_lapic *apic)
>  {
>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
>  
> +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> +		u16 *cid, u16 *lid)
> +{
> +	if (ldr_bits == 32) {
> +		*cid = ldr >> 16;
> +		*lid = ldr & 0xffff;
> +	} else {
> +		ldr = GET_APIC_LOGICAL_ID(ldr);
> +
> +		if (flat) {
> +			*cid = 0;
> +			*lid = ldr;
> +		} else {
> +			*cid = ldr >> 4;
> +			*lid = ldr & 0xf;
> +		}
> +	}
> +}
> +
> +static inline void recalculate_apic_map(struct kvm *kvm)
> +{
> +	struct kvm_apic_map *new, *old = NULL;
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> +
> +	mutex_lock(&kvm->arch.apic_map_lock);
> +
> +	if (!new)
> +		goto out;
> +
> +	new->ldr_bits = 8;
> +	new->flat = true;
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		u16 cid, lid;
> +		struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +		if (!kvm_apic_present(vcpu))
> +			continue;
> +
> +		if (apic_x2apic_mode(apic)) {
> +			new->ldr_bits = 32;
> +			new->flat = false;
> +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> +			new->flat = false;
> +
> +		new->phys_map[kvm_apic_id(apic)] = apic;
> +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> +				new->flat, new->ldr_bits, &cid, &lid);
> +
> +		if (lid)
> +			new->logical_map[cid][ffs(lid) - 1] = apic;
> +	}
> +out:
> +	old = kvm->arch.apic_map;
> +	rcu_assign_pointer(kvm->arch.apic_map, new);
> +	mutex_unlock(&kvm->arch.apic_map_lock);
> +
> +	if (old)
> +		kfree_rcu(old, rcu);
> +}

Paul, I'd like to check something with you here:
this function can be triggered by userspace,
any number of times; we allocate
a 2K chunk of memory that is later freed by
kfree_rcu.

Is there a risk of DOS if RCU is delayed while
lots of memory is queued up in this way?
If yes is this a generic problem with kfree_rcu
that should be addressed in core kernel?

Thanks!

> +
> +static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> +{
> +	apic_set_reg(apic, APIC_ID, id << 24);
> +	recalculate_apic_map(apic->vcpu->kvm);
> +}
> +
> +static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> +{
> +	apic_set_reg(apic, APIC_LDR, id);
> +	recalculate_apic_map(apic->vcpu->kvm);
> +}
> +
>  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
>  {
>  	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
> @@ -193,11 +274,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
>  	apic_set_reg(apic, APIC_LVR, v);
>  }
>  
> -static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> -{
> -	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> -}
> -
>  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
>  	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
>  	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
> @@ -477,6 +553,79 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>  	return result;
>  }
>  
> +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> +		struct kvm_lapic_irq *irq, int *r)
> +{
> +	struct kvm_apic_map *map;
> +	unsigned long bitmap = 1;
> +	struct kvm_lapic **dst;
> +	int i;
> +
> +	*r = -1;
> +
> +	if (irq->shorthand == APIC_DEST_SELF) {
> +		*r = kvm_apic_set_irq(src->vcpu, irq);
> +		return true;
> +	}
> +
> +	if (irq->shorthand)
> +		return false;
> +
> +	rcu_read_lock();
> +	map = rcu_dereference(kvm->arch.apic_map);
> +
> +	if (!map) {
> +		rcu_read_unlock();
> +		return false;
> +	}
> +
> +	if (irq->dest_mode == 0) { /* physical mode */
> +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> +				irq->dest_id == 0xff) {
> +			rcu_read_unlock();
> +			return false;
> +		}
> +		dst = &map->phys_map[irq->dest_id & 0xff];
> +	} else {
> +		u16 cid, lid;
> +		u32 mda = irq->dest_id;
> +
> +		if (map->ldr_bits == 8)
> +			mda <<= 24;
> +
> +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> +				&cid, &lid);
> +		dst = map->logical_map[cid];
> +
> +		bitmap = lid;
> +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> +				hweight_long(bitmap) > 1) {
> +			int l = -1;
> +			for_each_set_bit(i, &bitmap, 16) {
> +				if (!dst[i])
> +					continue;
> +				if (l < 0)
> +					l = i;
> +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> +					l = i;
> +			}
> +
> +			bitmap = (l >= 0) ? 1 << l : 0;
> +		}
> +	}
> +
> +	for_each_set_bit(i, &bitmap, 16) {
> +		if (!dst[i])
> +			continue;
> +		if (*r < 0)
> +			*r = 0;
> +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> +	}
> +
> +	rcu_read_unlock();
> +	return true;
> +}
> +
>  /*
>   * Add a pending IRQ into lapic.
>   * Return 1 if successfully added and 0 if discarded.
> @@ -880,7 +1029,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  	switch (reg) {
>  	case APIC_ID:		/* Local APIC ID */
>  		if (!apic_x2apic_mode(apic))
> -			apic_set_reg(apic, APIC_ID, val);
> +			kvm_apic_set_id(apic, val >> 24);
>  		else
>  			ret = 1;
>  		break;
> @@ -896,15 +1045,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  
>  	case APIC_LDR:
>  		if (!apic_x2apic_mode(apic))
> -			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
> +			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
>  		else
>  			ret = 1;
>  		break;
>  
>  	case APIC_DFR:
> -		if (!apic_x2apic_mode(apic))
> +		if (!apic_x2apic_mode(apic)) {
>  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> -		else
> +			recalculate_apic_map(apic->vcpu->kvm);
> +		} else
>  			ret = 1;
>  		break;
>  
> @@ -1135,6 +1285,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  			static_key_slow_dec_deferred(&apic_hw_disabled);
>  		else
>  			static_key_slow_inc(&apic_hw_disabled.key);
> +		recalculate_apic_map(vcpu->kvm);
>  	}
>  
>  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> @@ -1144,7 +1295,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  	if (apic_x2apic_mode(apic)) {
>  		u32 id = kvm_apic_id(apic);
>  		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
> -		apic_set_reg(apic, APIC_LDR, ldr);
> +		kvm_apic_set_ldr(apic, ldr);
>  	}
>  	apic->base_address = apic->vcpu->arch.apic_base &
>  			     MSR_IA32_APICBASE_BASE;
> @@ -1169,7 +1320,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
>  	/* Stop the timer in case it's a reset to an active apic */
>  	hrtimer_cancel(&apic->lapic_timer.timer);
>  
> -	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
> +	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
>  	kvm_apic_set_version(apic->vcpu);
>  
>  	for (i = 0; i < APIC_LVT_NUM; i++)
> @@ -1180,7 +1331,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
>  	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
>  	apic_set_spiv(apic, 0xff);
>  	apic_set_reg(apic, APIC_TASKPRI, 0);
> -	apic_set_reg(apic, APIC_LDR, 0);
> +	kvm_apic_set_ldr(apic, 0);
>  	apic_set_reg(apic, APIC_ESR, 0);
>  	apic_set_reg(apic, APIC_ICR, 0);
>  	apic_set_reg(apic, APIC_ICR2, 0);
> @@ -1398,6 +1549,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
>  	/* set SPIV separately to get count of SW disabled APICs right */
>  	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
>  	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
> +	/* call kvm_apic_set_id() to put apic into apic_map */
> +	kvm_apic_set_id(apic, kvm_apic_id(apic));
>  	kvm_apic_set_version(vcpu);
>  
>  	apic_update_ppr(apic);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 615a8b0..e5ebf9f 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -52,6 +52,9 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
>  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
>  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
>  
> +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> +		struct kvm_lapic_irq *irq, int *r);
> +
>  u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
>  void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
>  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f91e2c9..db6db8f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6269,6 +6269,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
>  
>  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
> +	mutex_init(&kvm->arch.apic_map_lock);
>  
>  	return 0;
>  }
> @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
>  		put_page(kvm->arch.ept_identity_pagetable);
> +	kfree(kvm->arch.apic_map);
>  }
>  
>  void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 7118be0..3ca89c4 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -68,8 +68,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	struct kvm_vcpu *vcpu, *lowest = NULL;
>  
>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> -			kvm_is_dm_lowest_prio(irq))
> +			kvm_is_dm_lowest_prio(irq)) {
>  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
> +		irq->delivery_mode = APIC_DM_FIXED;
> +	}
> +
> +	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
> +		return r;
>  
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>  		if (!kvm_apic_present(vcpu))
> --
> 			Gleb.
Gleb Natapov Sept. 11, 2012, 2:46 p.m. UTC | #4
On Tue, Sep 11, 2012 at 04:26:17PM +0300, Avi Kivity wrote:
> On 09/11/2012 04:02 PM, Gleb Natapov wrote:
> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > find the interrupt destination instead of looping through all vcpus. In
> > logical mode, loop only through the vcpus in the logical cluster the irq
> > is sent to.
> > 
> >    * fix rcu issues pointed out by MST. All but one. Still use
> >      call_rcu(). I do not think this is a serious issue; if it is, it
> >      should be solved by the RCU subsystem.
> 
> Agree.
> 
> Patch looks good but some minor comments follow.
> 
> >  struct kvm_arch {
> >  	unsigned int n_used_mmu_pages;
> >  	unsigned int n_requested_mmu_pages;
> > @@ -528,6 +536,8 @@ struct kvm_arch {
> >  	struct kvm_ioapic *vioapic;
> >  	struct kvm_pit *vpit;
> >  	int vapics_in_nmi_mode;
> > +	struct kvm_apic_map *apic_map;
> > +	struct mutex apic_map_lock;
> 
> Reversing the order will make it clearer what the lock protects.
> 
Hmm, OK. I thought the names make it clear.

> >  
> > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > +		u16 *cid, u16 *lid)
> > +{
> > +	if (ldr_bits == 32) {
> > +		*cid = ldr >> 16;
> > +		*lid = ldr & 0xffff;
> > +	} else {
> > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > +
> > +		if (flat) {
> > +			*cid = 0;
> > +			*lid = ldr;
> > +		} else {
> > +			*cid = ldr >> 4;
> > +			*lid = ldr & 0xf;
> > +		}
> > +	}
> > +}
> 
> You could precalculate lid_shift/lid_mask/cid_shift/cid_mask and have
> just one version here.  In fact you could drop the function.
> 
You mean precalculate them in recalculate_apic_map() and store in kvm_apic_map?
 
> > +
> > +static inline void recalculate_apic_map(struct kvm *kvm)
> > +{
> > +	struct kvm_apic_map *new, *old = NULL;
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > +
> > +	mutex_lock(&kvm->arch.apic_map_lock);
> > +
> > +	if (!new)
> > +		goto out;
> > +
> > +	new->ldr_bits = 8;
> > +	new->flat = true;
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		u16 cid, lid;
> > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +		if (!kvm_apic_present(vcpu))
> > +			continue;
> > +
> > +		if (apic_x2apic_mode(apic)) {
> > +			new->ldr_bits = 32;
> > +			new->flat = false;
> > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > +			new->flat = false;
> 
> While a vcpu is being hotplugged in it will be in flat mode.  The code
> correctly gives precedence to x2apic and cluster modes over flat mode,
> so it is correct in that respect, but the comment describing this is too
> short.
> 
Almost non-existent.

> > +
> > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > +				new->flat, new->ldr_bits, &cid, &lid);
> > +
> > +		if (lid)
> > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > +	}
> > +out:
> > +	old = kvm->arch.apic_map;
> 
> rcu_dereference(), just for kicks.
> 
MST says rcu_dereference_protected(), but honestly I look at it and
rcu_dereference_check(...., 1) and the conditions they check are so obviously
correct in the code that using them is just clutter. In more complex
cases, when the dereference happens far away from the locking, it has its
point. If you insist on it here, should we add it to the irq routing code too?

> > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > +
> > +	if (old)
> > +		kfree_rcu(old, rcu);
> 
> Nice, removes the need for rcu_barrier().
> 
> > +}
> > +
> >  
> > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > +		struct kvm_lapic_irq *irq, int *r)
> > +{
> > +	struct kvm_apic_map *map;
> > +	unsigned long bitmap = 1;
> > +	struct kvm_lapic **dst;
> > +	int i;
> > +
> > +	*r = -1;
> > +
> > +	if (irq->shorthand == APIC_DEST_SELF) {
> > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > +		return true;
> > +	}
> > +
> > +	if (irq->shorthand)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +	if (!map) {
> > +		rcu_read_unlock();
> > +		return false;
> > +	}
> > +
> > +	if (irq->dest_mode == 0) { /* physical mode */
> > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > +				irq->dest_id == 0xff) {
> > +			rcu_read_unlock();
> > +			return false;
> > +		}
> 
> Two error paths with rcu_read_unlock().  Cleaner to have a bool ret =
> false; in the beginning and 'goto out_unlock' here, IMO.
> 
That's two against one :(
> 
> > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > +	} else {
> > +		u16 cid, lid;
> > +		u32 mda = irq->dest_id;
> > +
> > +		if (map->ldr_bits == 8)
> > +			mda <<= 24;
> 
> mda <<= 32 - map->ldr_bits;
> 
Nice.

> > +
> > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > +				&cid, &lid);
> > +		dst = map->logical_map[cid];
> > +
> > +		bitmap = lid;
> > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > +				hweight_long(bitmap) > 1) {
> > +			int l = -1;
> > +			for_each_set_bit(i, &bitmap, 16) {
> > +				if (!dst[i])
> > +					continue;
> > +				if (l < 0)
> > +					l = i;
> > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > +					l = i;
> > +			}
> > +
> > +			bitmap = (l >= 0) ? 1 << l : 0;
> > +		}
> > +	}
> > +
> > +	for_each_set_bit(i, &bitmap, 16) {
> > +		if (!dst[i])
> > +			continue;
> > +		if (*r < 0)
> > +			*r = 0;
> > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return true;
> > +}
> > +
> >  /*
> > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> >  		put_page(kvm->arch.apic_access_page);
> >  	if (kvm->arch.ept_identity_pagetable)
> >  		put_page(kvm->arch.ept_identity_pagetable);
> > +	kfree(kvm->arch.apic_map);
> 
> rcu_dereference(), even though it cannot be needed here, to shut down
> static code checkers.
> 
How to run those code checkers? Do they complain about irq routing code?
Just curious.

--
			Gleb.
Avi Kivity Sept. 11, 2012, 3:51 p.m. UTC | #5
On 09/11/2012 05:46 PM, Gleb Natapov wrote:
> On Tue, Sep 11, 2012 at 04:26:17PM +0300, Avi Kivity wrote:
>> On 09/11/2012 04:02 PM, Gleb Natapov wrote:
>> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
>> > find the interrupt destination instead of looping through all vcpus. In
>> > logical mode, loop only through the vcpus in the logical cluster the irq
>> > is sent to.
>> > 
>> >    * fix rcu issues pointed out by MST. All but one. Still use
>> >      call_rcu(). I do not think this is a serious issue; if it is, it
>> >      should be solved by the RCU subsystem.
>> 
>> Agree.
>> 
>> Patch looks good but some minor comments follow.
>> 
>> >  struct kvm_arch {
>> >  	unsigned int n_used_mmu_pages;
>> >  	unsigned int n_requested_mmu_pages;
>> > @@ -528,6 +536,8 @@ struct kvm_arch {
>> >  	struct kvm_ioapic *vioapic;
>> >  	struct kvm_pit *vpit;
>> >  	int vapics_in_nmi_mode;
>> > +	struct kvm_apic_map *apic_map;
>> > +	struct mutex apic_map_lock;
>> 
>> Reversing the order will make it clearer what the lock protects.
>> 
> Hmm, OK. I thought the names make it clear.

They do, but it is conventional and good practice to put the lock in front.

> 
>> >  
>> > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
>> > +		u16 *cid, u16 *lid)
>> > +{
>> > +	if (ldr_bits == 32) {
>> > +		*cid = ldr >> 16;
>> > +		*lid = ldr & 0xffff;
>> > +	} else {
>> > +		ldr = GET_APIC_LOGICAL_ID(ldr);
>> > +
>> > +		if (flat) {
>> > +			*cid = 0;
>> > +			*lid = ldr;
>> > +		} else {
>> > +			*cid = ldr >> 4;
>> > +			*lid = ldr & 0xf;
>> > +		}
>> > +	}
>> > +}
>> 
>> You could precalculate lid_shift/lid_mask/cid_shift/cid_mask and have
>> just one version here.  In fact you could drop the function.
>> 
> You mean precalculate them in recalculate_apic_map() and store in kvm_apic_map?

Yes.

> 
>> > +
>> > +		new->phys_map[kvm_apic_id(apic)] = apic;
>> > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
>> > +				new->flat, new->ldr_bits, &cid, &lid);
>> > +
>> > +		if (lid)
>> > +			new->logical_map[cid][ffs(lid) - 1] = apic;
>> > +	}
>> > +out:
>> > +	old = kvm->arch.apic_map;
>> 
>> rcu_dereference(), just for kicks.
>> 
> MST says rcu_dereference_protected(), but honestly I look at it and
> rcu_dereference_check(...., 1) and the conditions they check are so obviously
> correct in the code that using them is just clutter.

They say to the reader, or to a static checker "this case was considered
and that is the conclusion we arrived at".  Using the variable directly
says nothing.

> In more complex
> cases, when the dereference happens far away from the locking, it has its
> point. If you insist on it here, should we add it to the irq routing code too?

Thanks.

>> >  /*
>> > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>> >  		put_page(kvm->arch.apic_access_page);
>> >  	if (kvm->arch.ept_identity_pagetable)
>> >  		put_page(kvm->arch.ept_identity_pagetable);
>> > +	kfree(kvm->arch.apic_map);
>> 
>> rcu_dereference(), even though it cannot be needed here, to shut down
>> static code checkers.
>> 
> How to run those code checkers? Do they complain about irq routing code?
> Just curious.

No idea.  Julia Lawall sends patches from time to time with this kind of
thing, so people do run them.
Paul E. McKenney Sept. 11, 2012, 5:13 p.m. UTC | #6
On Tue, Sep 11, 2012 at 05:10:23PM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 11, 2012 at 04:02:25PM +0300, Gleb Natapov wrote:
> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > find the interrupt destination instead of looping through all vcpus. In
> > logical mode, loop only through the vcpus in the logical cluster the irq
> > is sent to.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> 
> Added Paul just to make sure we are using RCU correctly.
> Paul could you pls answer the question below?
> 
> > ---
> >  Changelog:
> > 
> >   - v1->v2
> >    * fix race Avi noticed
> >    * rcu_read_lock() out of the block as per Avi
> >    * fix rcu issues pointed out by MST. All but one. Still use
> >      call_rcu(). I do not think this is a serious issue; if it is, it
> >      should be solved by the RCU subsystem.
> >    * fix phys_map overflow pointed out by MST
> >    * recalculate_apic_map() does not return an error any more.
> >    * add an optimization for lowest-priority logical mode with a single
> >      cpu as the destination (it happens)
> > 
> > I did not rewrite kvm_irq_delivery_to_apic_fast() MST's way since my way
> > looks cleaner to me.
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 64adb61..3ba8951 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
> >  	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
> >  };
> >  
> > +struct kvm_apic_map {
> > +	struct rcu_head rcu;
> > +	bool flat;
> > +	u8 ldr_bits;
> > +	struct kvm_lapic *phys_map[256];
> > +	struct kvm_lapic *logical_map[16][16];
> > +};
> > +
> >  struct kvm_arch {
> >  	unsigned int n_used_mmu_pages;
> >  	unsigned int n_requested_mmu_pages;
> > @@ -528,6 +536,8 @@ struct kvm_arch {
> >  	struct kvm_ioapic *vioapic;
> >  	struct kvm_pit *vpit;
> >  	int vapics_in_nmi_mode;
> > +	struct kvm_apic_map *apic_map;
> > +	struct mutex apic_map_lock;
> >  
> >  	unsigned int tss_addr;
> >  	struct page *apic_access_page;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 07ad628..06672e8 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -139,11 +139,92 @@ static inline int apic_enabled(struct kvm_lapic *apic)
> >  	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
> >  	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
> >  
> > +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > +{
> > +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > +}
> > +
> >  static inline int kvm_apic_id(struct kvm_lapic *apic)
> >  {
> >  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> >  }
> >  
> > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > +		u16 *cid, u16 *lid)
> > +{
> > +	if (ldr_bits == 32) {
> > +		*cid = ldr >> 16;
> > +		*lid = ldr & 0xffff;
> > +	} else {
> > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > +
> > +		if (flat) {
> > +			*cid = 0;
> > +			*lid = ldr;
> > +		} else {
> > +			*cid = ldr >> 4;
> > +			*lid = ldr & 0xf;
> > +		}
> > +	}
> > +}
> > +
> > +static inline void recalculate_apic_map(struct kvm *kvm)
> > +{
> > +	struct kvm_apic_map *new, *old = NULL;
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > +
> > +	mutex_lock(&kvm->arch.apic_map_lock);
> > +
> > +	if (!new)
> > +		goto out;
> > +
> > +	new->ldr_bits = 8;
> > +	new->flat = true;
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		u16 cid, lid;
> > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +		if (!kvm_apic_present(vcpu))
> > +			continue;
> > +
> > +		if (apic_x2apic_mode(apic)) {
> > +			new->ldr_bits = 32;
> > +			new->flat = false;
> > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > +			new->flat = false;
> > +
> > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > +				new->flat, new->ldr_bits, &cid, &lid);
> > +
> > +		if (lid)
> > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > +	}
> > +out:
> > +	old = kvm->arch.apic_map;
> > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > +
> > +	if (old)
> > +		kfree_rcu(old, rcu);
> > +}
> 
> Paul, I'd like to check something with you here:
> this function can be triggered by userspace,
> any number of times; we allocate
> a 2K chunk of memory that is later freed by
> kfree_rcu.
> 
> Is there a risk of DOS if RCU is delayed while
> lots of memory is queued up in this way?
> If yes is this a generic problem with kfree_rcu
> that should be addressed in core kernel?

There is indeed a risk.  The kfree_rcu() implementation cannot really
decide what to do here, especially given that it is callable with irqs
disabled.

The usual approach is to keep a per-CPU counter and count it down from
some number for each kfree_rcu().  When it reaches zero, invoke
synchronize_rcu() as well as kfree_rcu(), and then reset it to the
"some number" mentioned above.

In theory, I could create an API that did this.  In practice, I have no
idea how to choose the number -- much depends on the size of the object
being freed, for example.

The RCU usage looks otherwise sane at first glance.

							Thanx, Paul

> Thanks!
> 
> > +
> > +static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > +{
> > +	apic_set_reg(apic, APIC_ID, id << 24);
> > +	recalculate_apic_map(apic->vcpu->kvm);
> > +}
> > +
> > +static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > +{
> > +	apic_set_reg(apic, APIC_LDR, id);
> > +	recalculate_apic_map(apic->vcpu->kvm);
> > +}
> > +
> >  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> >  {
> >  	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
> > @@ -193,11 +274,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
> >  	apic_set_reg(apic, APIC_LVR, v);
> >  }
> >  
> > -static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > -{
> > -	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > -}
> > -
> >  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
> >  	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
> >  	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
> > @@ -477,6 +553,79 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> >  	return result;
> >  }
> >  
> > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > +		struct kvm_lapic_irq *irq, int *r)
> > +{
> > +	struct kvm_apic_map *map;
> > +	unsigned long bitmap = 1;
> > +	struct kvm_lapic **dst;
> > +	int i;
> > +
> > +	*r = -1;
> > +
> > +	if (irq->shorthand == APIC_DEST_SELF) {
> > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > +		return true;
> > +	}
> > +
> > +	if (irq->shorthand)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +	if (!map) {
> > +		rcu_read_unlock();
> > +		return false;
> > +	}
> > +
> > +	if (irq->dest_mode == 0) { /* physical mode */
> > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > +				irq->dest_id == 0xff) {
> > +			rcu_read_unlock();
> > +			return false;
> > +		}
> > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > +	} else {
> > +		u16 cid, lid;
> > +		u32 mda = irq->dest_id;
> > +
> > +		if (map->ldr_bits == 8)
> > +			mda <<= 24;
> > +
> > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > +				&cid, &lid);
> > +		dst = map->logical_map[cid];
> > +
> > +		bitmap = lid;
> > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > +				hweight_long(bitmap) > 1) {
> > +			int l = -1;
> > +			for_each_set_bit(i, &bitmap, 16) {
> > +				if (!dst[i])
> > +					continue;
> > +				if (l < 0)
> > +					l = i;
> > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > +					l = i;
> > +			}
> > +
> > +			bitmap = (l >= 0) ? 1 << l : 0;
> > +		}
> > +	}
> > +
> > +	for_each_set_bit(i, &bitmap, 16) {
> > +		if (!dst[i])
> > +			continue;
> > +		if (*r < 0)
> > +			*r = 0;
> > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return true;
> > +}
> > +
> >  /*
> >   * Add a pending IRQ into lapic.
> >   * Return 1 if successfully added and 0 if discarded.
> > @@ -880,7 +1029,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  	switch (reg) {
> >  	case APIC_ID:		/* Local APIC ID */
> >  		if (!apic_x2apic_mode(apic))
> > -			apic_set_reg(apic, APIC_ID, val);
> > +			kvm_apic_set_id(apic, val >> 24);
> >  		else
> >  			ret = 1;
> >  		break;
> > @@ -896,15 +1045,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  
> >  	case APIC_LDR:
> >  		if (!apic_x2apic_mode(apic))
> > -			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
> > +			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
> >  		else
> >  			ret = 1;
> >  		break;
> >  
> >  	case APIC_DFR:
> > -		if (!apic_x2apic_mode(apic))
> > +		if (!apic_x2apic_mode(apic)) {
> >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> > -		else
> > +			recalculate_apic_map(apic->vcpu->kvm);
> > +		} else
> >  			ret = 1;
> >  		break;
> >  
> > @@ -1135,6 +1285,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> >  			static_key_slow_dec_deferred(&apic_hw_disabled);
> >  		else
> >  			static_key_slow_inc(&apic_hw_disabled.key);
> > +		recalculate_apic_map(vcpu->kvm);
> >  	}
> >  
> >  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> > @@ -1144,7 +1295,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> >  	if (apic_x2apic_mode(apic)) {
> >  		u32 id = kvm_apic_id(apic);
> >  		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
> > -		apic_set_reg(apic, APIC_LDR, ldr);
> > +		kvm_apic_set_ldr(apic, ldr);
> >  	}
> >  	apic->base_address = apic->vcpu->arch.apic_base &
> >  			     MSR_IA32_APICBASE_BASE;
> > @@ -1169,7 +1320,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> >  	/* Stop the timer in case it's a reset to an active apic */
> >  	hrtimer_cancel(&apic->lapic_timer.timer);
> >  
> > -	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
> > +	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
> >  	kvm_apic_set_version(apic->vcpu);
> >  
> >  	for (i = 0; i < APIC_LVT_NUM; i++)
> > @@ -1180,7 +1331,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> >  	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
> >  	apic_set_spiv(apic, 0xff);
> >  	apic_set_reg(apic, APIC_TASKPRI, 0);
> > -	apic_set_reg(apic, APIC_LDR, 0);
> > +	kvm_apic_set_ldr(apic, 0);
> >  	apic_set_reg(apic, APIC_ESR, 0);
> >  	apic_set_reg(apic, APIC_ICR, 0);
> >  	apic_set_reg(apic, APIC_ICR2, 0);
> > @@ -1398,6 +1549,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> >  	/* set SPIV separately to get count of SW disabled APICs right */
> >  	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
> >  	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
> > +	/* call kvm_apic_set_id() to put apic into apic_map */
> > +	kvm_apic_set_id(apic, kvm_apic_id(apic));
> >  	kvm_apic_set_version(vcpu);
> >  
> >  	apic_update_ppr(apic);
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 615a8b0..e5ebf9f 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -52,6 +52,9 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
> >  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> >  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> >  
> > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > +		struct kvm_lapic_irq *irq, int *r);
> > +
> >  u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
> >  void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
> >  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index f91e2c9..db6db8f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6269,6 +6269,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> >  	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
> >  
> >  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
> > +	mutex_init(&kvm->arch.apic_map_lock);
> >  
> >  	return 0;
> >  }
> > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> >  		put_page(kvm->arch.apic_access_page);
> >  	if (kvm->arch.ept_identity_pagetable)
> >  		put_page(kvm->arch.ept_identity_pagetable);
> > +	kfree(kvm->arch.apic_map);
> >  }
> >  
> >  void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > index 7118be0..3ca89c4 100644
> > --- a/virt/kvm/irq_comm.c
> > +++ b/virt/kvm/irq_comm.c
> > @@ -68,8 +68,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  	struct kvm_vcpu *vcpu, *lowest = NULL;
> >  
> >  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> > -			kvm_is_dm_lowest_prio(irq))
> > +			kvm_is_dm_lowest_prio(irq)) {
> >  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
> > +		irq->delivery_mode = APIC_DM_FIXED;
> > +	}
> > +
> > +	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
> > +		return r;
> >  
> >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> >  		if (!kvm_apic_present(vcpu))
> > --
> > 			Gleb.
> 

Avi Kivity Sept. 11, 2012, 8:04 p.m. UTC | #7
On 09/11/2012 08:13 PM, Paul E. McKenney wrote:
> > Is there a risk of DOS if RCU is delayed while
> > lots of memory is queued up in this way?
> > If yes is this a generic problem with kfree_rcu
> > that should be addressed in core kernel?
>
> There is indeed a risk.  The kfree_rcu() implementation cannot really
> decide what to do here, especially given that it is callable with irqs
> disabled.
>
> The usual approach is to keep a per-CPU counter and count it down from
> some number for each kfree_rcu().  When it reaches zero, invoke
> synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> "some number" mentioned above.
>
> In theory, I could create an API that did this.  In practice, I have no
> idea how to choose the number -- much depends on the size of the object
> being freed, for example.

Perhaps approach it from the other direction?  If we are under memory
pressure, start synchronize_rcu()ing, much like the shrinker operates.
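
A very rough sketch of that direction (the pending counter and its updates
are assumed to live in the free path; shrinker details from memory, untested):

	static atomic_t apic_map_pending;	/* maps queued for freeing */

	static int apic_map_shrink(struct shrinker *s, struct shrink_control *sc)
	{
		/* under memory pressure and allowed to sleep: flush queued frees */
		if (sc->nr_to_scan && (sc->gfp_mask & __GFP_WAIT))
			synchronize_rcu();
		return atomic_read(&apic_map_pending);
	}

	static struct shrinker apic_map_shrinker = {
		.shrink	= apic_map_shrink,
		.seeks	= DEFAULT_SEEKS,
	};

	/* at module init: register_shrinker(&apic_map_shrinker); */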
Michael S. Tsirkin Sept. 11, 2012, 10:33 p.m. UTC | #8
On Tue, Sep 11, 2012 at 10:13:00AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 11, 2012 at 05:10:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Sep 11, 2012 at 04:02:25PM +0300, Gleb Natapov wrote:
> > > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > > find the interrupt destination instead of looping through all vcpus. In
> > > logical mode, loop only through the vcpus in the logical cluster the irq
> > > is sent to.
> > > 
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > 
> > Added Paul just to make sure we are using RCU correctly.
> > Paul could you pls answer the question below?
> > 
> > > ---
> > >  Changelog:
> > > 
> > >   - v1->v2
> > >    * fix race Avi noticed
> > >    * rcu_read_lock() out of the block as per Avi
> > >    * fix rcu issues pointed out by MST. All but one. Still use
> > >      call_rcu(). I do not think this is a serious issue; if it is, it
> > >      should be solved by the RCU subsystem.
> > >    * fix phys_map overflow pointed out by MST
> > >    * recalculate_apic_map() does not return an error any more.
> > >    * add an optimization for lowest-priority logical mode with a single
> > >      cpu as the destination (it happens)
> > > 
> > > I did not rewrite kvm_irq_delivery_to_apic_fast() MST's way since my way
> > > looks cleaner to me.
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 64adb61..3ba8951 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
> > >  	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
> > >  };
> > >  
> > > +struct kvm_apic_map {
> > > +	struct rcu_head rcu;
> > > +	bool flat;
> > > +	u8 ldr_bits;
> > > +	struct kvm_lapic *phys_map[256];
> > > +	struct kvm_lapic *logical_map[16][16];
> > > +};
> > > +
> > >  struct kvm_arch {
> > >  	unsigned int n_used_mmu_pages;
> > >  	unsigned int n_requested_mmu_pages;
> > > @@ -528,6 +536,8 @@ struct kvm_arch {
> > >  	struct kvm_ioapic *vioapic;
> > >  	struct kvm_pit *vpit;
> > >  	int vapics_in_nmi_mode;
> > > +	struct kvm_apic_map *apic_map;
> > > +	struct mutex apic_map_lock;
> > >  
> > >  	unsigned int tss_addr;
> > >  	struct page *apic_access_page;
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index 07ad628..06672e8 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -139,11 +139,92 @@ static inline int apic_enabled(struct kvm_lapic *apic)
> > >  	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
> > >  	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
> > >  
> > > +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > +{
> > > +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > +}
> > > +
> > >  static inline int kvm_apic_id(struct kvm_lapic *apic)
> > >  {
> > >  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> > >  }
> > >  
> > > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > > +		u16 *cid, u16 *lid)
> > > +{
> > > +	if (ldr_bits == 32) {
> > > +		*cid = ldr >> 16;
> > > +		*lid = ldr & 0xffff;
> > > +	} else {
> > > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > > +
> > > +		if (flat) {
> > > +			*cid = 0;
> > > +			*lid = ldr;
> > > +		} else {
> > > +			*cid = ldr >> 4;
> > > +			*lid = ldr & 0xf;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static inline void recalculate_apic_map(struct kvm *kvm)
> > > +{
> > > +	struct kvm_apic_map *new, *old = NULL;
> > > +	struct kvm_vcpu *vcpu;
> > > +	int i;
> > > +
> > > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > > +
> > > +	mutex_lock(&kvm->arch.apic_map_lock);
> > > +
> > > +	if (!new)
> > > +		goto out;
> > > +
> > > +	new->ldr_bits = 8;
> > > +	new->flat = true;
> > > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > +		u16 cid, lid;
> > > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > > +
> > > +		if (!kvm_apic_present(vcpu))
> > > +			continue;
> > > +
> > > +		if (apic_x2apic_mode(apic)) {
> > > +			new->ldr_bits = 32;
> > > +			new->flat = false;
> > > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > > +			new->flat = false;
> > > +
> > > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > > +				new->flat, new->ldr_bits, &cid, &lid);
> > > +
> > > +		if (lid)
> > > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > > +	}
> > > +out:
> > > +	old = kvm->arch.apic_map;
> > > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > > +
> > > +	if (old)
> > > +		kfree_rcu(old, rcu);
> > > +}
> > 
> > Paul, I'd like to check something with you here:
> > this function can be triggered by userspace,
> > any number of times; we allocate
> > a 2K chunk of memory that is later freed by
> > kfree_rcu.
> > 
> > Is there a risk of DOS if RCU is delayed while
> > lots of memory is queued up in this way?
> > If yes is this a generic problem with kfree_rcu
> > that should be addressed in core kernel?
> 
> There is indeed a risk.

In our case it's a 2K object. Is it a practical risk?

> The kfree_rcu() implementation cannot really
> decide what to do here, especially given that it is callable with irqs
> disabled.
> 
> The usual approach is to keep a per-CPU counter and count it down from
> some number for each kfree_rcu().  When it reaches zero, invoke
> synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> "some number" mentioned above.

It is a bit of a concern for me that this will hurt worst-case latency
for realtime guests.  In our case, we return an error and this will
fall back on not allocating memory and using the slow all-CPU scan.
One possible scheme that relies on this (rough sketch below) is:
	- increment an atomic counter, per vcpu. If above threshold ->
		return with an error
	- call_rcu (+ barrier on vcpu destruction)
	- within the callback, decrement the atomic counter
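
Roughly like this -- the per-vcpu field, the limit and the back-pointer are
all invented here, 'vcpu' is the one whose apic update triggered the
recalculation, and this is the v1-style variant where recalculate_apic_map()
can still fail:

	#define APIC_MAP_INFLIGHT_MAX	64

	static void apic_map_free(struct rcu_head *head)
	{
		struct kvm_apic_map *map =
			container_of(head, struct kvm_apic_map, rcu);

		atomic_dec(&map->vcpu->arch.apic_map_inflight);
		kfree(map);
	}

	/* instead of kfree_rcu(old, rcu) in recalculate_apic_map(): */
	if (atomic_inc_return(&vcpu->arch.apic_map_inflight) >
			APIC_MAP_INFLIGHT_MAX) {
		atomic_dec(&vcpu->arch.apic_map_inflight);
		return -EBUSY;	/* caller falls back to the slow all-vcpu scan */
	}
	old->vcpu = vcpu;	/* invented back-pointer so the callback finds the counter */
	call_rcu(&old->rcu, apic_map_free);

	/* vcpu teardown then needs an rcu_barrier() before the counter goes away */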

> In theory, I could create an API that did this.  In practice, I have no
> idea how to choose the number -- much depends on the size of the object
> being freed, for example.

We could pass an object size, no problem :)

> The RCU usage looks otherwise sane at first glance.
> 
> 							Thanx, Paul
> 
> > Thanks!
> > 
> > > +
> > > +static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > > +{
> > > +	apic_set_reg(apic, APIC_ID, id << 24);
> > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > > +static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > > +{
> > > +	apic_set_reg(apic, APIC_LDR, id);
> > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > >  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> > >  {
> > >  	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
> > > @@ -193,11 +274,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
> > >  	apic_set_reg(apic, APIC_LVR, v);
> > >  }
> > >  
> > > -static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > -{
> > > -	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > -}
> > > -
> > >  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
> > >  	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
> > >  	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
> > > @@ -477,6 +553,79 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> > >  	return result;
> > >  }
> > >  
> > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > +		struct kvm_lapic_irq *irq, int *r)
> > > +{
> > > +	struct kvm_apic_map *map;
> > > +	unsigned long bitmap = 1;
> > > +	struct kvm_lapic **dst;
> > > +	int i;
> > > +
> > > +	*r = -1;
> > > +
> > > +	if (irq->shorthand == APIC_DEST_SELF) {
> > > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > > +		return true;
> > > +	}
> > > +
> > > +	if (irq->shorthand)
> > > +		return false;
> > > +
> > > +	rcu_read_lock();
> > > +	map = rcu_dereference(kvm->arch.apic_map);
> > > +
> > > +	if (!map) {
> > > +		rcu_read_unlock();
> > > +		return false;
> > > +	}
> > > +
> > > +	if (irq->dest_mode == 0) { /* physical mode */
> > > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > > +				irq->dest_id == 0xff) {
> > > +			rcu_read_unlock();
> > > +			return false;
> > > +		}
> > > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > > +	} else {
> > > +		u16 cid, lid;
> > > +		u32 mda = irq->dest_id;
> > > +
> > > +		if (map->ldr_bits == 8)
> > > +			mda <<= 24;
> > > +
> > > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > > +				&cid, &lid);
> > > +		dst = map->logical_map[cid];
> > > +
> > > +		bitmap = lid;
> > > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > > +				hweight_long(bitmap) > 1) {
> > > +			int l = -1;
> > > +			for_each_set_bit(i, &bitmap, 16) {
> > > +				if (!dst[i])
> > > +					continue;
> > > +				if (l < 0)
> > > +					l = i;
> > > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > > +					l = i;
> > > +			}
> > > +
> > > +			bitmap = (l >= 0) ? 1 << l : 0;
> > > +		}
> > > +	}
> > > +
> > > +	for_each_set_bit(i, &bitmap, 16) {
> > > +		if (!dst[i])
> > > +			continue;
> > > +		if (*r < 0)
> > > +			*r = 0;
> > > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > > +	}
> > > +
> > > +	rcu_read_unlock();
> > > +	return true;
> > > +}
> > > +
> > >  /*
> > >   * Add a pending IRQ into lapic.
> > >   * Return 1 if successfully added and 0 if discarded.
> > > @@ -880,7 +1029,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > >  	switch (reg) {
> > >  	case APIC_ID:		/* Local APIC ID */
> > >  		if (!apic_x2apic_mode(apic))
> > > -			apic_set_reg(apic, APIC_ID, val);
> > > +			kvm_apic_set_id(apic, val >> 24);
> > >  		else
> > >  			ret = 1;
> > >  		break;
> > > @@ -896,15 +1045,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > >  
> > >  	case APIC_LDR:
> > >  		if (!apic_x2apic_mode(apic))
> > > -			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
> > > +			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
> > >  		else
> > >  			ret = 1;
> > >  		break;
> > >  
> > >  	case APIC_DFR:
> > > -		if (!apic_x2apic_mode(apic))
> > > +		if (!apic_x2apic_mode(apic)) {
> > >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> > > -		else
> > > +			recalculate_apic_map(apic->vcpu->kvm);
> > > +		} else
> > >  			ret = 1;
> > >  		break;
> > >  
> > > @@ -1135,6 +1285,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > >  			static_key_slow_dec_deferred(&apic_hw_disabled);
> > >  		else
> > >  			static_key_slow_inc(&apic_hw_disabled.key);
> > > +		recalculate_apic_map(vcpu->kvm);
> > >  	}
> > >  
> > >  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> > > @@ -1144,7 +1295,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > >  	if (apic_x2apic_mode(apic)) {
> > >  		u32 id = kvm_apic_id(apic);
> > >  		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
> > > -		apic_set_reg(apic, APIC_LDR, ldr);
> > > +		kvm_apic_set_ldr(apic, ldr);
> > >  	}
> > >  	apic->base_address = apic->vcpu->arch.apic_base &
> > >  			     MSR_IA32_APICBASE_BASE;
> > > @@ -1169,7 +1320,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > >  	/* Stop the timer in case it's a reset to an active apic */
> > >  	hrtimer_cancel(&apic->lapic_timer.timer);
> > >  
> > > -	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
> > > +	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
> > >  	kvm_apic_set_version(apic->vcpu);
> > >  
> > >  	for (i = 0; i < APIC_LVT_NUM; i++)
> > > @@ -1180,7 +1331,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > >  	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
> > >  	apic_set_spiv(apic, 0xff);
> > >  	apic_set_reg(apic, APIC_TASKPRI, 0);
> > > -	apic_set_reg(apic, APIC_LDR, 0);
> > > +	kvm_apic_set_ldr(apic, 0);
> > >  	apic_set_reg(apic, APIC_ESR, 0);
> > >  	apic_set_reg(apic, APIC_ICR, 0);
> > >  	apic_set_reg(apic, APIC_ICR2, 0);
> > > @@ -1398,6 +1549,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > >  	/* set SPIV separately to get count of SW disabled APICs right */
> > >  	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
> > >  	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
> > > +	/* call kvm_apic_set_id() to put apic into apic_map */
> > > +	kvm_apic_set_id(apic, kvm_apic_id(apic));
> > >  	kvm_apic_set_version(vcpu);
> > >  
> > >  	apic_update_ppr(apic);
> > > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > > index 615a8b0..e5ebf9f 100644
> > > --- a/arch/x86/kvm/lapic.h
> > > +++ b/arch/x86/kvm/lapic.h
> > > @@ -52,6 +52,9 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
> > >  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> > >  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> > >  
> > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > +		struct kvm_lapic_irq *irq, int *r);
> > > +
> > >  u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
> > >  void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
> > >  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f91e2c9..db6db8f 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -6269,6 +6269,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> > >  	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
> > >  
> > >  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
> > > +	mutex_init(&kvm->arch.apic_map_lock);
> > >  
> > >  	return 0;
> > >  }
> > > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> > >  		put_page(kvm->arch.apic_access_page);
> > >  	if (kvm->arch.ept_identity_pagetable)
> > >  		put_page(kvm->arch.ept_identity_pagetable);
> > > +	kfree(kvm->arch.apic_map);
> > >  }
> > >  
> > >  void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > index 7118be0..3ca89c4 100644
> > > --- a/virt/kvm/irq_comm.c
> > > +++ b/virt/kvm/irq_comm.c
> > > @@ -68,8 +68,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> > >  	struct kvm_vcpu *vcpu, *lowest = NULL;
> > >  
> > >  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> > > -			kvm_is_dm_lowest_prio(irq))
> > > +			kvm_is_dm_lowest_prio(irq)) {
> > >  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
> > > +		irq->delivery_mode = APIC_DM_FIXED;
> > > +	}
> > > +
> > > +	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
> > > +		return r;
> > >  
> > >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> > >  		if (!kvm_apic_present(vcpu))
> > > --
> > > 			Gleb.
> > 
Michael S. Tsirkin Sept. 11, 2012, 10:39 p.m. UTC | #9
On Tue, Sep 11, 2012 at 11:04:59PM +0300, Avi Kivity wrote:
> On 09/11/2012 08:13 PM, Paul E. McKenney wrote:
> > > Is there a risk of DOS if RCU is delayed while
> > > lots of memory is queued up in this way?
> > > If yes is this a generic problem with kfree_rcu
> > > that should be addressed in core kernel?
> >
> > There is indeed a risk.  The kfree_rcu() implementation cannot really
> > decide what to do here, especially given that it is callable with irqs
> > disabled.
> >
> > The usual approach is to keep a per-CPU counter and count it down from
> > some number for each kfree_rcu().  When it reaches zero, invoke
> > synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> > "some number" mentioned above.
> >
> > In theory, I could create an API that did this.  In practice, I have no
> > idea how to choose the number -- much depends on the size of the object
> > being freed, for example.
> 
> Perhaps approach it from the other direction?  If we are under memory
> pressure, start synchronize_rcu()ing, much like the shrinker operates.
> 

Tricky ...

For now, how about we call synchronize_rcu_expedited() in kvm and call it a
day (rough sketch below)?  It also has the advantage that the apic map is
guaranteed to be in sync with the guest - while it seems that it's already
correct as is, the synchronous operation is way simpler.

We can add a tracepoint so that we can detect whether this starts
happening a lot for some guest.
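
Concretely, the tail of recalculate_apic_map() from the patch above would
become something like this (sketch, not the actual submission):

out:
	old = kvm->arch.apic_map;
	rcu_assign_pointer(kvm->arch.apic_map, new);
	mutex_unlock(&kvm->arch.apic_map_lock);

	if (old) {
		/* Wait for all readers of the old map and free it inline.
		 * This bounds how much memory a guest can keep queued, at
		 * the cost of blocking the updater for an (expedited)
		 * grace period. */
		synchronize_rcu_expedited();
		kfree(old);
	}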

> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
Paul E. McKenney Sept. 12, 2012, 1:03 a.m. UTC | #10
On Wed, Sep 12, 2012 at 01:33:37AM +0300, Michael S. Tsirkin wrote:
> On Tue, Sep 11, 2012 at 10:13:00AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 11, 2012 at 05:10:23PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Sep 11, 2012 at 04:02:25PM +0300, Gleb Natapov wrote:
> > > > Most interrupt are delivered to only one vcpu. Use pre-build tables to
> > > > find interrupt destination instead of looping through all vcpus. In case
> > > > of logical mode loop only through vcpus in a logical cluster irq is sent
> > > > to.
> > > > 
> > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > 
> > > Added Paul just to make sure we are using RCU correctly.
> > > Paul could you pls answer the question below?
> > > 
> > > > ---
> > > >  Changelog:
> > > > 
> > > >   - v1->v2
> > > >    * fix race Avi noticed
> > > >    * rcu_read_lock() out of the block as per Avi
> > > >    * fix rcu issues pointed to by MST. All but one. Still use
> > > >      call_rcu(). Do not think this is serious issue. If it is should be
> > > >      solved by RCU subsystem.
> > > >    * Fix phys_map overflow pointed to by MST
> > > >    * recalculate_apic_map() does not return error any more.
> > > >    * add optimization for low prio logical mode with one cpu as dst (it
> > > >      happens)
> > > > 
> > > > I did not rewrote kvm_irq_delivery_to_apic_fast() MST way since my way
> > > > looks cleaner to me.
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 64adb61..3ba8951 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
> > > >  	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
> > > >  };
> > > >  
> > > > +struct kvm_apic_map {
> > > > +	struct rcu_head rcu;
> > > > +	bool flat;
> > > > +	u8 ldr_bits;
> > > > +	struct kvm_lapic *phys_map[256];
> > > > +	struct kvm_lapic *logical_map[16][16];
> > > > +};
> > > > +
> > > >  struct kvm_arch {
> > > >  	unsigned int n_used_mmu_pages;
> > > >  	unsigned int n_requested_mmu_pages;
> > > > @@ -528,6 +536,8 @@ struct kvm_arch {
> > > >  	struct kvm_ioapic *vioapic;
> > > >  	struct kvm_pit *vpit;
> > > >  	int vapics_in_nmi_mode;
> > > > +	struct kvm_apic_map *apic_map;
> > > > +	struct mutex apic_map_lock;
> > > >  
> > > >  	unsigned int tss_addr;
> > > >  	struct page *apic_access_page;
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 07ad628..06672e8 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -139,11 +139,92 @@ static inline int apic_enabled(struct kvm_lapic *apic)
> > > >  	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
> > > >  	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
> > > >  
> > > > +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > > +{
> > > > +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > > +}
> > > > +
> > > >  static inline int kvm_apic_id(struct kvm_lapic *apic)
> > > >  {
> > > >  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> > > >  }
> > > >  
> > > > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > > > +		u16 *cid, u16 *lid)
> > > > +{
> > > > +	if (ldr_bits == 32) {
> > > > +		*cid = ldr >> 16;
> > > > +		*lid = ldr & 0xffff;
> > > > +	} else {
> > > > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > > > +
> > > > +		if (flat) {
> > > > +			*cid = 0;
> > > > +			*lid = ldr;
> > > > +		} else {
> > > > +			*cid = ldr >> 4;
> > > > +			*lid = ldr & 0xf;
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static inline void recalculate_apic_map(struct kvm *kvm)
> > > > +{
> > > > +	struct kvm_apic_map *new, *old = NULL;
> > > > +	struct kvm_vcpu *vcpu;
> > > > +	int i;
> > > > +
> > > > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > > > +
> > > > +	mutex_lock(&kvm->arch.apic_map_lock);
> > > > +
> > > > +	if (!new)
> > > > +		goto out;
> > > > +
> > > > +	new->ldr_bits = 8;
> > > > +	new->flat = true;
> > > > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > > +		u16 cid, lid;
> > > > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > > > +
> > > > +		if (!kvm_apic_present(vcpu))
> > > > +			continue;
> > > > +
> > > > +		if (apic_x2apic_mode(apic)) {
> > > > +			new->ldr_bits = 32;
> > > > +			new->flat = false;
> > > > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > > > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > > > +			new->flat = false;
> > > > +
> > > > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > > > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > > > +				new->flat, new->ldr_bits, &cid, &lid);
> > > > +
> > > > +		if (lid)
> > > > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > > > +	}
> > > > +out:
> > > > +	old = kvm->arch.apic_map;
> > > > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > > > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > > > +
> > > > +	if (old)
> > > > +		kfree_rcu(old, rcu);
> > > > +}
> > > 
> > > Paul, I'd like to check something with you here:
> > > this function can be triggered by userspace,
> > > any number of times; we allocate
> > > a 2K chunk of memory that is later freed by
> > > kfree_rcu.
> > > 
> > > Is there a risk of DOS if RCU is delayed while
> > > lots of memory is queued up in this way?
> > > If yes is this a generic problem with kfree_rcu
> > > that should be addressed in core kernel?
> > 
> > There is indeed a risk.
> 
> In our case it's a 2K object. Is it a practical risk?

How many kfree_rcu()s per second can a given user cause to happen?

> > The kfree_rcu() implementation cannot really
> > decide what to do here, especially given that it is callable with irqs
> > disabled.
> > 
> > The usual approach is to keep a per-CPU counter and count it down from
> > some number for each kfree_rcu().  When it reaches zero, invoke
> > synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> > "some number" mentioned above.
> 
> It is a bit of a concern for me that this will hurt worst-case latency
> for realtime guests.  In our case, we return error and this will
> fall back on not allocating memory and using slow all-CPU scan.
> One possible scheme that relies on this is:
> 	- increment an atomic counter, per vcpu. If above threshold ->
> 		return with error
> 	- call_rcu (+ barrier vcpu destruct)
> 	- within callback decrement an atomic counter

That certainly is a possibility, but...

> > In theory, I could create an API that did this.  In practice, I have no
> > idea how to choose the number -- much depends on the size of the object
> > being freed, for example.
> 
> We could pass an object size, no problem :)

... before putting too much additional effort into possible solutions,
why not force the problem to occur and see what actually happens?  We
would then be in a much better position to work out what should be done.

							Thanx, Paul

> > The RCU usage looks otherwise sane at first glance.
> > 
> > 							Thanx, Paul
> > 
> > > Thanks!
> > > 
> > > > +
> > > > +static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > > > +{
> > > > +	apic_set_reg(apic, APIC_ID, id << 24);
> > > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > > +}
> > > > +
> > > > +static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > > > +{
> > > > +	apic_set_reg(apic, APIC_LDR, id);
> > > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > > +}
> > > > +
> > > >  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> > > >  {
> > > >  	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
> > > > @@ -193,11 +274,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
> > > >  	apic_set_reg(apic, APIC_LVR, v);
> > > >  }
> > > >  
> > > > -static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > > -{
> > > > -	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > > -}
> > > > -
> > > >  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
> > > >  	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
> > > >  	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
> > > > @@ -477,6 +553,79 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> > > >  	return result;
> > > >  }
> > > >  
> > > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > > +		struct kvm_lapic_irq *irq, int *r)
> > > > +{
> > > > +	struct kvm_apic_map *map;
> > > > +	unsigned long bitmap = 1;
> > > > +	struct kvm_lapic **dst;
> > > > +	int i;
> > > > +
> > > > +	*r = -1;
> > > > +
> > > > +	if (irq->shorthand == APIC_DEST_SELF) {
> > > > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > > > +		return true;
> > > > +	}
> > > > +
> > > > +	if (irq->shorthand)
> > > > +		return false;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	map = rcu_dereference(kvm->arch.apic_map);
> > > > +
> > > > +	if (!map) {
> > > > +		rcu_read_unlock();
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	if (irq->dest_mode == 0) { /* physical mode */
> > > > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > > > +				irq->dest_id == 0xff) {
> > > > +			rcu_read_unlock();
> > > > +			return false;
> > > > +		}
> > > > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > > > +	} else {
> > > > +		u16 cid, lid;
> > > > +		u32 mda = irq->dest_id;
> > > > +
> > > > +		if (map->ldr_bits == 8)
> > > > +			mda <<= 24;
> > > > +
> > > > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > > > +				&cid, &lid);
> > > > +		dst = map->logical_map[cid];
> > > > +
> > > > +		bitmap = lid;
> > > > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > > > +				hweight_long(bitmap) > 1) {
> > > > +			int l = -1;
> > > > +			for_each_set_bit(i, &bitmap, 16) {
> > > > +				if (!dst[i])
> > > > +					continue;
> > > > +				if (l < 0)
> > > > +					l = i;
> > > > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > > > +					l = i;
> > > > +			}
> > > > +
> > > > +			bitmap = (l >= 0) ? 1 << l : 0;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	for_each_set_bit(i, &bitmap, 16) {
> > > > +		if (!dst[i])
> > > > +			continue;
> > > > +		if (*r < 0)
> > > > +			*r = 0;
> > > > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > > > +	}
> > > > +
> > > > +	rcu_read_unlock();
> > > > +	return true;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Add a pending IRQ into lapic.
> > > >   * Return 1 if successfully added and 0 if discarded.
> > > > @@ -880,7 +1029,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > > >  	switch (reg) {
> > > >  	case APIC_ID:		/* Local APIC ID */
> > > >  		if (!apic_x2apic_mode(apic))
> > > > -			apic_set_reg(apic, APIC_ID, val);
> > > > +			kvm_apic_set_id(apic, val >> 24);
> > > >  		else
> > > >  			ret = 1;
> > > >  		break;
> > > > @@ -896,15 +1045,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > > >  
> > > >  	case APIC_LDR:
> > > >  		if (!apic_x2apic_mode(apic))
> > > > -			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
> > > > +			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
> > > >  		else
> > > >  			ret = 1;
> > > >  		break;
> > > >  
> > > >  	case APIC_DFR:
> > > > -		if (!apic_x2apic_mode(apic))
> > > > +		if (!apic_x2apic_mode(apic)) {
> > > >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> > > > -		else
> > > > +			recalculate_apic_map(apic->vcpu->kvm);
> > > > +		} else
> > > >  			ret = 1;
> > > >  		break;
> > > >  
> > > > @@ -1135,6 +1285,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > > >  			static_key_slow_dec_deferred(&apic_hw_disabled);
> > > >  		else
> > > >  			static_key_slow_inc(&apic_hw_disabled.key);
> > > > +		recalculate_apic_map(vcpu->kvm);
> > > >  	}
> > > >  
> > > >  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> > > > @@ -1144,7 +1295,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > > >  	if (apic_x2apic_mode(apic)) {
> > > >  		u32 id = kvm_apic_id(apic);
> > > >  		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
> > > > -		apic_set_reg(apic, APIC_LDR, ldr);
> > > > +		kvm_apic_set_ldr(apic, ldr);
> > > >  	}
> > > >  	apic->base_address = apic->vcpu->arch.apic_base &
> > > >  			     MSR_IA32_APICBASE_BASE;
> > > > @@ -1169,7 +1320,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > > >  	/* Stop the timer in case it's a reset to an active apic */
> > > >  	hrtimer_cancel(&apic->lapic_timer.timer);
> > > >  
> > > > -	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
> > > > +	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
> > > >  	kvm_apic_set_version(apic->vcpu);
> > > >  
> > > >  	for (i = 0; i < APIC_LVT_NUM; i++)
> > > > @@ -1180,7 +1331,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > > >  	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
> > > >  	apic_set_spiv(apic, 0xff);
> > > >  	apic_set_reg(apic, APIC_TASKPRI, 0);
> > > > -	apic_set_reg(apic, APIC_LDR, 0);
> > > > +	kvm_apic_set_ldr(apic, 0);
> > > >  	apic_set_reg(apic, APIC_ESR, 0);
> > > >  	apic_set_reg(apic, APIC_ICR, 0);
> > > >  	apic_set_reg(apic, APIC_ICR2, 0);
> > > > @@ -1398,6 +1549,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > > >  	/* set SPIV separately to get count of SW disabled APICs right */
> > > >  	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
> > > >  	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
> > > > +	/* call kvm_apic_set_id() to put apic into apic_map */
> > > > +	kvm_apic_set_id(apic, kvm_apic_id(apic));
> > > >  	kvm_apic_set_version(vcpu);
> > > >  
> > > >  	apic_update_ppr(apic);
> > > > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > > > index 615a8b0..e5ebf9f 100644
> > > > --- a/arch/x86/kvm/lapic.h
> > > > +++ b/arch/x86/kvm/lapic.h
> > > > @@ -52,6 +52,9 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
> > > >  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> > > >  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> > > >  
> > > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > > +		struct kvm_lapic_irq *irq, int *r);
> > > > +
> > > >  u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
> > > >  void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
> > > >  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index f91e2c9..db6db8f 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -6269,6 +6269,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> > > >  	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
> > > >  
> > > >  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
> > > > +	mutex_init(&kvm->arch.apic_map_lock);
> > > >  
> > > >  	return 0;
> > > >  }
> > > > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> > > >  		put_page(kvm->arch.apic_access_page);
> > > >  	if (kvm->arch.ept_identity_pagetable)
> > > >  		put_page(kvm->arch.ept_identity_pagetable);
> > > > +	kfree(kvm->arch.apic_map);
> > > >  }
> > > >  
> > > >  void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > index 7118be0..3ca89c4 100644
> > > > --- a/virt/kvm/irq_comm.c
> > > > +++ b/virt/kvm/irq_comm.c
> > > > @@ -68,8 +68,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> > > >  	struct kvm_vcpu *vcpu, *lowest = NULL;
> > > >  
> > > >  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> > > > -			kvm_is_dm_lowest_prio(irq))
> > > > +			kvm_is_dm_lowest_prio(irq)) {
> > > >  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
> > > > +		irq->delivery_mode = APIC_DM_FIXED;
> > > > +	}
> > > > +
> > > > +	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
> > > > +		return r;
> > > >  
> > > >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > >  		if (!kvm_apic_present(vcpu))
> > > > --
> > > > 			Gleb.
> > > 
> 

Avi Kivity Sept. 12, 2012, 7:41 a.m. UTC | #11
On 09/12/2012 01:39 AM, Michael S. Tsirkin wrote:
> On Tue, Sep 11, 2012 at 11:04:59PM +0300, Avi Kivity wrote:
>> On 09/11/2012 08:13 PM, Paul E. McKenney wrote:
>> > > Is there a risk of DOS if RCU is delayed while
>> > > lots of memory is queued up in this way?
>> > > If yes is this a generic problem with kfree_rcu
>> > > that should be addressed in core kernel?
>> >
>> > There is indeed a risk.  The kfree_rcu() implementation cannot really
>> > decide what to do here, especially given that it is callable with irqs
>> > disabled.
>> >
>> > The usual approach is to keep a per-CPU counter and count it down from
>> > some number for each kfree_rcu().  When it reaches zero, invoke
>> > synchronize_rcu() as well as kfree_rcu(), and then reset it to the
>> > "some number" mentioned above.
>> >
>> > In theory, I could create an API that did this.  In practice, I have no
>> > idea how to choose the number -- much depends on the size of the object
>> > being freed, for example.
>> 
>> Perhaps approach it from the other direction?  If we are under memory
>> pressure, start synchronize_rcu()ing, much like the shrinker operates.
>> 
> 
> Tricky ...
> 
> For now, how about we call synchronize_rcu_expedited in kvm and call it a day?

I prefer to let the rcu people fix it.

> Also has an advantage that apic map is guaranteed to be in sync
> with guest - while it seems that it's already correct as is,
> synchronous operation is way simpler.

It works exactly the same way.  Interrupts started in parallel with an
ID update will use either map.  Interrupts started afterwards will use
the new map.

> 
> We can add a tracepoint so that we can detect it if this starts
> happening a lot for some guest.
>

No point, guests don't update their APIC ID (or related) after booting.
Avi Kivity Sept. 12, 2012, 7:45 a.m. UTC | #12
On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
>> > > Paul, I'd like to check something with you here:
>> > > this function can be triggered by userspace,
>> > > any number of times; we allocate
>> > > a 2K chunk of memory that is later freed by
>> > > kfree_rcu.
>> > > 
>> > > Is there a risk of DOS if RCU is delayed while
>> > > lots of memory is queued up in this way?
>> > > If yes is this a generic problem with kfree_rcu
>> > > that should be addressed in core kernel?
>> > 
>> > There is indeed a risk.
>> 
>> In our case it's a 2K object. Is it a practical risk?
> 
> How many kfree_rcu()s per second can a given user cause to happen?

Not much more than a few hundred thousand per second per process (normal
operation is zero).

> 
>> > The kfree_rcu() implementation cannot really
>> > decide what to do here, especially given that it is callable with irqs
>> > disabled.
>> > 
>> > The usual approach is to keep a per-CPU counter and count it down from
>> > some number for each kfree_rcu().  When it reaches zero, invoke
>> > synchronize_rcu() as well as kfree_rcu(), and then reset it to the
>> > "some number" mentioned above.
>> 
>> It is a bit of a concern for me that this will hurt worst-case latency
>> for realtime guests.  In our case, we return error and this will
>> fall back on not allocating memory and using slow all-CPU scan.
>> One possible scheme that relies on this is:
>> 	- increment an atomic counter, per vcpu. If above threshold ->
>> 		return with error
>> 	- call_rcu (+ barrier vcpu destruct)
>> 	- within callback decrement an atomic counter
> 
> That certainly is a possibility, but...
> 
>> > In theory, I could create an API that did this.  In practice, I have no
>> > idea how to choose the number -- much depends on the size of the object
>> > being freed, for example.
>> 
>> We could pass an object size, no problem :)
> 
> ... before putting too much additional effort into possible solutions,
> why not force the problem to occur and see what actually happens?  We
> would then be in a much better position to work out what should be done.

Good idea.  Michael, it should be easy to modify kvm-unit-tests to write
to the APIC ID register in a loop, something along the lines of the sketch below.
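
Rough sketch of such a test (assumes the xAPIC mapped at the default
0xfee00000 base and a u32 type from the test harness; not the exact
kvm-unit-tests API):

static void hammer_apic_id(void)
{
	volatile u32 *apic_id = (volatile u32 *)(0xfee00000UL + 0x20);
	u32 i;

	/* each MMIO write is trapped, goes through apic_reg_write(APIC_ID)
	 * and therefore triggers recalculate_apic_map() + kfree_rcu() */
	for (i = 0; i < (1u << 24); i++)
		*apic_id = (i & 0xff) << 24;
}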
Gleb Natapov Sept. 12, 2012, 12:17 p.m. UTC | #13
On Tue, Sep 11, 2012 at 10:13:00AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 11, 2012 at 05:10:23PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Sep 11, 2012 at 04:02:25PM +0300, Gleb Natapov wrote:
> > > Most interrupt are delivered to only one vcpu. Use pre-build tables to
> > > find interrupt destination instead of looping through all vcpus. In case
> > > of logical mode loop only through vcpus in a logical cluster irq is sent
> > > to.
> > > 
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > 
> > Added Paul just to make sure we are using RCU correctly.
> > Paul could you pls answer the question below?
> > 
> > > ---
> > >  Changelog:
> > > 
> > >   - v1->v2
> > >    * fix race Avi noticed
> > >    * rcu_read_lock() out of the block as per Avi
> > >    * fix rcu issues pointed to by MST. All but one. Still use
> > >      call_rcu(). Do not think this is serious issue. If it is should be
> > >      solved by RCU subsystem.
> > >    * Fix phys_map overflow pointed to by MST
> > >    * recalculate_apic_map() does not return error any more.
> > >    * add optimization for low prio logical mode with one cpu as dst (it
> > >      happens)
> > > 
> > > I did not rewrote kvm_irq_delivery_to_apic_fast() MST way since my way
> > > looks cleaner to me.
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 64adb61..3ba8951 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
> > >  	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
> > >  };
> > >  
> > > +struct kvm_apic_map {
> > > +	struct rcu_head rcu;
> > > +	bool flat;
> > > +	u8 ldr_bits;
> > > +	struct kvm_lapic *phys_map[256];
> > > +	struct kvm_lapic *logical_map[16][16];
> > > +};
> > > +
> > >  struct kvm_arch {
> > >  	unsigned int n_used_mmu_pages;
> > >  	unsigned int n_requested_mmu_pages;
> > > @@ -528,6 +536,8 @@ struct kvm_arch {
> > >  	struct kvm_ioapic *vioapic;
> > >  	struct kvm_pit *vpit;
> > >  	int vapics_in_nmi_mode;
> > > +	struct kvm_apic_map *apic_map;
> > > +	struct mutex apic_map_lock;
> > >  
> > >  	unsigned int tss_addr;
> > >  	struct page *apic_access_page;
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index 07ad628..06672e8 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -139,11 +139,92 @@ static inline int apic_enabled(struct kvm_lapic *apic)
> > >  	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
> > >  	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
> > >  
> > > +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > +{
> > > +	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > +}
> > > +
> > >  static inline int kvm_apic_id(struct kvm_lapic *apic)
> > >  {
> > >  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> > >  }
> > >  
> > > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > > +		u16 *cid, u16 *lid)
> > > +{
> > > +	if (ldr_bits == 32) {
> > > +		*cid = ldr >> 16;
> > > +		*lid = ldr & 0xffff;
> > > +	} else {
> > > +		ldr = GET_APIC_LOGICAL_ID(ldr);
> > > +
> > > +		if (flat) {
> > > +			*cid = 0;
> > > +			*lid = ldr;
> > > +		} else {
> > > +			*cid = ldr >> 4;
> > > +			*lid = ldr & 0xf;
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static inline void recalculate_apic_map(struct kvm *kvm)
> > > +{
> > > +	struct kvm_apic_map *new, *old = NULL;
> > > +	struct kvm_vcpu *vcpu;
> > > +	int i;
> > > +
> > > +	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > > +
> > > +	mutex_lock(&kvm->arch.apic_map_lock);
> > > +
> > > +	if (!new)
> > > +		goto out;
> > > +
> > > +	new->ldr_bits = 8;
> > > +	new->flat = true;
> > > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > > +		u16 cid, lid;
> > > +		struct kvm_lapic *apic = vcpu->arch.apic;
> > > +
> > > +		if (!kvm_apic_present(vcpu))
> > > +			continue;
> > > +
> > > +		if (apic_x2apic_mode(apic)) {
> > > +			new->ldr_bits = 32;
> > > +			new->flat = false;
> > > +		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > > +				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
> > > +			new->flat = false;
> > > +
> > > +		new->phys_map[kvm_apic_id(apic)] = apic;
> > > +		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > > +				new->flat, new->ldr_bits, &cid, &lid);
> > > +
> > > +		if (lid)
> > > +			new->logical_map[cid][ffs(lid) - 1] = apic;
> > > +	}
> > > +out:
> > > +	old = kvm->arch.apic_map;
> > > +	rcu_assign_pointer(kvm->arch.apic_map, new);
> > > +	mutex_unlock(&kvm->arch.apic_map_lock);
> > > +
> > > +	if (old)
> > > +		kfree_rcu(old, rcu);
> > > +}
> > 
> > Paul, I'd like to check something with you here:
> > this function can be triggered by userspace,
> > any number of times; we allocate
> > a 2K chunk of memory that is later freed by
> > kfree_rcu.
> > 
> > Is there a risk of DOS if RCU is delayed while
> > lots of memory is queued up in this way?
> > If yes is this a generic problem with kfree_rcu
> > that should be addressed in core kernel?
> 
> There is indeed a risk.  The kfree_rcu() implementation cannot really
> decide what to do here, especially given that it is callable with irqs
> disabled.
> 
> The usual approach is to keep a per-CPU counter and count it down from
> some number for each kfree_rcu().  When it reaches zero, invoke
> synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> "some number" mentioned above.
> 
This is simple to do (see the sketch below), but is it so much different from
just calling synchronize_rcu(); kfree(p) every time instead of kfree_rcu()? The
point of call_rcu(), as I see it, is to continue doing whatever we were doing
instead of sleeping for an undefined period. If code can handle sleeping
once in a while, it can handle it each and every time too.

I looked through a couple of users of kfree_rcu/call_rcu and haven't found
one that uses the counter approach. Maybe I missed it, or I only looked at
those that do not need it.
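
For reference, my understanding of the suggested countdown is roughly this
(sketch, illustrative names; picking the threshold is the open question):

#define KFREE_RCU_BUDGET	1000	/* the "some number" above */

static DEFINE_PER_CPU(int, kfree_rcu_budget) = KFREE_RCU_BUDGET;

static void throttled_free_apic_map(struct kvm_apic_map *map)
{
	if (this_cpu_dec_return(kfree_rcu_budget) <= 0) {
		this_cpu_write(kfree_rcu_budget, KFREE_RCU_BUDGET);
		/* sleepable context only; fine here, recalculate_apic_map()
		 * already holds a mutex */
		synchronize_rcu();
	}
	kfree_rcu(map, rcu);
}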

> In theory, I could create an API that did this.  In practice, I have no
> idea how to choose the number -- much depends on the size of the object
> being freed, for example.
> 
> The RCU usage looks otherwise sane at first glance.
> 
Thanks!

> 							Thanx, Paul
> 
> > Thanks!
> > 
> > > +
> > > +static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > > +{
> > > +	apic_set_reg(apic, APIC_ID, id << 24);
> > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > > +static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > > +{
> > > +	apic_set_reg(apic, APIC_LDR, id);
> > > +	recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > >  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> > >  {
> > >  	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
> > > @@ -193,11 +274,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
> > >  	apic_set_reg(apic, APIC_LVR, v);
> > >  }
> > >  
> > > -static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > > -{
> > > -	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > > -}
> > > -
> > >  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
> > >  	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
> > >  	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
> > > @@ -477,6 +553,79 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> > >  	return result;
> > >  }
> > >  
> > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > +		struct kvm_lapic_irq *irq, int *r)
> > > +{
> > > +	struct kvm_apic_map *map;
> > > +	unsigned long bitmap = 1;
> > > +	struct kvm_lapic **dst;
> > > +	int i;
> > > +
> > > +	*r = -1;
> > > +
> > > +	if (irq->shorthand == APIC_DEST_SELF) {
> > > +		*r = kvm_apic_set_irq(src->vcpu, irq);
> > > +		return true;
> > > +	}
> > > +
> > > +	if (irq->shorthand)
> > > +		return false;
> > > +
> > > +	rcu_read_lock();
> > > +	map = rcu_dereference(kvm->arch.apic_map);
> > > +
> > > +	if (!map) {
> > > +		rcu_read_unlock();
> > > +		return false;
> > > +	}
> > > +
> > > +	if (irq->dest_mode == 0) { /* physical mode */
> > > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > > +				irq->dest_id == 0xff) {
> > > +			rcu_read_unlock();
> > > +			return false;
> > > +		}
> > > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > > +	} else {
> > > +		u16 cid, lid;
> > > +		u32 mda = irq->dest_id;
> > > +
> > > +		if (map->ldr_bits == 8)
> > > +			mda <<= 24;
> > > +
> > > +		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
> > > +				&cid, &lid);
> > > +		dst = map->logical_map[cid];
> > > +
> > > +		bitmap = lid;
> > > +		if (irq->delivery_mode == APIC_DM_LOWEST &&
> > > +				hweight_long(bitmap) > 1) {
> > > +			int l = -1;
> > > +			for_each_set_bit(i, &bitmap, 16) {
> > > +				if (!dst[i])
> > > +					continue;
> > > +				if (l < 0)
> > > +					l = i;
> > > +				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > > +					l = i;
> > > +			}
> > > +
> > > +			bitmap = (l >= 0) ? 1 << l : 0;
> > > +		}
> > > +	}
> > > +
> > > +	for_each_set_bit(i, &bitmap, 16) {
> > > +		if (!dst[i])
> > > +			continue;
> > > +		if (*r < 0)
> > > +			*r = 0;
> > > +		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > > +	}
> > > +
> > > +	rcu_read_unlock();
> > > +	return true;
> > > +}
> > > +
> > >  /*
> > >   * Add a pending IRQ into lapic.
> > >   * Return 1 if successfully added and 0 if discarded.
> > > @@ -880,7 +1029,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > >  	switch (reg) {
> > >  	case APIC_ID:		/* Local APIC ID */
> > >  		if (!apic_x2apic_mode(apic))
> > > -			apic_set_reg(apic, APIC_ID, val);
> > > +			kvm_apic_set_id(apic, val >> 24);
> > >  		else
> > >  			ret = 1;
> > >  		break;
> > > @@ -896,15 +1045,16 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > >  
> > >  	case APIC_LDR:
> > >  		if (!apic_x2apic_mode(apic))
> > > -			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
> > > +			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
> > >  		else
> > >  			ret = 1;
> > >  		break;
> > >  
> > >  	case APIC_DFR:
> > > -		if (!apic_x2apic_mode(apic))
> > > +		if (!apic_x2apic_mode(apic)) {
> > >  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
> > > -		else
> > > +			recalculate_apic_map(apic->vcpu->kvm);
> > > +		} else
> > >  			ret = 1;
> > >  		break;
> > >  
> > > @@ -1135,6 +1285,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > >  			static_key_slow_dec_deferred(&apic_hw_disabled);
> > >  		else
> > >  			static_key_slow_inc(&apic_hw_disabled.key);
> > > +		recalculate_apic_map(vcpu->kvm);
> > >  	}
> > >  
> > >  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> > > @@ -1144,7 +1295,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > >  	if (apic_x2apic_mode(apic)) {
> > >  		u32 id = kvm_apic_id(apic);
> > >  		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
> > > -		apic_set_reg(apic, APIC_LDR, ldr);
> > > +		kvm_apic_set_ldr(apic, ldr);
> > >  	}
> > >  	apic->base_address = apic->vcpu->arch.apic_base &
> > >  			     MSR_IA32_APICBASE_BASE;
> > > @@ -1169,7 +1320,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > >  	/* Stop the timer in case it's a reset to an active apic */
> > >  	hrtimer_cancel(&apic->lapic_timer.timer);
> > >  
> > > -	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
> > > +	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
> > >  	kvm_apic_set_version(apic->vcpu);
> > >  
> > >  	for (i = 0; i < APIC_LVT_NUM; i++)
> > > @@ -1180,7 +1331,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
> > >  	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
> > >  	apic_set_spiv(apic, 0xff);
> > >  	apic_set_reg(apic, APIC_TASKPRI, 0);
> > > -	apic_set_reg(apic, APIC_LDR, 0);
> > > +	kvm_apic_set_ldr(apic, 0);
> > >  	apic_set_reg(apic, APIC_ESR, 0);
> > >  	apic_set_reg(apic, APIC_ICR, 0);
> > >  	apic_set_reg(apic, APIC_ICR2, 0);
> > > @@ -1398,6 +1549,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > >  	/* set SPIV separately to get count of SW disabled APICs right */
> > >  	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
> > >  	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
> > > +	/* call kvm_apic_set_id() to put apic into apic_map */
> > > +	kvm_apic_set_id(apic, kvm_apic_id(apic));
> > >  	kvm_apic_set_version(vcpu);
> > >  
> > >  	apic_update_ppr(apic);
> > > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > > index 615a8b0..e5ebf9f 100644
> > > --- a/arch/x86/kvm/lapic.h
> > > +++ b/arch/x86/kvm/lapic.h
> > > @@ -52,6 +52,9 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
> > >  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> > >  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> > >  
> > > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > > +		struct kvm_lapic_irq *irq, int *r);
> > > +
> > >  u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
> > >  void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
> > >  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f91e2c9..db6db8f 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -6269,6 +6269,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> > >  	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
> > >  
> > >  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
> > > +	mutex_init(&kvm->arch.apic_map_lock);
> > >  
> > >  	return 0;
> > >  }
> > > @@ -6319,6 +6320,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> > >  		put_page(kvm->arch.apic_access_page);
> > >  	if (kvm->arch.ept_identity_pagetable)
> > >  		put_page(kvm->arch.ept_identity_pagetable);
> > > +	kfree(kvm->arch.apic_map);
> > >  }
> > >  
> > >  void kvm_arch_free_memslot(struct kvm_memory_slot *free,
> > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > index 7118be0..3ca89c4 100644
> > > --- a/virt/kvm/irq_comm.c
> > > +++ b/virt/kvm/irq_comm.c
> > > @@ -68,8 +68,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> > >  	struct kvm_vcpu *vcpu, *lowest = NULL;
> > >  
> > >  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> > > -			kvm_is_dm_lowest_prio(irq))
> > > +			kvm_is_dm_lowest_prio(irq)) {
> > >  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
> > > +		irq->delivery_mode = APIC_DM_FIXED;
> > > +	}
> > > +
> > > +	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
> > > +		return r;
> > >  
> > >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> > >  		if (!kvm_apic_present(vcpu))
> > > --
> > > 			Gleb.
> > 

--
			Gleb.
Gleb Natapov Sept. 12, 2012, 12:34 p.m. UTC | #14
On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> >> > > Paul, I'd like to check something with you here:
> >> > > this function can be triggered by userspace,
> >> > > any number of times; we allocate
> >> > > a 2K chunk of memory that is later freed by
> >> > > kfree_rcu.
> >> > > 
> >> > > Is there a risk of DOS if RCU is delayed while
> >> > > lots of memory is queued up in this way?
> >> > > If yes is this a generic problem with kfree_rcu
> >> > > that should be addressed in core kernel?
> >> > 
> >> > There is indeed a risk.
> >> 
> >> In our case it's a 2K object. Is it a practical risk?
> > 
> > How many kfree_rcu()s per second can a given user cause to happen?
> 
> Not much more than a few hundred thousand per second per process (normal
> operation is zero).
> 
I managed to do 21466 per second.

> > 
> >> > The kfree_rcu() implementation cannot really
> >> > decide what to do here, especially given that it is callable with irqs
> >> > disabled.
> >> > 
> >> > The usual approach is to keep a per-CPU counter and count it down from
> >> > some number for each kfree_rcu().  When it reaches zero, invoke
> >> > synchronize_rcu() as well as kfree_rcu(), and then reset it to the
> >> > "some number" mentioned above.
> >> 
> >> It is a bit of a concern for me that this will hurt worst-case latency
> >> for realtime guests.  In our case, we return error and this will
> >> fall back on not allocating memory and using slow all-CPU scan.
> >> One possible scheme that relies on this is:
> >> 	- increment an atomic counter, per vcpu. If above threshold ->
> >> 		return with error
> >> 	- call_rcu (+ barrier vcpu destruct)
> >> 	- within callback decrement an atomic counter
> > 
> > That certainly is a possibility, but...
> > 
> >> > In theory, I could create an API that did this.  In practice, I have no
> >> > idea how to choose the number -- much depends on the size of the object
> >> > being freed, for example.
> >> 
> >> We could pass an object size, no problem :)
> > 
> > ... before putting too much additional effort into possible solutions,
> > why not force the problem to occur and see what actually happens?  We
> > would then be in a much better position to work out what should be done.
> 
> Good idea.  Michael, is should be easy to modify kvm-unit-tests to write
> to the APIC ID register in a loop.
> 
I did. Memory consumption does not grow on an otherwise idle host.

--
			Gleb.
Gleb Natapov Sept. 12, 2012, 12:44 p.m. UTC | #15
On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> >> >> > > Paul, I'd like to check something with you here:
> >> >> > > this function can be triggered by userspace,
> >> >> > > any number of times; we allocate
> >> >> > > a 2K chunk of memory that is later freed by
> >> >> > > kfree_rcu.
> >> >> > > 
> >> >> > > Is there a risk of DOS if RCU is delayed while
> >> >> > > lots of memory is queued up in this way?
> >> >> > > If yes is this a generic problem with kfree_rcu
> >> >> > > that should be addressed in core kernel?
> >> >> > 
> >> >> > There is indeed a risk.
> >> >> 
> >> >> In our case it's a 2K object. Is it a practical risk?
> >> > 
> >> > How many kfree_rcu()s per second can a given user cause to happen?
> >> 
> >> Not much more than a few hundred thousand per second per process (normal
> >> operation is zero).
> >> 
> > I managed to do 21466 per second.
> 
> Strange, why so slow?
> 
Because the ftrace buffer overflows :) With a bigger buffer I get 169940.

> >> Good idea.  Michael, is should be easy to modify kvm-unit-tests to write
> >> to the APIC ID register in a loop.
> >> 
> > I did. Memory consumption does not grow on otherwise idle host.
> 
> Ok, thanks.
> 
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.
Paul E. McKenney Sept. 12, 2012, 3:13 p.m. UTC | #16
On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > >> >> > > Paul, I'd like to check something with you here:
> > >> >> > > this function can be triggered by userspace,
> > >> >> > > any number of times; we allocate
> > >> >> > > a 2K chunk of memory that is later freed by
> > >> >> > > kfree_rcu.
> > >> >> > > 
> > >> >> > > Is there a risk of DOS if RCU is delayed while
> > >> >> > > lots of memory is queued up in this way?
> > >> >> > > If yes is this a generic problem with kfree_rcu
> > >> >> > > that should be addressed in core kernel?
> > >> >> > 
> > >> >> > There is indeed a risk.
> > >> >> 
> > >> >> In our case it's a 2K object. Is it a practical risk?
> > >> > 
> > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > >> 
> > >> Not much more than a few hundred thousand per second per process (normal
> > >> operation is zero).
> > >> 
> > > I managed to do 21466 per second.
> > 
> > Strange, why so slow?
> > 
> Because ftrace buffer overflows :) With bigger buffer I get 169940.

Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
a tight loop, you could probably do in excess of 100M per CPU per second.
Now -that- might be a problem.

Well, it -might- be a problem if you somehow figured out how to allocate
memory that quickly in a steady-state manner.  ;-)

> > >> Good idea.  Michael, is should be easy to modify kvm-unit-tests to write
> > >> to the APIC ID register in a loop.
> > >> 
> > > I did. Memory consumption does not grow on otherwise idle host.

Very good -- the checks in __call_rcu(), which is common code invoked by
kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
counter, which can be adjusted via rcutree.blimit, which defaults
to taking evasive action if more than 10K callbacks are waiting on a
given CPU.

My concern was that you might be overrunning that limit in way less
than a grace period (as in about a hundred microseconds).  My concern
was of course unfounded -- it takes several grace periods to push 10K
callbacks through.
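
(Rough back-of-the-envelope check: at the ~170K kfree_rcu()s per second
measured above, accumulating 10K callbacks on a CPU takes about
10,000 / 170,000 ~= 60 ms, i.e. several typical grace periods, and only
around 20 MB worth of 2K objects would be queued at that point.)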

							Thanx, Paul

> > Ok, thanks.
> > 
> > 
> > -- 
> > error compiling committee.c: too many arguments to function
> 
> --
> 			Gleb.
> 

Michael S. Tsirkin Nov. 26, 2013, 4:24 p.m. UTC | #17
On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > >> >> > > Paul, I'd like to check something with you here:
> > > >> >> > > this function can be triggered by userspace,
> > > >> >> > > any number of times; we allocate
> > > >> >> > > a 2K chunk of memory that is later freed by
> > > >> >> > > kfree_rcu.
> > > >> >> > > 
> > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > >> >> > > lots of memory is queued up in this way?
> > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > >> >> > > that should be addressed in core kernel?
> > > >> >> > 
> > > >> >> > There is indeed a risk.
> > > >> >> 
> > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > >> > 
> > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > >> 
> > > >> Not much more than a few hundred thousand per second per process (normal
> > > >> operation is zero).
> > > >> 
> > > > I managed to do 21466 per second.
> > > 
> > > Strange, why so slow?
> > > 
> > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> 
> Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> a tight loop, you could probably do in excess of 100M per CPU per second.
> Now -that- might be a problem.
> 
> Well, it -might- be a problem if you somehow figured out how to allocate
> memory that quickly in a steady-state manner.  ;-)
> 
> > > >> Good idea.  Michael, is should be easy to modify kvm-unit-tests to write
> > > >> to the APIC ID register in a loop.
> > > >> 
> > > > I did. Memory consumption does not grow on otherwise idle host.
> 
> Very good -- the checks in __call_rcu(), which is common code invoked by
> kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> counter, which can be adjusted via rcutree.blimit, which defaults
> to taking evasive action if more than 10K callbacks are waiting on a
> given CPU.
> 
> My concern was that you might be overrunning that limit in way less
> than a grace period (as in about a hundred microseconds.  My concern
> was of course unfounded -- you take several grace periods in push 10K
> callbacks through.
> 
> 							Thanx, Paul

Gleb noted that Documentation/RCU/checklist.txt has this text:

        An especially important property of the synchronize_rcu()
        primitive is that it automatically self-limits: if grace periods
        are delayed for whatever reason, then the synchronize_rcu()
        primitive will correspondingly delay updates.  In contrast,
        code using call_rcu() should explicitly limit update rate in
        cases where grace periods are delayed, as failing to do so can
        result in excessive realtime latencies or even OOM conditions.

If call_rcu is self-limiting, maybe this should be documented ...

> > > Ok, thanks.
> > > 
> > > 
> > > -- 
> > > error compiling committee.c: too many arguments to function
> > 
> > --
> > 			Gleb.
> > 
Paul E. McKenney Nov. 26, 2013, 7:35 p.m. UTC | #18
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote:
> On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > > >> >> > > Paul, I'd like to check something with you here:
> > > > >> >> > > this function can be triggered by userspace,
> > > > >> >> > > any number of times; we allocate
> > > > >> >> > > a 2K chunk of memory that is later freed by
> > > > >> >> > > kfree_rcu.
> > > > >> >> > > 
> > > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > > >> >> > > lots of memory is queued up in this way?
> > > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > > >> >> > > that should be addressed in core kernel?
> > > > >> >> > 
> > > > >> >> > There is indeed a risk.
> > > > >> >> 
> > > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > > >> > 
> > > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > > >> 
> > > > >> Not much more than a few hundred thousand per second per process (normal
> > > > >> operation is zero).
> > > > >> 
> > > > > I managed to do 21466 per second.
> > > > 
> > > > Strange, why so slow?
> > > > 
> > > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> > 
> > Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> > a tight loop, you could probably do in excess of 100M per CPU per second.
> > Now -that- might be a problem.
> > 
> > Well, it -might- be a problem if you somehow figured out how to allocate
> > memory that quickly in a steady-state manner.  ;-)
> > 
> > > > >> Good idea.  Michael, is should be easy to modify kvm-unit-tests to write
> > > > >> to the APIC ID register in a loop.
> > > > >> 
> > > > > I did. Memory consumption does not grow on otherwise idle host.
> > 
> > Very good -- the checks in __call_rcu(), which is common code invoked by
> > kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> > counter, which can be adjusted via rcutree.blimit, which defaults
> > to taking evasive action if more than 10K callbacks are waiting on a
> > given CPU.
> > 
> > My concern was that you might be overrunning that limit in way less
> > than a grace period (as in about a hundred microseconds.  My concern
> > was of course unfounded -- you take several grace periods in push 10K
> > callbacks through.
> > 
> > 							Thanx, Paul
> 
> Gleb noted that Documentation/RCU/checklist.txt has this text:
> 
>         An especially important property of the synchronize_rcu()
>         primitive is that it automatically self-limits: if grace periods
>         are delayed for whatever reason, then the synchronize_rcu()
>         primitive will correspondingly delay updates.  In contrast,
>         code using call_rcu() should explicitly limit update rate in
>         cases where grace periods are delayed, as failing to do so can
>         result in excessive realtime latencies or even OOM conditions.
> 
> If call_rcu is self-limiting maybe this should be documented ...

It would be more accurate to say that it takes some measures to limit
the damage -- you can overwhelm these measures if you try hard enough.

And I guess I could say something to that effect.  ;-)

							Thanx, Paul

Marcelo Tosatti Nov. 26, 2013, 8:07 p.m. UTC | #19
On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote:
> On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > > >> >> > > Paul, I'd like to check something with you here:
> > > > >> >> > > this function can be triggered by userspace,
> > > > >> >> > > any number of times; we allocate
> > > > >> >> > > a 2K chunk of memory that is later freed by
> > > > >> >> > > kfree_rcu.
> > > > >> >> > > 
> > > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > > >> >> > > lots of memory is queued up in this way?
> > > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > > >> >> > > that should be addressed in core kernel?
> > > > >> >> > 
> > > > >> >> > There is indeed a risk.
> > > > >> >> 
> > > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > > >> > 
> > > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > > >> 
> > > > >> Not much more than a few hundred thousand per second per process (normal
> > > > >> operation is zero).
> > > > >> 
> > > > > I managed to do 21466 per second.
> > > > 
> > > > Strange, why so slow?
> > > > 
> > > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> > 
> > Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> > a tight loop, you could probably do in excess of 100M per CPU per second.
> > Now -that- might be a problem.
> > 
> > Well, it -might- be a problem if you somehow figured out how to allocate
> > memory that quickly in a steady-state manner.  ;-)
> > 
> > > > >> Good idea.  Michael, it should be easy to modify kvm-unit-tests to write
> > > > >> to the APIC ID register in a loop.
> > > > >> 
> > > > > I did. Memory consumption does not grow on otherwise idle host.
> > 
> > Very good -- the checks in __call_rcu(), which is common code invoked by
> > kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> > counter, which can be adjusted via rcutree.blimit, which defaults
> > to taking evasive action if more than 10K callbacks are waiting on a
> > given CPU.
> > 
> > My concern was that you might be overrunning that limit in way less
> > than a grace period (as in about a hundred microseconds).  My concern
> > was of course unfounded -- you take several grace periods to push 10K
> > callbacks through.
> > 
> > 							Thanx, Paul
> 
> Gleb noted that Documentation/RCU/checklist.txt has this text:
> 
>         An especially important property of the synchronize_rcu()
>         primitive is that it automatically self-limits: if grace periods
>         are delayed for whatever reason, then the synchronize_rcu()
>         primitive will correspondingly delay updates.  In contrast,
>         code using call_rcu() should explicitly limit update rate in
>         cases where grace periods are delayed, as failing to do so can
>         result in excessive realtime latencies or even OOM conditions.
> 
> If call_rcu is self-limiting maybe this should be documented ...

The documentation should be fixed, rather, to not mention that
call_rcu() must be rate-limited by the user.

Gleb Natapov Nov. 27, 2013, 8 a.m. UTC | #20
On Tue, Nov 26, 2013 at 11:35:06AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> > > On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > > > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > > > >> >> > > Paul, I'd like to check something with you here:
> > > > > >> >> > > this function can be triggered by userspace,
> > > > > >> >> > > any number of times; we allocate
> > > > > >> >> > > a 2K chunk of memory that is later freed by
> > > > > >> >> > > kfree_rcu.
> > > > > >> >> > > 
> > > > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > > > >> >> > > lots of memory is queued up in this way?
> > > > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > > > >> >> > > that should be addressed in core kernel?
> > > > > >> >> > 
> > > > > >> >> > There is indeed a risk.
> > > > > >> >> 
> > > > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > > > >> > 
> > > > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > > > >> 
> > > > > >> Not much more than a few hundred thousand per second per process (normal
> > > > > >> operation is zero).
> > > > > >> 
> > > > > > I managed to do 21466 per second.
> > > > > 
> > > > > Strange, why so slow?
> > > > > 
> > > > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> > > 
> > > Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> > > a tight loop, you could probably do in excess of 100M per CPU per second.
> > > Now -that- might be a problem.
> > > 
> > > Well, it -might- be a problem if you somehow figured out how to allocate
> > > memory that quickly in a steady-state manner.  ;-)
> > > 
> > > > > >> Good idea.  Michael, it should be easy to modify kvm-unit-tests to write
> > > > > >> to the APIC ID register in a loop.
> > > > > >> 
> > > > > > I did. Memory consumption does not grow on otherwise idle host.
> > > 
> > > Very good -- the checks in __call_rcu(), which is common code invoked by
> > > kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> > > counter, which can be adjusted via rcutree.blimit, which defaults
> > > to taking evasive action if more than 10K callbacks are waiting on a
> > > given CPU.
> > > 
> > > My concern was that you might be overrunning that limit in way less
> > > than a grace period (as in about a hundred microseconds).  My concern
> > > was of course unfounded -- you take several grace periods to push 10K
> > > callbacks through.
> > > 
> > > 							Thanx, Paul
> > 
> > Gleb noted that Documentation/RCU/checklist.txt has this text:
> > 
> >         An especially important property of the synchronize_rcu()
> >         primitive is that it automatically self-limits: if grace periods
> >         are delayed for whatever reason, then the synchronize_rcu()
> >         primitive will correspondingly delay updates.  In contrast,
> >         code using call_rcu() should explicitly limit update rate in
> >         cases where grace periods are delayed, as failing to do so can
> >         result in excessive realtime latencies or even OOM conditions.
> > 
> > If call_rcu is self-limiting maybe this should be documented ...
> 
> It would be more accurate to say that it takes some measures to limit
> the damage -- you can overwhelm these measures if you try hard enough.
> 
The question is: Is it safe to have a call_rcu() without any additional rate limiting
on a user-triggerable code path?
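
For concreteness, one such user-triggerable path here is the one the
kvm-unit-tests experiment above exercised: every guest write to the
local APIC ID register ends up in recalculate_apic_map(), which
allocates a fresh map and queues the previous one for kfree_rcu().  A
guest-side loop to drive it could look roughly like this -- a sketch
only, not the actual test that was run, and it assumes the default
xAPIC base of 0xfee00000 with a suitable guest mapping:

/*
 * Sketch of a guest-side trigger loop.  Each MMIO write to the APIC ID
 * register (offset 0x20, ID in bits 31:24) makes the host rebuild its
 * apic_map and hand the old one to kfree_rcu().
 */
#include <stdint.h>

#define APIC_BASE	0xfee00000UL
#define APIC_ID_REG	0x20

static inline void apic_id_write(uint32_t id)
{
	*(volatile uint32_t *)(APIC_BASE + APIC_ID_REG) = id << 24;
}

void hammer_apic_id(void)
{
	for (;;) {
		apic_id_write(1);
		apic_id_write(2);
	}
}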

> And I guess I could say something to that effect.  ;-)
> 
> 							Thanx, Paul

--
			Gleb.
Paul E. McKenney Nov. 27, 2013, 5:06 p.m. UTC | #21
On Wed, Nov 27, 2013 at 10:00:09AM +0200, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 11:35:06AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > > > > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > > > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > > > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > > > > >> >> > > Paul, I'd like to check something with you here:
> > > > > > >> >> > > this function can be triggered by userspace,
> > > > > > >> >> > > any number of times; we allocate
> > > > > > >> >> > > a 2K chunk of memory that is later freed by
> > > > > > >> >> > > kfree_rcu.
> > > > > > >> >> > > 
> > > > > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > > > > >> >> > > lots of memory is queued up in this way?
> > > > > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > > > > >> >> > > that should be addressed in core kernel?
> > > > > > >> >> > 
> > > > > > >> >> > There is indeed a risk.
> > > > > > >> >> 
> > > > > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > > > > >> > 
> > > > > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > > > > >> 
> > > > > > >> Not much more than a few hundred thousand per second per process (normal
> > > > > > >> operation is zero).
> > > > > > >> 
> > > > > > > I managed to do 21466 per second.
> > > > > > 
> > > > > > Strange, why so slow?
> > > > > > 
> > > > > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> > > > 
> > > > Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> > > > a tight loop, you could probably do in excess of 100M per CPU per second.
> > > > Now -that- might be a problem.
> > > > 
> > > > Well, it -might- be a problem if you somehow figured out how to allocate
> > > > memory that quickly in a steady-state manner.  ;-)
> > > > 
> > > > > > >> Good idea.  Michael, it should be easy to modify kvm-unit-tests to write
> > > > > > >> to the APIC ID register in a loop.
> > > > > > >> 
> > > > > > > I did. Memory consumption does not grow on otherwise idle host.
> > > > 
> > > > Very good -- the checks in __call_rcu(), which is common code invoked by
> > > > kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> > > > counter, which can be adjusted via rcutree.blimit, which defaults
> > > > to taking evasive action if more than 10K callbacks are waiting on a
> > > > given CPU.
> > > > 
> > > > My concern was that you might be overrunning that limit in way less
> > > > than a grace period (as in about a hundred microseconds).  My concern
> > > > was of course unfounded -- you take several grace periods to push 10K
> > > > callbacks through.
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > Gleb noted that Documentation/RCU/checklist.txt has this text:
> > > 
> > >         An especially important property of the synchronize_rcu()
> > >         primitive is that it automatically self-limits: if grace periods
> > >         are delayed for whatever reason, then the synchronize_rcu()
> > >         primitive will correspondingly delay updates.  In contrast,
> > >         code using call_rcu() should explicitly limit update rate in
> > >         cases where grace periods are delayed, as failing to do so can
> > >         result in excessive realtime latencies or even OOM conditions.
> > > 
> > > If call_rcu is self-limiting maybe this should be documented ...
> > 
> > It would be more accurate to say that it takes some measures to limit
> > the damage -- you can overwhelm these measures if you try hard enough.
> > 
> The question is: Is it safe to have a call_rcu() without any additional rate limiting
> on a user-triggerable code path?

That would be a good way to allow users to run your system out of memory,
especially on systems with limited memory.  (If you have several GB of
free space, you might be OK.)
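
As a rough back-of-the-envelope check using the numbers from earlier in
the thread (a ~2K object per update, and __call_rcu() starting to take
evasive action at around 10K queued callbacks per CPU): 2 KB * 10,000
is on the order of 20 MB of queued maps per CPU before the limiting
kicks in.  That is noise on a host with several GB free, but it is the
kind of figure to keep in mind on a small, memory-constrained host.
Treat it strictly as an order-of-magnitude estimate, since the exact
behaviour once the limit is hit depends on RCU internals.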

							Thanx, Paul

> > And I guess I could say something to that effect.  ;-)
> > 
> > 							Thanx, Paul
> 
> --
> 			Gleb.
> 

Gleb Natapov Nov. 28, 2013, 8:55 a.m. UTC | #22
On Wed, Nov 27, 2013 at 09:06:36AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 27, 2013 at 10:00:09AM +0200, Gleb Natapov wrote:
> > On Tue, Nov 26, 2013 at 11:35:06AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 26, 2013 at 06:24:13PM +0200, Michael S. Tsirkin wrote:
> > > > On Wed, Sep 12, 2012 at 08:13:54AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Sep 12, 2012 at 03:44:26PM +0300, Gleb Natapov wrote:
> > > > > > On Wed, Sep 12, 2012 at 03:36:57PM +0300, Avi Kivity wrote:
> > > > > > > On 09/12/2012 03:34 PM, Gleb Natapov wrote:
> > > > > > > > On Wed, Sep 12, 2012 at 10:45:22AM +0300, Avi Kivity wrote:
> > > > > > > >> On 09/12/2012 04:03 AM, Paul E. McKenney wrote:
> > > > > > > >> >> > > Paul, I'd like to check something with you here:
> > > > > > > >> >> > > this function can be triggered by userspace,
> > > > > > > >> >> > > any number of times; we allocate
> > > > > > > >> >> > > a 2K chunk of memory that is later freed by
> > > > > > > >> >> > > kfree_rcu.
> > > > > > > >> >> > > 
> > > > > > > >> >> > > Is there a risk of DOS if RCU is delayed while
> > > > > > > >> >> > > lots of memory is queued up in this way?
> > > > > > > >> >> > > If yes is this a generic problem with kfree_rcu
> > > > > > > >> >> > > that should be addressed in core kernel?
> > > > > > > >> >> > 
> > > > > > > >> >> > There is indeed a risk.
> > > > > > > >> >> 
> > > > > > > >> >> In our case it's a 2K object. Is it a practical risk?
> > > > > > > >> > 
> > > > > > > >> > How many kfree_rcu()s per second can a given user cause to happen?
> > > > > > > >> 
> > > > > > > >> Not much more than a few hundred thousand per second per process (normal
> > > > > > > >> operation is zero).
> > > > > > > >> 
> > > > > > > > I managed to do 21466 per second.
> > > > > > > 
> > > > > > > Strange, why so slow?
> > > > > > > 
> > > > > > Because ftrace buffer overflows :) With bigger buffer I get 169940.
> > > > > 
> > > > > Ah, good, should not be a problem.  In contrast, if you ran kfree_rcu() in
> > > > > a tight loop, you could probably do in excess of 100M per CPU per second.
> > > > > Now -that- might be a problem.
> > > > > 
> > > > > Well, it -might- be a problem if you somehow figured out how to allocate
> > > > > memory that quickly in a steady-state manner.  ;-)
> > > > > 
> > > > > > > >> Good idea.  Michael, it should be easy to modify kvm-unit-tests to write
> > > > > > > >> to the APIC ID register in a loop.
> > > > > > > >> 
> > > > > > > > I did. Memory consumption does not grow on otherwise idle host.
> > > > > 
> > > > > Very good -- the checks in __call_rcu(), which is common code invoked by
> > > > > kfree_rcu(), seem to be doing their job, then.  These do keep a per-CPU
> > > > > counter, which can be adjusted via rcutree.blimit, which defaults
> > > > > to taking evasive action if more than 10K callbacks are waiting on a
> > > > > given CPU.
> > > > > 
> > > > > My concern was that you might be overrunning that limit in way less
> > > > > than a grace period (as in about a hundred microseconds).  My concern
> > > > > was of course unfounded -- you take several grace periods to push 10K
> > > > > callbacks through.
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > Gleb noted that Documentation/RCU/checklist.txt has this text:
> > > > 
> > > >         An especially important property of the synchronize_rcu()
> > > >         primitive is that it automatically self-limits: if grace periods
> > > >         are delayed for whatever reason, then the synchronize_rcu()
> > > >         primitive will correspondingly delay updates.  In contrast,
> > > >         code using call_rcu() should explicitly limit update rate in
> > > >         cases where grace periods are delayed, as failing to do so can
> > > >         result in excessive realtime latencies or even OOM conditions.
> > > > 
> > > > If call_rcu is self-limiting maybe this should be documented ...
> > > 
> > > It would be more accurate to say that it takes some measures to limit
> > > the damage -- you can overwhelm these measures if you try hard enough.
> > > 
> > The question is: Is it safe to have a call_rcu() without any additional rate limiting
> > on a user-triggerable code path?
> 
> That would be a good way to allow users to run your system out of memory,
> especially on systems with limited memory.  (If you have several GB of
> free space, you might be OK.)
> 
Thanks! Got it.

--
			Gleb.
diff mbox

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..3ba8951 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,14 @@  struct kvm_arch_memory_slot {
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+	struct rcu_head rcu;
+	bool flat;
+	u8 ldr_bits;
+	struct kvm_lapic *phys_map[256];
+	struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
 	unsigned int n_used_mmu_pages;
 	unsigned int n_requested_mmu_pages;
@@ -528,6 +536,8 @@  struct kvm_arch {
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
 	int vapics_in_nmi_mode;
+	struct kvm_apic_map *apic_map;
+	struct mutex apic_map_lock;
 
 	unsigned int tss_addr;
 	struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..06672e8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,92 @@  static inline int apic_enabled(struct kvm_lapic *apic)
 	(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 	 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
+		u16 *cid, u16 *lid)
+{
+	if (ldr_bits == 32) {
+		*cid = ldr >> 16;
+		*lid = ldr & 0xffff;
+	} else {
+		ldr = GET_APIC_LOGICAL_ID(ldr);
+
+		if (flat) {
+			*cid = 0;
+			*lid = ldr;
+		} else {
+			*cid = ldr >> 4;
+			*lid = ldr & 0xf;
+		}
+	}
+}
+
+static inline void recalculate_apic_map(struct kvm *kvm)
+{
+	struct kvm_apic_map *new, *old = NULL;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+	mutex_lock(&kvm->arch.apic_map_lock);
+
+	if (!new)
+		goto out;
+
+	new->ldr_bits = 8;
+	new->flat = true;
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		u16 cid, lid;
+		struct kvm_lapic *apic = vcpu->arch.apic;
+
+		if (!kvm_apic_present(vcpu))
+			continue;
+
+		if (apic_x2apic_mode(apic)) {
+			new->ldr_bits = 32;
+			new->flat = false;
+		} else if (kvm_apic_sw_enabled(apic) && new->flat &&
+				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
+			new->flat = false;
+
+		new->phys_map[kvm_apic_id(apic)] = apic;
+		kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
+				new->flat, new->ldr_bits, &cid, &lid);
+
+		if (lid)
+			new->logical_map[cid][ffs(lid) - 1] = apic;
+	}
+out:
+	old = kvm->arch.apic_map;
+	rcu_assign_pointer(kvm->arch.apic_map, new);
+	mutex_unlock(&kvm->arch.apic_map_lock);
+
+	if (old)
+		kfree_rcu(old, rcu);
+}
+
+static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
+{
+	apic_set_reg(apic, APIC_ID, id << 24);
+	recalculate_apic_map(apic->vcpu->kvm);
+}
+
+static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
+{
+	apic_set_reg(apic, APIC_LDR, id);
+	recalculate_apic_map(apic->vcpu->kvm);
+}
+
 static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
 {
 	return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
@@ -193,11 +274,6 @@  void kvm_apic_set_version(struct kvm_vcpu *vcpu)
 	apic_set_reg(apic, APIC_LVR, v);
 }
 
-static inline int apic_x2apic_mode(struct kvm_lapic *apic)
-{
-	return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
-}
-
 static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
 	LVT_MASK ,      /* part LVTT mask, timer mode mask added at runtime */
 	LVT_MASK | APIC_MODE_MASK,	/* LVTTHMR */
@@ -477,6 +553,79 @@  int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 	return result;
 }
 
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+		struct kvm_lapic_irq *irq, int *r)
+{
+	struct kvm_apic_map *map;
+	unsigned long bitmap = 1;
+	struct kvm_lapic **dst;
+	int i;
+
+	*r = -1;
+
+	if (irq->shorthand == APIC_DEST_SELF) {
+		*r = kvm_apic_set_irq(src->vcpu, irq);
+		return true;
+	}
+
+	if (irq->shorthand)
+		return false;
+
+	rcu_read_lock();
+	map = rcu_dereference(kvm->arch.apic_map);
+
+	if (!map) {
+		rcu_read_unlock();
+		return false;
+	}
+
+	if (irq->dest_mode == 0) { /* physical mode */
+		if (irq->delivery_mode == APIC_DM_LOWEST ||
+				irq->dest_id == 0xff) {
+			rcu_read_unlock();
+			return false;
+		}
+		dst = &map->phys_map[irq->dest_id & 0xff];
+	} else {
+		u16 cid, lid;
+		u32 mda = irq->dest_id;
+
+		if (map->ldr_bits == 8)
+			mda <<= 24;
+
+		kvm_apic_get_logical_id(mda, map->flat, map->ldr_bits,
+				&cid, &lid);
+		dst = map->logical_map[cid];
+
+		bitmap = lid;
+		if (irq->delivery_mode == APIC_DM_LOWEST &&
+				hweight_long(bitmap) > 1) {
+			int l = -1;
+			for_each_set_bit(i, &bitmap, 16) {
+				if (!dst[i])
+					continue;
+				if (l < 0)
+					l = i;
+				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
+					l = i;
+			}
+
+			bitmap = (l >= 0) ? 1 << l : 0;
+		}
+	}
+
+	for_each_set_bit(i, &bitmap, 16) {
+		if (!dst[i])
+			continue;
+		if (*r < 0)
+			*r = 0;
+		*r += kvm_apic_set_irq(dst[i]->vcpu, irq);
+	}
+
+	rcu_read_unlock();
+	return true;
+}
+
 /*
  * Add a pending IRQ into lapic.
  * Return 1 if successfully added and 0 if discarded.
@@ -880,7 +1029,7 @@  static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
 		if (!apic_x2apic_mode(apic))
-			apic_set_reg(apic, APIC_ID, val);
+			kvm_apic_set_id(apic, val >> 24);
 		else
 			ret = 1;
 		break;
@@ -896,15 +1045,16 @@  static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	case APIC_LDR:
 		if (!apic_x2apic_mode(apic))
-			apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+			kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
 		else
 			ret = 1;
 		break;
 
 	case APIC_DFR:
-		if (!apic_x2apic_mode(apic))
+		if (!apic_x2apic_mode(apic)) {
 			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
-		else
+			recalculate_apic_map(apic->vcpu->kvm);
+		} else
 			ret = 1;
 		break;
 
@@ -1135,6 +1285,7 @@  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 			static_key_slow_dec_deferred(&apic_hw_disabled);
 		else
 			static_key_slow_inc(&apic_hw_disabled.key);
+		recalculate_apic_map(vcpu->kvm);
 	}
 
 	if (!kvm_vcpu_is_bsp(apic->vcpu))
@@ -1144,7 +1295,7 @@  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 	if (apic_x2apic_mode(apic)) {
 		u32 id = kvm_apic_id(apic);
 		u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
-		apic_set_reg(apic, APIC_LDR, ldr);
+		kvm_apic_set_ldr(apic, ldr);
 	}
 	apic->base_address = apic->vcpu->arch.apic_base &
 			     MSR_IA32_APICBASE_BASE;
@@ -1169,7 +1320,7 @@  void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	/* Stop the timer in case it's a reset to an active apic */
 	hrtimer_cancel(&apic->lapic_timer.timer);
 
-	apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
+	kvm_apic_set_id(apic, (u8)vcpu->vcpu_id);
 	kvm_apic_set_version(apic->vcpu);
 
 	for (i = 0; i < APIC_LVT_NUM; i++)
@@ -1180,7 +1331,7 @@  void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	apic_set_reg(apic, APIC_DFR, 0xffffffffU);
 	apic_set_spiv(apic, 0xff);
 	apic_set_reg(apic, APIC_TASKPRI, 0);
-	apic_set_reg(apic, APIC_LDR, 0);
+	kvm_apic_set_ldr(apic, 0);
 	apic_set_reg(apic, APIC_ESR, 0);
 	apic_set_reg(apic, APIC_ICR, 0);
 	apic_set_reg(apic, APIC_ICR2, 0);
@@ -1398,6 +1549,8 @@  void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
 	/* set SPIV separately to get count of SW disabled APICs right */
 	apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
 	memcpy(vcpu->arch.apic->regs, s->regs, sizeof *s);
+	/* call kvm_apic_set_id() to put apic into apic_map */
+	kvm_apic_set_id(apic, kvm_apic_id(apic));
 	kvm_apic_set_version(vcpu);
 
 	apic_update_ppr(apic);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 615a8b0..e5ebf9f 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -52,6 +52,9 @@  int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
 int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
 
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+		struct kvm_lapic_irq *irq, int *r);
+
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
 void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
 void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f91e2c9..db6db8f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6269,6 +6269,7 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
+	mutex_init(&kvm->arch.apic_map_lock);
 
 	return 0;
 }
@@ -6319,6 +6320,7 @@  void kvm_arch_destroy_vm(struct kvm *kvm)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
 		put_page(kvm->arch.ept_identity_pagetable);
+	kfree(kvm->arch.apic_map);
 }
 
 void kvm_arch_free_memslot(struct kvm_memory_slot *free,
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 7118be0..3ca89c4 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -68,8 +68,13 @@  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
-			kvm_is_dm_lowest_prio(irq))
+			kvm_is_dm_lowest_prio(irq)) {
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
+		irq->delivery_mode = APIC_DM_FIXED;
+	}
+
+	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r))
+		return r;
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))