[PATCH/RFC] KVM: halt_polling: provide a way to qualify wakeups during poll

Message ID 1462185753-14634-1-git-send-email-borntraeger@de.ibm.com (mailing list archive)
State New, archived

Commit Message

Christian Borntraeger May 2, 2016, 10:42 a.m. UTC
Radim, Paolo,

can you have a look at this patch? If you are ok with it, I want to
submit this patch with my next s390 pull request. It touches KVM common
code, but I tried to make it a nop for everything but s390.

Christian

----snip----


Some wakeups should not be considered a successful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transaction-like workloads, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups as to
whether they should be considered good or bad in regard to polls.

For s390 the implementation will fence off halt polling for anything but
known good, single-vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU comes first. To limit the halt polling we only mark the
woken-up CPU as a valid poll. This code will also cover several other
wakeup reasons like IPIs or expired timers. This will of course also mark
some events as not successful. As KVM on z always runs as a 2nd-level
hypervisor, we prefer not to poll unless we are really sure, though.

So we start with a minimal set and will provide additional patches in
the future that mark additional code paths as valid wakeups, if that
turns out to be necessary.

This patch successfully limits the CPU usage for cases like a uperf 1-byte
transactional ping-pong workload or a wakeup-heavy workload like OLTP,
while still providing a proper speedup.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/s390/kvm/Kconfig     |  1 +
 arch/s390/kvm/interrupt.c |  8 ++++++++
 include/linux/kvm_host.h  | 34 ++++++++++++++++++++++++++++++++++
 virt/kvm/Kconfig          |  4 ++++
 virt/kvm/kvm_main.c       |  9 ++++++---
 5 files changed, 53 insertions(+), 3 deletions(-)
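
For orientation, a condensed sketch of the mechanism (distilled from the
full patch at the bottom of this page; all names below are taken from it):

	/* include/linux/kvm_host.h: these helpers are no-ops unless the
	 * architecture selects HAVE_KVM_INVALID_POLLS in its Kconfig */
	static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
	{
		vcpu->valid_wakeup = true;
	}

	/* arch/s390/kvm/interrupt.c: only wakeups that go through this
	 * function are treated as "good" polls */
	void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
	{
		vcpu_set_valid_wakeup(vcpu);
		/* ... existing wakeup logic ... */
	}

In kvm_vcpu_block(), growing/shrinking of halt_poll_ns then consults
vcpu_valid_wakeup(), and the flag is reset after every block.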

Comments

David Hildenbrand May 2, 2016, 10:45 a.m. UTC | #1
> Radim, Paolo,
> 
> can you have a look at this patch? If you are ok with it, I want to
> submit this patch with my next s390 pull request. It touches KVM common
> code, but I tried to make it a nop for everything but s390.
> 
> Christian
> 
> ----snip----
> 
> 
> Some wakeups should not be considered a successful poll. For example on
> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> would be considered runnable - letting all vCPUs poll all the time for
> transaction-like workloads, even if one vCPU would be enough.
> This can result in huge CPU usage for large guests.
> This patch lets architectures provide a way to qualify wakeups as to
> whether they should be considered good or bad in regard to polls.
> 
> For s390 the implementation will fence off halt polling for anything but
> known good, single-vCPU events. The s390 implementation for floating
> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> by whatever CPU comes first. To limit the halt polling we only mark the
> woken-up CPU as a valid poll. This code will also cover several other
> wakeup reasons like IPIs or expired timers. This will of course also mark
> some events as not successful. As KVM on z always runs as a 2nd-level
> hypervisor, we prefer not to poll unless we are really sure, though.
> 
> So we start with a minimal set and will provide additional patches in
> the future that mark additional code paths as valid wakeups, if that
> turns out to be necessary.
> 
> This patch successfully limits the CPU usage for cases like a uperf 1-byte
> transactional ping-pong workload or a wakeup-heavy workload like OLTP,
> while still providing a proper speedup.
> 
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  arch/s390/kvm/Kconfig     |  1 +
>  arch/s390/kvm/interrupt.c |  8 ++++++++
>  include/linux/kvm_host.h  | 34 ++++++++++++++++++++++++++++++++++
>  virt/kvm/Kconfig          |  4 ++++
>  virt/kvm/kvm_main.c       |  9 ++++++---
>  5 files changed, 53 insertions(+), 3 deletions(-)
> 

Acked-by: David Hildenbrand <dahi@linux.vnet.ibm.com>

David

Cornelia Huck May 2, 2016, 11:46 a.m. UTC | #2
On Mon,  2 May 2016 12:42:33 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:


> @@ -2038,14 +2039,16 @@ out:
>  		if (block_ns <= vcpu->halt_poll_ns)
>  			;
>  		/* we had a long block, shrink polling */
> -		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> +		else if (!vcpu_valid_wakeup(vcpu) ||
> +			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>  			shrink_halt_poll_ns(vcpu);
>  		/* we had a short halt and our poll time is too small */
>  		else if (vcpu->halt_poll_ns < halt_poll_ns &&
> -			block_ns < halt_poll_ns)
> +			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))

Isn't that vcpu_valid_wakeup() check superfluous, as we have collected
all !vcpu_valid_wakeup() cases in the previous if?

>  			grow_halt_poll_ns(vcpu);
>  	} else
>  		vcpu->halt_poll_ns = 0;
> +	vcpu_reset_wakeup(vcpu);
> 
>  	trace_kvm_vcpu_wakeup(block_ns, waited);
>  }

Christian Borntraeger May 2, 2016, 11:50 a.m. UTC | #3
On 05/02/2016 01:46 PM, Cornelia Huck wrote:
> On Mon,  2 May 2016 12:42:33 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> 
>> @@ -2038,14 +2039,16 @@ out:
>>  		if (block_ns <= vcpu->halt_poll_ns)
>>  			;
>>  		/* we had a long block, shrink polling */
>> -		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
>> +		else if (!vcpu_valid_wakeup(vcpu) ||
>> +			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>>  			shrink_halt_poll_ns(vcpu);
>>  		/* we had a short halt and our poll time is too small */
>>  		else if (vcpu->halt_poll_ns < halt_poll_ns &&
>> -			block_ns < halt_poll_ns)
>> +			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
> 
> Isn't that vcpu_valid_wakeup() check superfluous, as we have collected
> all !vcpu_valid_wakeup() cases in the previous if?

Yes, will fix in v2.


> 
>>  			grow_halt_poll_ns(vcpu);
>>  	} else
>>  		vcpu->halt_poll_ns = 0;
>> +	vcpu_reset_wakeup(vcpu);
>>
>>  	trace_kvm_vcpu_wakeup(block_ns, waited);
>>  }
> 

Radim Krčmář May 2, 2016, 1:34 p.m. UTC | #4
2016-05-02 12:42+0200, Christian Borntraeger:
> Radim, Paolo,
> 
> can you have a look at this patch? If you are ok with it, I want to
> submit this patch with my next s390 pull request. It touches KVM common
> code, but I tried to make it a nop for everything but s390.

(I have a few questions and will ack the solution if you stand behind it.)

> Christian
> 
> ----snip----
> 
> 
> Some wakeups should not be considered a successful poll. For example on
> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> would be considered runnable - letting all vCPUs poll all the time for
> transaction-like workloads, even if one vCPU would be enough.
> 
> This can result in huge CPU usage for large guests.
> This patch lets architectures provide a way to qualify wakeups as to
> whether they should be considered good or bad in regard to polls.
> 
> For s390 the implementation will fence off halt polling for anything but
> known good, single-vCPU events. The s390 implementation for floating
> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> by whatever CPU comes first. To limit the halt polling we only mark the
> woken-up CPU as a valid poll. This code will also cover several other
> wakeup reasons like IPIs or expired timers. This will of course also mark
> some events as not successful. As KVM on z always runs as a 2nd-level
> hypervisor, we prefer not to poll unless we are really sure, though.
> 
> So we start with a minimal set and will provide additional patches in
> the future that mark additional code paths as valid wakeups, if that
> turns out to be necessary.
> 
> This patch successfully limits the CPU usage for cases like a uperf 1-byte
> transactional ping-pong workload or a wakeup-heavy workload like OLTP,
> while still providing a proper speedup.
> 
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> @@ -976,6 +976,14 @@ no_timer:
>  
>  void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * This is outside of the if because we want to mark the wakeup
> +	 * as valid for vCPUs that
> +	 * a: do polling right now
> +	 * b: do sleep right now
> +	 * otherwise we would never grow the poll interval properly
> +	 */
> +	vcpu_set_valid_wakeup(vcpu);
>  	if (waitqueue_active(&vcpu->wq)) {

(Can't kvm_s390_vcpu_wakeup() be called when the vcpu isn't in
 kvm_vcpu_block()?  Either this condition is useless or we'd set
 vcpu_set_valid_wakeup() for any future wakeup.)

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> @@ -224,6 +224,7 @@ struct kvm_vcpu {
>  	sigset_t sigset;
>  	struct kvm_vcpu_stat stat;
>  	unsigned int halt_poll_ns;
> +	bool valid_wakeup;
>  
>  #ifdef CONFIG_HAS_IOMEM
>  	int mmio_needed;
> @@ -1178,4 +1179,37 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
>  				  uint32_t guest_irq, bool set);
>  #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
>  
> +#ifdef CONFIG_HAVE_KVM_INVALID_POLLS
> +/* If we wake up during the poll time, was it a successful poll? */
> +static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)

(smp barriers?)

> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> @@ -41,6 +41,10 @@ config KVM_VFIO
> +config HAVE_KVM_INVALID_POLLS
> +       bool
> +
> +

(One newline is enough.)

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>  			 * arrives.
>  			 */
>  			if (kvm_vcpu_check_block(vcpu) < 0) {
> -				++vcpu->stat.halt_successful_poll;
> +				if (vcpu_valid_wakeup(vcpu))
> +					++vcpu->stat.halt_successful_poll;

KVM didn't call schedule(), so it's still a successful poll, IMO, just
invalid.

>  				goto out;
>  			}
>  			cur = ktime_get();
> @@ -2038,14 +2039,16 @@ out:
>  		if (block_ns <= vcpu->halt_poll_ns)
>  			;
>  		/* we had a long block, shrink polling */
> -		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> +		else if (!vcpu_valid_wakeup(vcpu) ||
> +			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>  			shrink_halt_poll_ns(vcpu);

Is the shrinking important?

>  		/* we had a short halt and our poll time is too small */
>  		else if (vcpu->halt_poll_ns < halt_poll_ns &&
> -			block_ns < halt_poll_ns)
> +			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>  			grow_halt_poll_ns(vcpu);

IIUC, the problem comes from overgrown halt_poll_ns, so couldn't we just
ignore all invalid wakeups?

It would make more sense to me, because we are not interested in latency
of invalid wakeups, so they shouldn't affect valid ones.

>  	} else
>  		vcpu->halt_poll_ns = 0;
> +	vcpu_reset_wakeup(vcpu);
>  
>  	trace_kvm_vcpu_wakeup(block_ns, waited);

(Tracing valid/invalid wakeups could be useful.)

Thanks.
Christian Borntraeger May 2, 2016, 2:30 p.m. UTC | #5
On 05/02/2016 03:34 PM, Radim Krčmář wrote:
> 2016-05-02 12:42+0200, Christian Borntraeger:
>> Radim, Paolo,
>>
>> can you have a look at this patch? If you are ok with it, I want to
>> submit this patch with my next s390 pull request. It touches KVM common
>> code, but I tried to make it a nop for everything but s390.
> 
> (I have a few questions and will ack the solution if you stand behind it.)
> 
>> Christian
>>
>> ----snip----
>>
>>
>> Some wakeups should not be considered a successful poll. For example on
>> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
>> would be considered runnable - letting all vCPUs poll all the time for
>> transaction-like workloads, even if one vCPU would be enough.
>>
>> This can result in huge CPU usage for large guests.
>> This patch lets architectures provide a way to qualify wakeups as to
>> whether they should be considered good or bad in regard to polls.
>>
>> For s390 the implementation will fence off halt polling for anything but
>> known good, single-vCPU events. The s390 implementation for floating
>> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
>> by whatever CPU comes first. To limit the halt polling we only mark the
>> woken-up CPU as a valid poll. This code will also cover several other
>> wakeup reasons like IPIs or expired timers. This will of course also mark
>> some events as not successful. As KVM on z always runs as a 2nd-level
>> hypervisor, we prefer not to poll unless we are really sure, though.
>>
>> So we start with a minimal set and will provide additional patches in
>> the future that mark additional code paths as valid wakeups, if that
>> turns out to be necessary.
>>
>> This patch successfully limits the CPU usage for cases like a uperf 1-byte
>> transactional ping-pong workload or a wakeup-heavy workload like OLTP,
>> while still providing a proper speedup.
>>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>> ---
>> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
>> @@ -976,6 +976,14 @@ no_timer:
>>  
>>  void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
>>  {
>> +	/*
>> +	 * This is outside of the if because we want to mark the wakeup
>> +	 * as valid for vCPUs that
>> +	 * a: do polling right now
>> +	 * b: do sleep right now
>> +	 * otherwise we would never grow the poll interval properly
>> +	 */
>> +	vcpu_set_valid_wakeup(vcpu);
>>  	if (waitqueue_active(&vcpu->wq)) {
> 
> (Can't kvm_s390_vcpu_wakeup() be called when the vcpu isn't in
>  kvm_vcpu_block()?  Either this condition is useless or we'd set
>  vcpu_set_valid_wakeup() for any future wakeup.)

Yes, for example a timer might expire (see kvm_s390_idle_wakeup) AND the
vcpu was already woken up by an I/O interrupt and we are in the process of
leaving kvm_vcpu_block. And yes, we might over-indicate and set valid wakeup
in that case, but this is fine as this is just a heuristic which will recover.
 
The problem is that I cannot move vcpu_set_valid_wakeup inside the if,
because then a VCPU can be inside kvm_vcpu_block (polling) but the waitqueue
is not yet active. (In other words, the poll interval will be 0, or grow
once just to be reset to 0 afterwards.)

> 
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> @@ -224,6 +224,7 @@ struct kvm_vcpu {
>>  	sigset_t sigset;
>>  	struct kvm_vcpu_stat stat;
>>  	unsigned int halt_poll_ns;
>> +	bool valid_wakeup;
>>  
>>  #ifdef CONFIG_HAS_IOMEM
>>  	int mmio_needed;
>> @@ -1178,4 +1179,37 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
>>  				  uint32_t guest_irq, bool set);
>>  #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
>>  
>> +#ifdef CONFIG_HAVE_KVM_INVALID_POLLS
>> +/* If we wake up during the poll time, was it a successful poll? */
>> +static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
> 
> (smp barriers?)

Not sure. Do we need to order valid_wakeup against other stores/reads?
To me it looks like the order of stores/fetches for the different values
should not matter.
I can certainly add smp_rmb/wmb to getters/setters, but I cannot see a
problematic case right now, and barriers require comments. Can you elaborate
on what you see as a potential issue?

> 
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> @@ -41,6 +41,10 @@ config KVM_VFIO
>> +config HAVE_KVM_INVALID_POLLS
>> +       bool
>> +
>> +
> 
> (One newline is enough.)

sure.
> 

>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>  			 * arrives.
>>  			 */
>>  			if (kvm_vcpu_check_block(vcpu) < 0) {
>> -				++vcpu->stat.halt_successful_poll;
>> +				if (vcpu_valid_wakeup(vcpu))
>> +					++vcpu->stat.halt_successful_poll;
> 
> KVM didn't call schedule(), so it's still a successful poll, IMO, just
> invalid.

so just always do ++vcpu->stat.halt_successful_poll; and add another counter 
that counts polls that will not be used for growing/shrinking?
like
			++vcpu->stat.halt_successful_poll;
			if (!vcpu_valid_wakeup(vcpu))
				++vcpu->stat.halt_poll_no_tuning; 

?
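
(If that route is taken, the counter would presumably need a field and a
debugfs entry next to the existing halt_successful_poll - a hypothetical
sketch, with names mirroring the snippet above:

	/* arch/s390/include/asm/kvm_host.h */
	struct kvm_vcpu_stat {
		/* ... */
		u32 halt_successful_poll;
		u32 halt_poll_no_tuning;	/* polls excluded from grow/shrink */
		/* ... */
	};

	/* arch/s390/kvm/kvm-s390.c */
	struct kvm_stats_debugfs_item debugfs_entries[] = {
		/* ... */
		{ "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
		{ "halt_poll_no_tuning", VCPU_STAT(halt_poll_no_tuning) },
		/* ... */
	};
)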



> 
>>  				goto out;
>>  			}
>>  			cur = ktime_get();
>> @@ -2038,14 +2039,16 @@ out:
>>  		if (block_ns <= vcpu->halt_poll_ns)
>>  			;
>>  		/* we had a long block, shrink polling */
>> -		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
>> +		else if (!vcpu_valid_wakeup(vcpu) ||
>> +			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>>  			shrink_halt_poll_ns(vcpu);
> 
> Is the shrinking important?
> 
>>  		/* we had a short halt and our poll time is too small */
>>  		else if (vcpu->halt_poll_ns < halt_poll_ns &&
>> -			block_ns < halt_poll_ns)
>> +			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>>  			grow_halt_poll_ns(vcpu);
> 
> IIUC, the problem comes from overgrown halt_poll_ns, so couldn't we just
> ignore all invalid wakeups?

I have some pathological cases where I can easily get all CPUs to poll all
the time without the shrinking part of the patch (e.g. a guest with 16 CPUs,
8 null block devices and 64 dd processes reading small blocks with O_DIRECT
from these disks), which causes permanent exits that consume all 16 host CPUs.
Limiting the grow did not seem to be enough in my testing, but when I also
made shrinking more aggressive, things improved.

But I am certainly open to other ideas on how to tune this.



> 
> It would make more sense to me, because we are not interested in latency
> of invalid wakeups, so they shouldn't affect valid ones.
> 
>>  	} else
>>  		vcpu->halt_poll_ns = 0;
>> +	vcpu_reset_wakeup(vcpu);
>>  
>>  	trace_kvm_vcpu_wakeup(block_ns, waited);
> 
> (Tracing valid/invalid wakeups could be useful.)

As an extension of the old trace events?


Radim Krčmář May 2, 2016, 3:25 p.m. UTC | #6
2016-05-02 16:30+0200, Christian Borntraeger:
> On 05/02/2016 03:34 PM, Radim Krčmář wrote:
>> 2016-05-02 12:42+0200, Christian Borntraeger:
>>> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
>>> @@ -976,6 +976,14 @@ no_timer:
>>>  
>>>  void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
>>>  {
>>> +	/*
>>> +	 * This is outside of the if because we want to mark the wakeup
>>> +	 * as valid for vCPUs that
>>> +	 * a: do polling right now
>>> +	 * b: do sleep right now
>>> +	 * otherwise we would never grow the poll interval properly
>>> +	 */
>>> +	vcpu_set_valid_wakeup(vcpu);
>>>  	if (waitqueue_active(&vcpu->wq)) {
>> 
>> (Can't kvm_s390_vcpu_wakeup() be called when the vcpu isn't in
>>  kvm_vcpu_block()?  Either this condition is useless or we'd set
>>  vcpu_set_valid_wakeup() for any future wakeup.)
> 
> Yes, for example a timer might expire (see kvm_s390_idle_wakeup) AND the
> vcpu was already woken up by an I/O interrupt and we are in the process of
> leaving kvm_vcpu_block. And yes, we might over-indicate and set valid wakeup
> in that case, but this is fine as this is just a heuristic which will recover.
> 
> The problem is that I cannot move vcpu_set_valid_wakeup inside the if,
> because then a VCPU can be inside kvm_vcpu_block (polling) but the waitqueue
> is not yet active. (In other words, the poll interval will be 0, or grow
> once just to be reset to 0 afterwards.)

I see, thanks.

>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> @@ -224,6 +224,7 @@ struct kvm_vcpu {
>>>  	sigset_t sigset;
>>>  	struct kvm_vcpu_stat stat;
>>>  	unsigned int halt_poll_ns;
>>> +	bool valid_wakeup;
>>>  
>>>  #ifdef CONFIG_HAS_IOMEM
>>>  	int mmio_needed;
>>> @@ -1178,4 +1179,37 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
>>>  				  uint32_t guest_irq, bool set);
>>>  #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
>>>  
>>> +#ifdef CONFIG_HAVE_KVM_INVALID_POLLS
>>> +/* If we wake up during the poll time, was it a successful poll? */
>>> +static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
>> 
>> (smp barriers?)
> 
> Not sure. Do we need to order valid_wakeup against other stores/reads?
> To me it looks like the order of stores/fetches for the different values
> should not matter.

Yeah, I was forgetting that polling doesn't need to be perfect.

> I can certainly add smp_rmb/wmb to getters/setters, but I cannot see a
> problematic case right now, and barriers require comments. Can you elaborate
> on what you see as a potential issue?

I agree that it's fine to trust GCC and the CPU here, because it is just a
heuristic.

To the ignorable issue itself: The proper protocol for wakeup is
  1) set valid_wakeup to true
  2) set wakeup condition for kvm_vcpu_check_block().
  3) potentially wake up the vcpu
because we never check valid_wakeup without kvm_vcpu_check_block(),
hence we shouldn't allow read-ahead of valid_wakeup or late-setting of
valid_wakeup to avoid treating valid wakeups as invalid.
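
(Purely for illustration - the thread agrees the barriers are unnecessary
for a heuristic - a paired-barrier variant of the helpers could look like
the sketch below; the barrier placement is an assumption, not part of the
patch:

	static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
	{
		vcpu->valid_wakeup = true;
		/* order the flag before the wakeup condition that
		 * kvm_vcpu_check_block() will observe (step 2 above) */
		smp_wmb();
	}

	static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
	{
		/* pairs with the smp_wmb() in vcpu_set_valid_wakeup() */
		smp_rmb();
		return vcpu->valid_wakeup;
	}

assuming the waker calls vcpu_set_valid_wakeup() before publishing the
wakeup condition, and the sleeper evaluates kvm_vcpu_check_block() before
vcpu_valid_wakeup().)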

>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>>  			 * arrives.
>>>  			 */
>>>  			if (kvm_vcpu_check_block(vcpu) < 0) {
>>> -				++vcpu->stat.halt_successful_poll;
>>> +				if (vcpu_valid_wakeup(vcpu))
>>> +					++vcpu->stat.halt_successful_poll;
>> 
>> KVM didn't call schedule(), so it's still a successful poll, IMO, just
>> invalid.
> 
> so just always do ++vcpu->stat.halt_successful_poll; and add another counter 
> that counts polls that will not be used for growing/shrinking?
> like
> 			++vcpu->stat.halt_successful_poll;
> 			if (!vcpu_valid_wakeup(vcpu))
> 				++vcpu->stat.halt_poll_no_tuning; 
> 
> ?

Looks good.  Large numbers in halt_poll_no_tuning relative to
halt_successful_poll are a clearer warning flag.

>> 
>>>  				goto out;
>>>  			}
>>>  			cur = ktime_get();
>>> @@ -2038,14 +2039,16 @@ out:
>>>  		if (block_ns <= vcpu->halt_poll_ns)
>>>  			;
>>>  		/* we had a long block, shrink polling */
>>> -		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
>>> +		else if (!vcpu_valid_wakeup(vcpu) ||
>>> +			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>>>  			shrink_halt_poll_ns(vcpu);
>> 
>> Is the shrinking important?
>> 
>>>  		/* we had a short halt and our poll time is too small */
>>>  		else if (vcpu->halt_poll_ns < halt_poll_ns &&
>>> -			block_ns < halt_poll_ns)
>>> +			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>>>  			grow_halt_poll_ns(vcpu);
>> 
>> IIUC, the problem comes from overgrown halt_poll_ns, so couldn't we just
>> ignore all invalid wakeups?
> 
> I have some pathological cases where I can easily get all CPUs to poll all
> the time without the shrinking part of the patch (e.g. a guest with 16 CPUs,
> 8 null block devices and 64 dd processes reading small blocks with O_DIRECT
> from these disks), which causes permanent exits that consume all 16 host CPUs.
> Limiting the grow did not seem to be enough in my testing, but when I also
> made shrinking more aggressive, things improved.

So the problem is that a large number of VCPUs and devices will often
have a floating irq and the polling always succeeds unless halt_poll_ns
is very small.  The poll window doesn't change if the poll succeeds,
therefore we need a very aggressive shrinker in order to avoid polling?

> But I am certainly open to other ideas on how to tune this.

I don't see good improvements ... the problem seems to lie elsewhere:
Couldn't we exclude floating irqs from kvm_vcpu_check_block()?

(A VCPU running for other reasons could still handle a floating irq and
 we always kick one VCPU, so VM won't starve and other VCPUs won't be
 prevented from sleeping.)

>> It would make more sense to me, because we are not interested in latency
>> of invalid wakeups, so they shouldn't affect valid ones.
>> 
>>>  	} else
>>>  		vcpu->halt_poll_ns = 0;
>>> +	vcpu_reset_wakeup(vcpu);
>>>  
>>>  	trace_kvm_vcpu_wakeup(block_ns, waited);
>> 
>> (Tracing valid/invalid wakeups could be useful.)
> 
> As an extension of the old trace events?

Yes.
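
(A sketch of such an extension - the extra "valid" argument and the printk
format are a guess at a minimal change, not an agreed interface:

	TRACE_EVENT(kvm_vcpu_wakeup,
		TP_PROTO(__u64 ns, bool waited, bool valid),
		TP_ARGS(ns, waited, valid),

		TP_STRUCT__entry(
			__field(__u64, ns)
			__field(bool, waited)
			__field(bool, valid)
		),

		TP_fast_assign(
			__entry->ns = ns;
			__entry->waited = waited;
			__entry->valid = valid;
		),

		TP_printk("%s time %lld ns, polling %s",
			__entry->waited ? "wait" : "poll",
			__entry->ns,
			__entry->valid ? "valid" : "invalid")
	);

with the call site in kvm_vcpu_block() becoming
trace_kvm_vcpu_wakeup(block_ns, waited, vcpu_valid_wakeup(vcpu)).)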
David Matlack May 2, 2016, 7:44 p.m. UTC | #7
On Mon, May 2, 2016 at 3:42 AM, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
> Radim, Paolo,
>
> can you have a look at this patch? If you are ok with it, I want to
> submit this patch with my next s390 pull request. It touches KVM common
> code, but I tried to make it a nop for everything but s390.
>
> Christian
>
> ----snip----
>
>
> Some wakeups should not be considered a successful poll. For example on
> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> would be considered runnable - letting all vCPUs poll all the time for
> transaction-like workloads, even if one vCPU would be enough.
> This can result in huge CPU usage for large guests.
> This patch lets architectures provide a way to qualify wakeups as to
> whether they should be considered good or bad in regard to polls.
>
> For s390 the implementation will fence off halt polling for anything but
> known good, single-vCPU events. The s390 implementation for floating
> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> by whatever CPU comes first.

Can the delivery of the floating interrupt to the "first CPU" be done
by kvm_vcpu_check_block? If so, then kvm_vcpu_check_block can return
false for all other CPUs and the polling problem goes away.

> To limit the halt polling we only mark the
> woken-up CPU as a valid poll. This code will also cover several other
> wakeup reasons like IPIs or expired timers. This will of course also mark
> some events as not successful. As KVM on z always runs as a 2nd-level
> hypervisor, we prefer not to poll unless we are really sure, though.
>
> So we start with a minimal set and will provide additional patches in
> the future that mark additional code paths as valid wakeups, if that
> turns out to be necessary.
>
> This patch successfully limits the CPU usage for cases like a uperf 1-byte
> transactional ping-pong workload or a wakeup-heavy workload like OLTP,
> while still providing a proper speedup.
>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

Reviewed-By: David Matlack <dmatlack@google.com>
(I reviewed the non-s390 case, to make sure that this change is a nop.)

Request to be cc'd on halt-polling patches in the future. Thanks!

> ---
>  arch/s390/kvm/Kconfig     |  1 +
>  arch/s390/kvm/interrupt.c |  8 ++++++++
>  include/linux/kvm_host.h  | 34 ++++++++++++++++++++++++++++++++++
>  virt/kvm/Kconfig          |  4 ++++
>  virt/kvm/kvm_main.c       |  9 ++++++---
>  5 files changed, 53 insertions(+), 3 deletions(-)
>
> diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
> index 5ea5af3..ccfe6f6 100644
> --- a/arch/s390/kvm/Kconfig
> +++ b/arch/s390/kvm/Kconfig
> @@ -28,6 +28,7 @@ config KVM
>         select HAVE_KVM_IRQCHIP
>         select HAVE_KVM_IRQFD
>         select HAVE_KVM_IRQ_ROUTING
> +       select HAVE_KVM_INVALID_POLLS
>         select SRCU
>         select KVM_VFIO
>         ---help---
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> index 2130299..fade1b4 100644
> --- a/arch/s390/kvm/interrupt.c
> +++ b/arch/s390/kvm/interrupt.c
> @@ -976,6 +976,14 @@ no_timer:
>
>  void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
>  {
> +       /*
> +        * This is outside of the if because we want to mark the wakeup
> +        * as valid for vCPUs that
> +        * a: do polling right now
> +        * b: do sleep right now
> +        * otherwise we would never grow the poll interval properly
> +        */
> +       vcpu_set_valid_wakeup(vcpu);
>         if (waitqueue_active(&vcpu->wq)) {
>                 /*
>                  * The vcpu gave up the cpu voluntarily, mark it as a good
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 861f690..550beec 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -224,6 +224,7 @@ struct kvm_vcpu {
>         sigset_t sigset;
>         struct kvm_vcpu_stat stat;
>         unsigned int halt_poll_ns;
> +       bool valid_wakeup;
>
>  #ifdef CONFIG_HAS_IOMEM
>         int mmio_needed;
> @@ -1178,4 +1179,37 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
>                                   uint32_t guest_irq, bool set);
>  #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
>
> +#ifdef CONFIG_HAVE_KVM_INVALID_POLLS
> +/* If we wake up during the poll time, was it a successful poll? */
> +static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
> +{
> +       return vcpu->valid_wakeup;
> +}
> +
> +/* Mark the next wakeup as a non-successful poll */
> +static inline void vcpu_reset_wakeup(struct kvm_vcpu *vcpu)
> +{
> +       vcpu->valid_wakeup = false;
> +}
> +
> +/* Mark the next wakeup as a successful poll */
> +static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
> +{
> +       vcpu->valid_wakeup = true;
> +}
> +#else
> +static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
> +{
> +       return true;
> +}
> +
> +static inline void vcpu_reset_wakeup(struct kvm_vcpu *vcpu)
> +{
> +}
> +
> +static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
> +{
> +}
> +#endif /* CONFIG_HAVE_KVM_INVALID_POLLS */
> +
>  #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 7a79b68..b9edb51 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -41,6 +41,10 @@ config KVM_VFIO
>  config HAVE_KVM_ARCH_TLB_FLUSH_ALL
>         bool
>
> +config HAVE_KVM_INVALID_POLLS
> +       bool
> +
> +
>  config KVM_GENERIC_DIRTYLOG_READ_PROTECT
>         bool
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9102ae1..d63ea60 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>                          * arrives.
>                          */
>                         if (kvm_vcpu_check_block(vcpu) < 0) {
> -                               ++vcpu->stat.halt_successful_poll;
> +                               if (vcpu_valid_wakeup(vcpu))
> +                                       ++vcpu->stat.halt_successful_poll;
>                                 goto out;
>                         }
>                         cur = ktime_get();
> @@ -2038,14 +2039,16 @@ out:
>                 if (block_ns <= vcpu->halt_poll_ns)
>                         ;
>                 /* we had a long block, shrink polling */
> -               else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> +               else if (!vcpu_valid_wakeup(vcpu) ||
> +                       (vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>                         shrink_halt_poll_ns(vcpu);
>                 /* we had a short halt and our poll time is too small */
>                 else if (vcpu->halt_poll_ns < halt_poll_ns &&
> -                       block_ns < halt_poll_ns)
> +                       block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>                         grow_halt_poll_ns(vcpu);
>         } else
>                 vcpu->halt_poll_ns = 0;
> +       vcpu_reset_wakeup(vcpu);
>
>         trace_kvm_vcpu_wakeup(block_ns, waited);
>  }
> --
> 2.3.0
>
Wanpeng Li May 3, 2016, 5:42 a.m. UTC | #8
2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
[...]
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9102ae1..d63ea60 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>                          * arrives.
>                          */
>                         if (kvm_vcpu_check_block(vcpu) < 0) {
> -                               ++vcpu->stat.halt_successful_poll;
> +                               if (vcpu_valid_wakeup(vcpu))
> +                                       ++vcpu->stat.halt_successful_poll;
>                                 goto out;
>                         }
>                         cur = ktime_get();
> @@ -2038,14 +2039,16 @@ out:
>                 if (block_ns <= vcpu->halt_poll_ns)
>                         ;
>                 /* we had a long block, shrink polling */
> -               else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> +               else if (!vcpu_valid_wakeup(vcpu) ||
> +                       (vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>                         shrink_halt_poll_ns(vcpu);
>                 /* we had a short halt and our poll time is too small */
>                 else if (vcpu->halt_poll_ns < halt_poll_ns &&
> -                       block_ns < halt_poll_ns)
> +                       block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>                         grow_halt_poll_ns(vcpu);
>         } else
>                 vcpu->halt_poll_ns = 0;
> +       vcpu_reset_wakeup(vcpu);

Why mark the next wakeup as a non-successful poll?

Regards,
Wanpeng Li
Christian Borntraeger May 3, 2016, 7 a.m. UTC | #9
On 05/03/2016 07:42 AM, Wanpeng Li wrote:
> 2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> [...]
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 9102ae1..d63ea60 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>>                          * arrives.
>>                          */
>>                         if (kvm_vcpu_check_block(vcpu) < 0) {
>> -                               ++vcpu->stat.halt_successful_poll;
>> +                               if (vcpu_valid_wakeup(vcpu))
>> +                                       ++vcpu->stat.halt_successful_poll;
>>                                 goto out;
>>                         }
>>                         cur = ktime_get();
>> @@ -2038,14 +2039,16 @@ out:
>>                 if (block_ns <= vcpu->halt_poll_ns)
>>                         ;
>>                 /* we had a long block, shrink polling */
>> -               else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
>> +               else if (!vcpu_valid_wakeup(vcpu) ||
>> +                       (vcpu->halt_poll_ns && block_ns > halt_poll_ns))
>>                         shrink_halt_poll_ns(vcpu);
>>                 /* we had a short halt and our poll time is too small */
>>                 else if (vcpu->halt_poll_ns < halt_poll_ns &&
>> -                       block_ns < halt_poll_ns)
>> +                       block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
>>                         grow_halt_poll_ns(vcpu);
>>         } else
>>                 vcpu->halt_poll_ns = 0;
>> +       vcpu_reset_wakeup(vcpu);
> 
> Why mark the next wakeup as a non-successful poll?

It is basically only used for s390 and used as a means to implement the "default off,
only on for selected cases" approach. But yes, if somebody else wants to use it this
might need to be changed.
So what about changing this into
kvm_arch_vcpu_block_finish(vcpu)
which is a reset on s390 and a no-op for the others?
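
(A minimal sketch of that proposal - the hook name comes from the mail
above, but the placement and bodies are an assumption, not a posted patch:

	/* virt/kvm/kvm_main.c, end of kvm_vcpu_block(), replacing the
	 * unconditional vcpu_reset_wakeup(vcpu): */
	kvm_arch_vcpu_block_finish(vcpu);

	/* s390: reset the heuristic after every block */
	void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu)
	{
		vcpu->valid_wakeup = false;
	}

	/* every other architecture: a no-op */
	static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
)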

Christian

Wanpeng Li May 3, 2016, 7:50 a.m. UTC | #10
2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
[...]
> Some wakeups should not be considered a successful poll. For example on
> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> would be considered runnable - letting all vCPUs poll all the time for
> transaction-like workloads, even if one vCPU would be enough.
> This can result in huge CPU usage for large guests.
> This patch lets architectures provide a way to qualify wakeups as to
> whether they should be considered good or bad in regard to polls.
>
> For s390 the implementation will fence off halt polling for anything but
> known good, single-vCPU events. The s390 implementation for floating
> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> by whatever CPU comes first. To limit the halt polling we only mark the

Does the floating interrupt mean that the 'CPU that comes first' will
deliver the interrupt to all vCPUs?

Regards,
Wanpeng Li
Cornelia Huck May 3, 2016, 8 a.m. UTC | #11
On Tue, 3 May 2016 15:50:25 +0800
Wanpeng Li <kernellwp@gmail.com> wrote:

> 2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> [...]
> > Some wakeups should not be considered a successful poll. For example on
> > s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> > would be considered runnable - letting all vCPUs poll all the time for
> > transaction-like workloads, even if one vCPU would be enough.
> > This can result in huge CPU usage for large guests.
> > This patch lets architectures provide a way to qualify wakeups as to
> > whether they should be considered good or bad in regard to polls.
> >
> > For s390 the implementation will fence off halt polling for anything but
> > known good, single-vCPU events. The s390 implementation for floating
> > interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> > by whatever CPU comes first. To limit the halt polling we only mark the
> 
> Does the floating interrupt mean that the 'CPU that comes first' will
> deliver the interrupt to all vCPUs?

Floating interrupt on s390 means "deliver on any vcpu that matches the
criteria, but only on one".

Christian Borntraeger May 3, 2016, 8 a.m. UTC | #12
On 05/03/2016 09:50 AM, Wanpeng Li wrote:
> 2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> [...]
>> Some wakeups should not be considered a successful poll. For example on
>> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
>> would be considered runnable - letting all vCPUs poll all the time for
>> transaction-like workloads, even if one vCPU would be enough.
>> This can result in huge CPU usage for large guests.
>> This patch lets architectures provide a way to qualify wakeups as to
>> whether they should be considered good or bad in regard to polls.
>>
>> For s390 the implementation will fence off halt polling for anything but
>> known good, single-vCPU events. The s390 implementation for floating
>> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
>> by whatever CPU comes first. To limit the halt polling we only mark the
> 
> Does the floating interrupt mean that the 'CPU that comes first' will
> deliver the interrupt to all vCPUs?

No. All CPUs run the normal vcpu_run loop, and before entering the guest
every CPU will try to deliver pending interrupts:

static int __vcpu_run(struct kvm_vcpu *vcpu)
{
[...]
        do {
                rc = vcpu_pre_run(vcpu);   ---------------------+
                if (rc)						|
                        break;					|
[...]								|
                exit_reason = sie64a(vcpu->arch.sie_block,	|
                                     vcpu->run->s.regs.gprs);	|
[...]								|
}								|
								|
static int vcpu_pre_run(struct kvm_vcpu *vcpu)		<-------+
{
[...]
        if (!kvm_is_ucontrol(vcpu->kvm)) {
                rc = kvm_s390_deliver_pending_interrupts(vcpu); <----
                if (rc)
                        return rc;
[...]
}



and whichever comes first will dequeue that interrupt and deliver it
by doing the PSW swap (jumping to the interrupt handler address)

(other CPUs will then not deliver this interrupt as it is already dequeued)


Christian Borntraeger May 3, 2016, 8:46 a.m. UTC | #13
On 05/02/2016 09:44 PM, David Matlack wrote:
> On Mon, May 2, 2016 at 3:42 AM, Christian Borntraeger
> <borntraeger@de.ibm.com> wrote:
>> Radim, Paolo,
>>
>> can you have a look at this patch? If you are ok with it, I want to
>> submit this patch with my next s390 pull request. It touches KVM common
>> code, but I tried to make it a nop for everything but s390.
>>
>> Christian
>>
>> ----snip----
>>
>>
>> Some wakeups should not be considered a successful poll. For example on
>> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
>> would be considered runnable - letting all vCPUs poll all the time for
>> transaction-like workloads, even if one vCPU would be enough.
>> This can result in huge CPU usage for large guests.
>> This patch lets architectures provide a way to qualify wakeups as to
>> whether they should be considered good or bad in regard to polls.
>>
>> For s390 the implementation will fence off halt polling for anything but
>> known good, single-vCPU events. The s390 implementation for floating
>> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
>> by whatever CPU comes first.
> 
> Can the delivery of the floating interrupt to the "first CPU" be done
> by kvm_vcpu_check_block? If so, then kvm_vcpu_check_block can return
> false for all other CPUs and the polling problem goes away.
> 

The delivery of interrupts is always done inside the __vcpu_run function.
So when we leave kvm_vcpu_block we will come back to __vcpu_run and
deliver pending interrupts (if not masked by PSW or control registers)
according to their priority.
I remember that some time ago we had a reason why we could not deliver
in kvm_vcpu_block but I forgot why :-/



>> To limit the halt polling we only mark the
>> woken-up CPU as a valid poll. This code will also cover several other
>> wakeup reasons like IPIs or expired timers. This will of course also mark
>> some events as not successful. As KVM on z always runs as a 2nd-level
>> hypervisor, we prefer not to poll unless we are really sure, though.
>>
>> So we start with a minimal set and will provide additional patches in
>> the future that mark additional code paths as valid wakeups, if that
>> turns out to be necessary.
>>
>> This patch successfully limits the CPU usage for cases like a uperf 1-byte
>> transactional ping-pong workload or a wakeup-heavy workload like OLTP,
>> while still providing a proper speedup.
>>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> 
> Reviewed-By: David Matlack <dmatlack@google.com>
> (I reviewed the non-s390 case, to make sure that this change is a nop.)
> 
> Request to be cc'd on halt-polling patches in the future. Thanks!

Sure.

David Hildenbrand May 3, 2016, 8:48 a.m. UTC | #14
> On 05/03/2016 09:50 AM, Wanpeng Li wrote:
> > 2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> > [...]  
> >> Some wakeups should not be considered a successful poll. For example on
> >> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
> >> would be considered runnable - letting all vCPUs poll all the time for
> >> transaction-like workloads, even if one vCPU would be enough.
> >> This can result in huge CPU usage for large guests.
> >> This patch lets architectures provide a way to qualify wakeups as to
> >> whether they should be considered good or bad in regard to polls.
> >>
> >> For s390 the implementation will fence off halt polling for anything but
> >> known good, single-vCPU events. The s390 implementation for floating
> >> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
> >> by whatever CPU comes first. To limit the halt polling we only mark the
> > 
> > Does the floating interrupt mean that the 'CPU that comes first' will
> > deliver the interrupt to all vCPUs?
> 
> No. All CPUs do the normal vcpu_run loop. And before entering the guest
> every CPU will try to deliver pending interrupts
> 
> static int __vcpu_run(struct kvm_vcpu *vcpu)
> {
> [...]
>         do {
>                 rc = vcpu_pre_run(vcpu);   ---------------------+
>                 if (rc)						|
>                         break;					|
> [...]								|
>                 exit_reason = sie64a(vcpu->arch.sie_block,	|
>                                      vcpu->run->s.regs.gprs);	|
> [...]								|
> }								|
> 								|
> static int vcpu_pre_run(struct kvm_vcpu *vcpu)		<-------+
> {
> [...]
>         if (!kvm_is_ucontrol(vcpu->kvm)) {
>                 rc = kvm_s390_deliver_pending_interrupts(vcpu); <----
>                 if (rc)
>                         return rc;
> [...]
> }
> 
> 
> 
> and whichever comes first will dequeue that interrupt and deliver it
> by doing the PSW swap (jumping to the interrupt handler address)
> 
> (other CPUs will then not deliver this interrupt as it is already dequeued)
> 
> 

And regarding the question of whether we should exclude floating IRQs from the
blocked check: that must not be done, as floating IRQ (groups) can be disabled
on certain CPUs only.

An operating system is free to set up only certain CPUs to check for interrupts
(and disable it on the others). So all VCPUs have to check for floating IRQs,
otherwise situations might be provoked where floating IRQs are pending but not
delivered to any VCPU.

David

Christian Borntraeger May 3, 2016, 8:55 a.m. UTC | #15
On 05/02/2016 05:25 PM, Radim Krčmář wrote:
[...]
>> I have some pathological cases where I can easily get all CPUs to poll all
>> the time without the shrinking part of the patch (e.g. a guest with 16 CPUs,
>> 8 null block devices and 64 dd processes reading small blocks with O_DIRECT
>> from these disks), which causes permanent exits that consume all 16 host CPUs.
>> Limiting the grow did not seem to be enough in my testing, but when I also
>> made shrinking more aggressive, things improved.
> 
> So the problem is that a large number of VCPUs and devices will often
> have a floating irq and the polling always succeeds unless halt_poll_ns
> is very small.  The poll window doesn't change if the poll succeeds,
> therefore we need a very aggressive shrinker in order to avoid polling?

Yes, thats what I concluded after experimenting.

> 
>> But I am certainly open to other ideas on how to tune this.
> 
> I don't see good improvements ... the problem seems to lie elsewhere:
> Couldn't we exclude floating irqs from kvm_vcpu_check_block()?
> 
> (A VCPU running for other reasons could still handle a floating irq and
>  we always kick one VCPU, so VM won't starve and other VCPUs won't be
>  prevented from sleeping.)


I thought about that in my first experiments, but we really have to leave
vcpu_block in all cases, otherwise we might add huge latencies or even
starve the delivery. For example, the other CPUs can block specific
interruption subclasses via control register 6.


>>> It would make more sense to me, because we are not interested in latency
>>> of invalid wakeups, so they shouldn't affect valid ones.
>>>
>>>>  	} else
>>>>  		vcpu->halt_poll_ns = 0;
>>>> +	vcpu_reset_wakeup(vcpu);
>>>>  
>>>>  	trace_kvm_vcpu_wakeup(block_ns, waited);
>>>
>>> (Tracing valid/invalid wakeups could be useful.)
>>
>> As an extension of the old trace events?
> 
> Yes.
> 

Cornelia Huck May 3, 2016, 9:19 a.m. UTC | #16
On Tue, 3 May 2016 09:00:41 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 05/03/2016 07:42 AM, Wanpeng Li wrote:
> > 2016-05-02 18:42 GMT+08:00 Christian Borntraeger <borntraeger@de.ibm.com>:
> > [...]
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index 9102ae1..d63ea60 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -2008,7 +2008,8 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> >>                          * arrives.
> >>                          */
> >>                         if (kvm_vcpu_check_block(vcpu) < 0) {
> >> -                               ++vcpu->stat.halt_successful_poll;
> >> +                               if (vcpu_valid_wakeup(vcpu))
> >> +                                       ++vcpu->stat.halt_successful_poll;
> >>                                 goto out;
> >>                         }
> >>                         cur = ktime_get();
> >> @@ -2038,14 +2039,16 @@ out:
> >>                 if (block_ns <= vcpu->halt_poll_ns)
> >>                         ;
> >>                 /* we had a long block, shrink polling */
> >> -               else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
> >> +               else if (!vcpu_valid_wakeup(vcpu) ||
> >> +                       (vcpu->halt_poll_ns && block_ns > halt_poll_ns))
> >>                         shrink_halt_poll_ns(vcpu);
> >>                 /* we had a short halt and our poll time is too small */
> >>                 else if (vcpu->halt_poll_ns < halt_poll_ns &&
> >> -                       block_ns < halt_poll_ns)
> >> +                       block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
> >>                         grow_halt_poll_ns(vcpu);
> >>         } else
> >>                 vcpu->halt_poll_ns = 0;
> >> +       vcpu_reset_wakeup(vcpu);
> > 
> > Why mark the next wakeup as a non-successful poll?
> 
> It is basically only used for s390 and used as a means to implement the "default off,
> only on for selected cases" approach. But yes, if somebody else wants to use it this
> might need to be changed.
> So what about changing this into
> kvm_arch_vcpu_block_finish(vcpu)
> which is a reset on s390 and a no-op for the others?

I like that idea.

Paolo Bonzini May 10, 2016, 1:54 p.m. UTC | #17
On 03/05/2016 09:00, Christian Borntraeger wrote:
>>> >> +       vcpu_reset_wakeup(vcpu);
>> > 
>> > Why mark the next wakeup as a non-successful poll?
> It is basically only used for s390 and used as a means to implement the "default off,
> only on for selected cases" approach. But yes, if somebody else wants to use it this
> might need to be changed.
> So what about changing this into
> kvm_arch_vcpu_block_finish(vcpu)
> which is a reset on s390 and a no-op for the others?

I think this is okay as is.

Paolo

Patch

diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 5ea5af3..ccfe6f6 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -28,6 +28,7 @@  config KVM
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select HAVE_KVM_IRQ_ROUTING
+	select HAVE_KVM_INVALID_POLLS
 	select SRCU
 	select KVM_VFIO
 	---help---
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 2130299..fade1b4 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -976,6 +976,14 @@  no_timer:
 
 void kvm_s390_vcpu_wakeup(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * This is outside of the if because we want to mark the wakeup
+	 * as valid for vCPUs that
+	 * a: do polling right now
+	 * b: do sleep right now
+	 * otherwise we would never grow the poll interval properly
+	 */
+	vcpu_set_valid_wakeup(vcpu);
 	if (waitqueue_active(&vcpu->wq)) {
 		/*
 		 * The vcpu gave up the cpu voluntarily, mark it as a good
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 861f690..550beec 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -224,6 +224,7 @@  struct kvm_vcpu {
 	sigset_t sigset;
 	struct kvm_vcpu_stat stat;
 	unsigned int halt_poll_ns;
+	bool valid_wakeup;
 
 #ifdef CONFIG_HAS_IOMEM
 	int mmio_needed;
@@ -1178,4 +1179,37 @@  int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
 				  uint32_t guest_irq, bool set);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
 
+#ifdef CONFIG_HAVE_KVM_INVALID_POLLS
+/* If we wake up during the poll time, was it a successful poll? */
+static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
+{
+	return vcpu->valid_wakeup;
+}
+
+/* Mark the next wakeup as a non-successful poll */
+static inline void vcpu_reset_wakeup(struct kvm_vcpu *vcpu)
+{
+	vcpu->valid_wakeup = false;
+}
+
+/* Mark the next wakeup as a successful poll */
+static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
+{
+	vcpu->valid_wakeup = true;
+}
+#else
+static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
+{
+	return true;
+}
+
+static inline void vcpu_reset_wakeup(struct kvm_vcpu *vcpu)
+{
+}
+
+static inline void vcpu_set_valid_wakeup(struct kvm_vcpu *vcpu)
+{
+}
+#endif /* CONFIG_HAVE_KVM_INVALID_POLLS */
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 7a79b68..b9edb51 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -41,6 +41,10 @@  config KVM_VFIO
 config HAVE_KVM_ARCH_TLB_FLUSH_ALL
        bool
 
+config HAVE_KVM_INVALID_POLLS
+       bool
+
+
 config KVM_GENERIC_DIRTYLOG_READ_PROTECT
        bool
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9102ae1..d63ea60 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2008,7 +2008,8 @@  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 			 * arrives.
 			 */
 			if (kvm_vcpu_check_block(vcpu) < 0) {
-				++vcpu->stat.halt_successful_poll;
+				if (vcpu_valid_wakeup(vcpu))
+					++vcpu->stat.halt_successful_poll;
 				goto out;
 			}
 			cur = ktime_get();
@@ -2038,14 +2039,16 @@  out:
 		if (block_ns <= vcpu->halt_poll_ns)
 			;
 		/* we had a long block, shrink polling */
-		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
+		else if (!vcpu_valid_wakeup(vcpu) ||
+			(vcpu->halt_poll_ns && block_ns > halt_poll_ns))
 			shrink_halt_poll_ns(vcpu);
 		/* we had a short halt and our poll time is too small */
 		else if (vcpu->halt_poll_ns < halt_poll_ns &&
-			block_ns < halt_poll_ns)
+			block_ns < halt_poll_ns && vcpu_valid_wakeup(vcpu))
 			grow_halt_poll_ns(vcpu);
 	} else
 		vcpu->halt_poll_ns = 0;
+	vcpu_reset_wakeup(vcpu);
 
 	trace_kvm_vcpu_wakeup(block_ns, waited);
 }