[RFC] kvm: x86: export vCPU halted state to sysfs

Message ID 20180201125441.2f5b4fdd@redhat.com (mailing list archive)
State New, archived

Commit Message

Luiz Capitulino Feb. 1, 2018, 5:54 p.m. UTC
Libvirt needs to know when a vCPU is halted. To get this information,
libvirt has started using the query-cpus command from QEMU. However,
if the in-kernel irqchip is in use, query-cpus will force all vCPUs
to user-space since they have to issue the KVM_GET_MP_STATE ioctl.
This has catastrophic implications for low-latency workloads like
KVM-RT and zero packet loss with DPDK. To make matters worse, there's
an OpenStack service called ceilometer that causes libvirt to
issue query-cpus every few minutes.

The solution proposed in this patch is to export the vCPU
halted state in the already existing vcpu directory in sysfs.
This way, libvirt can read the vCPU halted state from sysfs and avoid
using the query-cpus command. This solution seems to be sufficient
for libvirt needs, but it has the following cons:

 * vcpu information in sysfs lives in a debug directory, so
   libvirt would be basing its API on debug info

 * Currently, only x86 supports the vcpu dir in sysfs, so
   we'd have to expand this to other archs (should be doable)

If we agree that this solution is feasible, I'll work on extending
the vcpu debug information to other archs for my next posting.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 arch/x86/kvm/debugfs.c   | 15 +++++++++++++++
 arch/x86/kvm/x86.c       |  2 ++
 include/linux/kvm_host.h | 11 +++++++++++
 3 files changed, 28 insertions(+)
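
The debugfs.c hunk is not quoted above; a minimal sketch of what the
per-vCPU "halted" entry could look like, modeled on the existing
tsc-offset code in arch/x86/kvm/debugfs.c (kvm_vcpu_get_halted() is a
helper assumed from this RFC, not an existing kernel API):

  static int vcpu_get_halted(void *data, u64 *val)
  {
          struct kvm_vcpu *vcpu = data;

          *val = kvm_vcpu_get_halted(vcpu);   /* assumed RFC helper */
          return 0;
  }

  DEFINE_SIMPLE_ATTRIBUTE(vcpu_halted_fops, vcpu_get_halted, NULL, "%llu\n");

  /* in kvm_arch_create_vcpu_debugfs(), next to the "tsc-offset" file: */
  debugfs_create_file("halted", 0444, vcpu->debugfs_dentry,
                      vcpu, &vcpu_halted_fops);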

Comments

Radim Krčmář Feb. 1, 2018, 8:15 p.m. UTC | #1
2018-02-01 12:54-0500, Luiz Capitulino:
> 
> Libvirt needs to know when a vCPU is halted. To get this information,

I don't see why upper level management should care about that; a single
bit about halted state that can be incorrect at the time it is processed
seems of very limited use.

(A much more sensible data point would be the fraction of time when VCPU
 was running or runnable, which is roughly what you get by sampling the
 halted state.)

A halted vCPU might even be halted in guest mode, so KVM doesn't know
about that state (unless you force a VM exit), which would complicate
the collection a bit more ... but really, what is the data being used
for?

User might care about the state, for obscure reasons, but that isn't a
performance problem.

> libvirt has started using the query-cpus command from QEMU. However,
> if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> to user-space since they have to issue the KVM_GET_MP_STATE ioctl.

Libvirt knows if KVM exits to userspace on halts, so it can just query
QEMU in that case and in the other case, there is a very dirty
"solution" that works on all architectures right now:

  grep kvm_vcpu_block /proc/$vcpu_task/stack

If you get something, the vcpu is halted in KVM.
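
(That one-liner as a small C helper, for illustration; it assumes the
caller already mapped vCPUs to thread IDs, e.g. via QMP's thread_id,
and reading /proc/<tid>/stack requires root:)

  #include <stdio.h>
  #include <string.h>

  /* 1 if the vCPU thread is sleeping in kvm_vcpu_block(), 0 if it
   * is not, -1 if the stack file cannot be read. */
  static int vcpu_halted_in_kvm(int tid)
  {
          char path[64], line[256];
          FILE *f;
          int halted = 0;

          snprintf(path, sizeof(path), "/proc/%d/stack", tid);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f))
                  if (strstr(line, "kvm_vcpu_block")) {
                          halted = 1;
                          break;
                  }
          fclose(f);
          return halted;
  }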

> This has catastrophic implications for low-latency workloads like
> KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> an OpenStack service called ceilometer that causes libvirt to
> issue query-cpus every few minutes.

I'd expect that people running these workloads can set up the system. :(

I bet that ceilometer just mindlessly collects everything, so we should
be able to configure libvirt to collect only some stats.  Either libvirt
or upper layer would decide what is too expensive for its usefulness.

> The solution proposed in this patch is to export the vCPU
> halted state in the already existing vcpu directory in sysfs.
> This way, libvirt can read the vCPU halted state from sysfs and avoid
> using the query-cpus command. This solution seems to be sufficient
> for libvirt needs, but it has the following cons:
> 
>  * vcpu information in sysfs lives in a debug directory, so
>    libvirt would be basing its API on debug info

(It pains me to say there probably already are tools that depend on
 kvm/debug.)

It's slightly better than the stack hack, but needs more code in kernel
and the interface is in a gray compatibility zone, so I'd like to know
why userspace does that in the first place.

>  * Currently, only x86 supports the vcpu dir in sysfs, so
>    we'd have to expand this to other archs (should be doable)
> 
> If we agree that this solution is feasible, I'll work on extending
> the vcpu debug information to other archs for my next posting.
> 
> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> ---
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -6273,6 +6273,7 @@ void kvm_arch_exit(void)
>  
>  int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
>  {
> +	kvm_vcpu_set_halted(vcpu);

There is no point in caring about !lapic_in_kernel().  I'd move the logic
into vcpu_block() to be shared among all architectures.

>  	++vcpu->stat.halt_exits;
>  	if (lapic_in_kernel(vcpu)) {
>  		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
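
(A sketch of that suggestion, for illustration: set/clear the flag
around the generic sleep loop in virt/kvm/kvm_main.c, paraphrasing
4.15-era code; kvm_vcpu_set_halted()/kvm_vcpu_clear_halted() are the
helpers assumed from this RFC:)

  void kvm_vcpu_block(struct kvm_vcpu *vcpu)
  {
          DECLARE_SWAITQUEUE(wait);

          /* ... halt-polling elided ... */
          kvm_vcpu_set_halted(vcpu);              /* assumed RFC helper */
          for (;;) {
                  prepare_to_swait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
                  if (kvm_vcpu_check_block(vcpu) < 0)
                          break;
                  schedule();
          }
          finish_swait(&vcpu->wq, &wait);
          kvm_vcpu_clear_halted(vcpu);            /* assumed RFC helper */
          /* ... stats and tracepoints elided ... */
  }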
Eduardo Habkost Feb. 1, 2018, 8:26 p.m. UTC | #2
On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
> 2018-02-01 12:54-0500, Luiz Capitulino:
> > 
> > Libvirt needs to know when a vCPU is halted. To get this information,
> 
> I don't see why upper level management should care about that; a single
> bit about halted state that can be incorrect at the time it is processed
> seems of very limited use.

I don't see why, either.

I'm CCing libvir-list and the people involved in the code that
added halt state to libvirt domain statistics.

> 
> (A much more sensible data point would be the fraction of time when VCPU
>  was running or runnable, which is roughly what you get by sampling the
>  halted state.)
> 
> A halted vCPU might even be halted in guest mode, so KVM doesn't know
> about that state (unless you force a VM exit), which would complicate
> the collection a bit more ... but really, what is the data being used
> for?
> 
> User might care about the state, for obscure reasons, but that isn't a
> performance problem.
> 
> > libvirt has started using the query-cpus command from QEMU. However,
> > if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> > to user-space since they have to issue the KVM_GET_MP_STATE ioctl.
> 
> Libvirt knows if KVM exits to userspace on halts, so it can just query
> QEMU in that case and in the other case, there is a very dirty
> "solution" that works on all architectures right now:
> 
>   grep kvm_vcpu_block /proc/$vcpu_task/stack
> 
> If you get something, the vcpu is halted in KVM.

Nice.


> 
> > This has catastrophic implications for low-latency workloads like
> > KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> > an OpenStack service called ceilometer that causes libvirt to
> > issue query-cpus every few minutes.
> 
> I'd expect that people running these workloads can set up the system. :(
> 
> I bet that ceilometer just mindlessly collects everything, so we should
> be able to configure libvirt to collect only some stats.  Either libvirt
> or upper layer would decide what is too expensive for its usefulness.

Yes.  Including expensive-to-collect halt state in
VIR_DOMAIN_STATS_VCPU is a serious performance regression in
libvirt.

> 
> > The solution proposed in this patch is to export the vCPU
> > halted state in the already existing vcpu directory in sysfs.
> > This way, libvirt can read the vCPU halted state from sysfs and avoid
> > using the query-cpus command. This solution seems to be sufficient
> > for libvirt needs, but it has the following cons:
> > 
> >  * vcpu information in sysfs lives in a debug directory, so
> >    libvirt would be basing its API on debug info
> 
> (It pains me to say there probably already are tools that depend on
>  kvm/debug.)
> 
> It's slightly better than the stack hack, but needs more code in kernel
> and the interface is in a gray compatibility zone, so I'd like to know
> why userspace does that in the first place.
> 
> >  * Currently, only x86 supports the vcpu dir in sysfs, so
> >    we'd have to expand this to other archs (should be doable)
> > 
> > If we agree that this solution is feasible, I'll work on extending
> > the vcpu debug information to other archs for my next posting.
> > 
> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> > ---
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > @@ -6273,6 +6273,7 @@ void kvm_arch_exit(void)
> >  
> >  int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> >  {
> > +	kvm_vcpu_set_halted(vcpu);
> 
> There is no point in caring about !lapic_in_kernel().  I'd move the logic
> into vcpu_block() to be shared among all architectures.
> 
> >  	++vcpu->stat.halt_exits;
> >  	if (lapic_in_kernel(vcpu)) {
> >  		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
Daniel P. Berrangé Feb. 2, 2018, 12:47 p.m. UTC | #3
On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
> 2018-02-01 12:54-0500, Luiz Capitulino:
> > 
> > Libvirt needs to know when a vCPU is halted. To get this information,
> 
> I don't see why upper level management should care about that; a single
> bit about halted state that can be incorrect at the time it is processed
> seems of very limited use.
> 
> (A much more sensible data point would be the fraction of time when VCPU
>  was running or runnable, which is roughly what you get by sampling the
>  halted state.)
> 
> A halted vCPU might even be halted in guest mode, so KVM doesn't know
> about that state (unless you force a VM exit), which would complicate
> the collection a bit more ... but really, what is the data being used
> for?

I'm unclear what usage is being made of the halted state info, as it was an
addition to libvirt from a 3rd party. My guess is that it was used to
see if the vCPU had been brought online by the guest OS at all, after being
hotplugged. As you say, the info is pretty useless once the vCPU is actually
online & in use, because it changes so quickly.

> 
> User might care about the state, for obscure reasons, but that isn't a
> performance problem.
> 
> > libvirt has started using the query-cpus command from QEMU. However,
> > if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> > to user-space since they have to issue the KVM_GET_MP_STATE ioctl.
> 
> Libvirt knows if KVM exits to userspace on halts, so it can just query
> QEMU in that case and in the other case, there is a very dirty
> "solution" that works on all architectures right now:
> 
>   grep kvm_vcpu_block /proc/$vcpu_task/stack
> 
> If you get something, the vcpu is halted in KVM.
> 
> > This has catastrophic implications for low-latency workloads like
> > KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> > an OpenStack service called ceilometer that causes libvirt to
> > issue query-cpus every few minutes.
> 
> I'd expect that people running these workloads can set up the system. :(
> 
> I bet that ceilometer just mindlessly collects everything, so we should
> be able to configure libvirt to collect only some stats.  Either libvirt
> or upper layer would decide what is too expensive for its usefulness.
> 
> > The solution proposed in this patch is to export the vCPU
> > halted state in the already existing vcpu directory in sysfs.
> > This way, libvirt can read the vCPU halted state from sysfs and avoid
> > using the query-cpus command. This solution seems to be sufficient
> > for libvirt needs, but it has the following cons:
> > 
> >  * vcpu information in sysfs lives in a debug directory, so
> >    libvirt would be basing its API on debug info
> 
> (It pains me to say there probably already are tools that depend on
>  kvm/debug.)
> 
> It's slightly better than the stack hack, but needs more code in kernel
> and the interface is in a gray compatibility zone, so I'd like to know
> why userspace does that in the first place.
> 
> >  * Currently, only x86 supports the vcpu dir in sysfs, so
> >    we'd have to expand this to other archs (should be doable)
> > 
> > If we agree that this solution is feasible, I'll work on extending
> > the vcpu debug information to other archs for my next posting.
> > 
> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> > ---
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > @@ -6273,6 +6273,7 @@ void kvm_arch_exit(void)
> >  
> >  int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> >  {
> > +	kvm_vcpu_set_halted(vcpu);
> 
> There is no point in caring about !lapic_in_kernel().  I'd move the logic
> into vcpu_block() to be shared among all architectures.
> 
> >  	++vcpu->stat.halt_exits;
> >  	if (lapic_in_kernel(vcpu)) {
> >  		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;

Regards,
Daniel
Daniel P. Berrangé Feb. 2, 2018, 12:49 p.m. UTC | #4
On Thu, Feb 01, 2018 at 12:54:41PM -0500, Luiz Capitulino wrote:
> 
> Libvirt needs to know when a vCPU is halted. To get this information,
> libvirt has started using the query-cpus command from QEMU. However,
> if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> to user-space since they have to issue the KVM_GET_MP_STATE ioctl.
> This has catastrophic implications for low-latency workloads like
> KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> an OpenStack service called ceilometer that causes libvirt to
> issue query-cpus every few minutes.
> 
> The solution proposed in this patch is to export the vCPU
> halted state in the already existing vcpu directory in sysfs.
> This way, libvirt can read the vCPU halted state from sysfs and avoid
> using the query-cpus command. This solution seems to be sufficient
> for libvirt needs, but it has the following cons:
> 
>  * vcpu information in sysfs lives in a debug directory, so
>    libvirt would be basing its API on debug info

Is this part of the regular sysfs mount point, or does it require
debugfs to be mounted separately at /sys/kernel/debug?

>  * Currently, only x86 supports the vcpu dir in sysfs, so
>    we'd have to expand this to other archs (should be doable)

Yep, that would be fairly important


Regards,
Daniel
Luiz Capitulino Feb. 2, 2018, 1:46 p.m. UTC | #5
On Fri, 2 Feb 2018 12:47:50 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
> > 2018-02-01 12:54-0500, Luiz Capitulino:  
> > > 
> > > Libvirt needs to know when a vCPU is halted. To get this information,  
> > 
> > I don't see why upper level management should care about that; a single
> > bit about halted state that can be incorrect at the time it is processed
> > seems of very limited use.
> > 
> > (A much more sensible data point would be the fraction of time when VCPU
> >  was running or runnable, which is roughly what you get by sampling the
> >  halted state.)
> > 
> > A halted vCPU might even be halted in guest mode, so KVM doesn't know
> > about that state (unless you force a VM exit), which would complicate
> > the collection a bit more ... but really, what is the data being used
> > for?  
> 
> I'm unclear what usage is being made of the halted state info, as it was an
> addition to libvirt from a 3rd party. My guess is that it was used to
> see if the vCPU had been brought online by the guest OS at all, after being
> hotplugged. As you say, the info is pretty useless once the vCPU is actually
> online & in use, because it changes so quickly.

I had assumed libvirt wanted to know if the vCPU was idle. Sure there
are other ways to do this, but I also assumed KVM/QEMU wanted to export
this state to upper management. If we don't want to do that, then we'll
have to deprecate the "halted" field in query-cpus in QEMU, since that's
how we got here.

So, I think this is blocked on Viktor feedback to clarify the halted
state usage. I'm CC'ing him again in this thread (although Eduardo
already did that).

> > User might care about the state, for obscure reasons, but that isn't a
> > performance problem.
> >   
> > > libvirt has started using the query-cpus command from QEMU. However,
> > > if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> > > to user-space since they have to issue the KVM_GET_MP_STATE ioctl.  
> > 
> > Libvirt knows if KVM exits to userspace on halts, so it can just query
> > QEMU in that case and in the other case, there is a very dirty
> > "solution" that works on all architectures right now:
> > 
> >   grep kvm_vcpu_block /proc/$vcpu_task/stack
> > 
> > If you get something, the vcpu is halted in KVM.

Well, I'm still not sure it's a good idea to have the halted bit in
debugfs, since that's for debugging and not for building management
APIs. What you propose is even more problematic, since then the API
depends on a KVM-internal function name...

> >   
> > > This has catastrophic implications for low-latency workloads like
> > > KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> > > an OpenStack service called ceilometer that causes libvirt to
> > > issue query-cpus every few minutes.  
> > 
> > I'd expect that people running these workloads can set up the system. :(
> > 
> > I bet that ceilometer just mindlessly collects everything, so we should
> > be able to configure libvirt to collect only some stats.  Either libvirt
> > or upper layer would decide what is too expensive for its usefulness.
> >   
> > > The solution proposed in this patch is to export the vCPU
> > > halted state in the already existing vcpu directory in sysfs.
> > > This way, libvirt can read the vCPU halted state from sysfs and avoid
> > > using the query-cpus command. This solution seems to be sufficient
> > > for libvirt needs, but it has the following cons:
> > > 
> > >  * vcpu information in sysfs lives in a debug directory, so
> > >    libvirt would be basing its API on debug info  
> > 
> > (It pains me to say there probably already are tools that depend on
> >  kvm/debug.)
> > 
> > It's slightly better than the stack hack, but needs more code in kernel
> > and the interface is in a gray compatibility zone, so I'd like to know
> > why userspace does that in the first place.
> >   
> > >  * Currently, only x86 supports the vcpu dir in sysfs, so
> > >    we'd have to expand this to other archs (should be doable)
> > > 
> > > If we agree that this solution is feasible, I'll work on extending
> > > the vcpu debug information to other archs for my next posting.
> > > 
> > > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> > > ---
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > @@ -6273,6 +6273,7 @@ void kvm_arch_exit(void)
> > >  
> > >  int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> > >  {
> > > +	kvm_vcpu_set_halted(vcpu);  
> > 
> > There is no point in caring about !lapic_in_kernel().  I'd move the logic
> > into vcpu_block() to be shared among all architectures.
> >   
> > >  	++vcpu->stat.halt_exits;
> > >  	if (lapic_in_kernel(vcpu)) {
> > >  		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;  
> 
> Regards,
> Daniel
Luiz Capitulino Feb. 2, 2018, 1:49 p.m. UTC | #6
On Fri, 2 Feb 2018 12:49:25 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Thu, Feb 01, 2018 at 12:54:41PM -0500, Luiz Capitulino wrote:
> > 
> > Libvirt needs to know when a vCPU is halted. To get this information,
> > libvirt has started using the query-cpus command from QEMU. However,
> > if the in-kernel irqchip is in use, query-cpus will force all vCPUs
> > to user-space since they have to issue the KVM_GET_MP_STATE ioctl.
> > This has catastrophic implications for low-latency workloads like
> > KVM-RT and zero packet loss with DPDK. To make matters worse, there's
> > an OpenStack service called ceilometer that causes libvirt to
> > issue query-cpus every few minutes.
> > 
> > The solution proposed in this patch is to export the vCPU
> > halted state in the already existing vcpu directory in sysfs.
> > This way, libvirt can read the vCPU halted state from sysfs and avoid
> > using the query-cpus command. This solution seems to be sufficient
> > for libvirt needs, but it has the following cons:
> > 
> >  * vcpu information in sysfs lives in a debug directory, so
> >    libvirt would be basing its API on debug info  
> 
> Is this part of the regular sysfs mount point, or does it require
> debugfs to be mounted separately at /sys/kernel/debug?

Yeah, it depends on debugfs being separately mounted at /sys/kernel/debug.

> >  * Currently, only x86 supports the vcpu dir in sysfs, so
> >    we'd have to expand this to other archs (should be doable)  
> 
> Yep, that would be fairly important

That's doable. The code that exports the vcpu information to sysfs
is pretty arch-independent. Only x86 does it today because it's the
only arch that actually exports anything per vcpu (which is the
tsc-offset).
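
(For reference, the generic/arch split being described, paraphrased
from virt/kvm/kvm_main.c as of 4.15: the per-vCPU debugfs directory is
created in common code, and only the files inside it come from an arch
hook, so extending this mostly means implementing
kvm_arch_create_vcpu_debugfs() for the other arches:)

  static int kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
  {
          char dir_name[ITOA_MAX_LEN * 2];

          if (!kvm_arch_has_vcpu_debugfs() || !debugfs_initialized())
                  return 0;

          snprintf(dir_name, sizeof(dir_name), "vcpu%d", vcpu->vcpu_id);
          vcpu->debugfs_dentry = debugfs_create_dir(dir_name,
                                                    vcpu->kvm->debugfs_dentry);
          if (!vcpu->debugfs_dentry)
                  return -ENOMEM;

          /* the only per-arch piece; a stub everywhere except x86 today */
          return kvm_arch_create_vcpu_debugfs(vcpu);
  }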
Viktor Mihajlovski Feb. 2, 2018, 1:53 p.m. UTC | #7
On 01.02.2018 21:26, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
>> 2018-02-01 12:54-0500, Luiz Capitulino:
>>>
>>> Libvirt needs to know when a vCPU is halted. To get this information,
>>
>> I don't see why upper level management should care about that; a single
>> bit about halted state that can be incorrect at the time it is processed
>> seems of very limited use.
> 
> I don't see why, either.
> 
> I'm CCing libvir-list and the people involved in the code that
> added halt state to libvirt domain statistics.
> 
I'll try to explain the motivation for the "halted" state exposure and
why it ended up in the libvirt domain stats.

s390 CPUs can be present in a system (e.g. after being hotplugged) but
be offline (disabled) in which case they are not used by the operating
system. In Linux disabled CPUs show a value of '0' in
/sys/devices/system/cpu/cpu<n>/online.

Higher level management software (on top of libvirt) can take advantage
of knowing whether a guest CPU is online and thus used or not.
Specifically it might not make sense to plug more CPUs if the guest OS
isn't using the CPUs at all.

A disabled guest CPU is represented as halted in the QEMU object model
and can therefore be identified by the QMP query-cpus command.

The initial patch proposal to expose this via virsh vcpuinfo was not
considered to be desirable because there was a concern that legacy
management software might be confused seeing halted vcpus. Therefore the
state information was added to the cpu domain statistics.

One issue we're facing is that the semantics of "halted" are different
between s390 and at least x86. The question might be whether they are
different enough to warrant a specific "disabled" indicator.

[...]
Luiz Capitulino Feb. 2, 2018, 2:14 p.m. UTC | #8
On Fri, 2 Feb 2018 14:53:50 +0100
Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:

> On 01.02.2018 21:26, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:  
> >> 2018-02-01 12:54-0500, Luiz Capitulino:  
> >>>
> >>> Libvirt needs to know when a vCPU is halted. To get this information,  
> >>
> >> I don't see why upper level management should care about that; a single
> >> bit about halted state that can be incorrect at the time it is processed
> >> seems of very limited use.  
> > 
> > I don't see why, either.
> > 
> > I'm CCing libvir-list and the people involved in the code that
> > added halt state to libvirt domain statistics.
> >   
> I'll try to explain the motivation for the "halted" state exposure and
> why it ended up in the libvirt domain stats.
> 
> s390 CPUs can be present in a system (e.g. after being hotplugged) but
> be offline (disabled) in which case they are not used by the operating
> system. In Linux disabled CPUs show a value of '0' in
> /sys/devices/system/cpu/cpu<n>/online.

If that's all you want, have you considered using the guest agent?

> Higher level management software (on top of libvirt) can take advantage
> of knowing whether a guest CPU is online and thus used or not.
> Specifically it might not make sense to plug more CPUs if the guest OS
> isn't using the CPUs at all.

OK, so what's the algorithm used by the higher level management
software where this all fits together? Something like:

1. Hotplug vCPU
2. Poll "halted" state
3. If "halted" becomes true, hotplug more vCPUs
4. If "halted" never becomes true, don't hotplug more CPUs

If that's the case, then I guess grepping for State in
/proc/qemu-pid/threadid/status will have the same end result, no?
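
(For illustration, a sketch of that check; the canonical path shape is
/proc/<pid>/task/<tid>/status, and a vCPU halted inside KVM sleeps
interruptibly, so it reads back as 'S'. It is only a rough equivalent,
since a vCPU thread can sleep for other reasons too:)

  #include <stdio.h>

  /* Scheduler state letter of a vCPU thread ('R' running,
   * 'S' sleeping, ...), or 0 on error. */
  static char vcpu_thread_state(int qemu_pid, int tid)
  {
          char path[64], line[128], state = 0;
          FILE *f;

          snprintf(path, sizeof(path), "/proc/%d/task/%d/status",
                   qemu_pid, tid);
          f = fopen(path, "r");
          if (!f)
                  return 0;
          while (fgets(line, sizeof(line), f))
                  if (sscanf(line, "State: %c", &state) == 1)
                          break;
          fclose(f);
          return state;
  }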

> 
> A disabled guest CPU is represented as halted in the QEMU object model
> and can therefore be identified by the QMP query-cpus command.
> 
> The initial patch proposal to expose this via virsh vcpuinfo was not
> considered to be desirable because there was a concern that legacy
> management software might be confused seeing halted vcpus. Therefore the
> state information was added to the cpu domain statistics.
> 
> One issue we're facing is that the semantics of "halted" are different
> between s390 and at least x86. The question might be whether they are
> different enough to warrant a specific "disabled" indicator.
> 
> [...]
>
Eduardo Habkost Feb. 2, 2018, 2:15 p.m. UTC | #9
On Fri, Feb 02, 2018 at 02:53:50PM +0100, Viktor Mihajlovski wrote:
> On 01.02.2018 21:26, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
> >> 2018-02-01 12:54-0500, Luiz Capitulino:
> >>>
> >>> Libvirt needs to know when a vCPU is halted. To get this information,
> >>
> >> I don't see why upper level management should care about that; a single
> >> bit about halted state that can be incorrect at the time it is processed
> >> seems of very limited use.
> > 
> > I don't see why, either.
> > 
> > I'm CCing libvir-list and the people involved in the code that
> > added halt state to libvirt domain statistics.
> > 
> I'll try to explain the motivation for the "halted" state exposure and
> why it ended up in the libvirt domain stats.
> 
> s390 CPUs can be present in a system (e.g. after being hotplugged) but
> be offline (disabled) in which case they are not used by the operating
> system. In Linux disabled CPUs show a value of '0' in
> /sys/devices/system/cpu/cpu<n>/online.
> 
> Higher level management software (on top of libvirt) can take advantage
> of knowing whether a guest CPU is online and thus used or not.
> Specifically it might not make sense to plug more CPUs if the guest OS
> isn't using the CPUs at all.

Wasn't this already represented on "vcpu.<n>.state"?  Why is
"vcpu.<n>.halted" needed?

> 
> A disabled guest CPU is represented as halted in the QEMU object model
> and can therefore be identified by the QMP query-cpus command.
> 
> The initial patch proposal to expose this via virsh vcpuinfo was not
> considered to be desirable because there was a concern that legacy
> management software might be confused seeing halted vcpus. Therefore the
> state information was added to the cpu domain statistics.
> 
> One issue we're facing is that the semantics of "halted" are different
> between s390 and at least x86. The question might be whether they are
> different enough to warrant a specific "disabled" indicator.

From your description, it looks like they are completely
different.  On x86, a CPU that is online and in use can be moved
between halted and non-halted state many times a second.

If that's the case, we can probably fix this without breaking
existing code: explicitly documenting the semantics of
"vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
online" (i.e. the s390 semantics, not the x86 one), and making
qemuMonitorGetCpuHalted() s390-specific.

Possibly a better long-term solution is to deprecate
"vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on
s390.

It would be also interesting to update QEMU QMP documentation to
clarify the arch-specific semantics of "halted".
Daniel P. Berrangé Feb. 2, 2018, 2:19 p.m. UTC | #10
On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> On Fri, Feb 02, 2018 at 02:53:50PM +0100, Viktor Mihajlovski wrote:
> > On 01.02.2018 21:26, Eduardo Habkost wrote:
> > > On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
> > >> 2018-02-01 12:54-0500, Luiz Capitulino:
> > >>>
> > >>> Libvirt needs to know when a vCPU is halted. To get this information,
> > >>
> > >> I don't see why upper level management should care about that; a single
> > >> bit about halted state that can be incorrect at the time it is processed
> > >> seems of very limited use.
> > > 
> > > I don't see why, either.
> > > 
> > > I'm CCing libvir-list and the people involved in the code that
> > > added halt state to libvirt domain statistics.
> > > 
> > I'll try to explain the motivation for the "halted" state exposure and
> > why it ended up in the libvirt domain stats.
> > 
> > s390 CPUs can be present in a system (e.g. after being hotplugged) but
> > be offline (disabled) in which case they are not used by the operating
> > system. In Linux disabled CPUs show a value of '0' in
> > /sys/devices/system/cpu/cpu<n>/online.
> > 
> > Higher level management software (on top of libvirt) can take advantage
> > of knowing whether a guest CPU is online and thus used or not.
> > Specifically it might not make sense to plug more CPUs if the guest OS
> > isn't using the CPUs at all.
> 
> Wasn't this already represented on "vcpu.<n>.state"?  Why is
> "vcpu.<n>.halted" needed?
> 
> > 
> > A disabled guest CPU is represented as halted in the QEMU object model
> > and can therefore be identified by the QMP query-cpus command.
> > 
> > The initial patch proposal to expose this via virsh vcpuinfo was not
> > considered to be desirable because there was a concern that legacy
> > management software might be confused seeing halted vcpus. Therefore the
> > state information was added to the cpu domain statistics.
> > 
> > One issue we're facing is that the semantics of "halted" are different
> > between s390 and at least x86. The question might be whether they are
> > different enough to warrant a specific "disabled" indicator.
> 
> From your description, it looks like they are completely
> different.  On x86, a CPU that is online and in use can be moved
> between halted and non-halted state many times a second.
> 
> If that's the case, we can probably fix this without breaking
> existing code: explicitly documenting the semantics of
> "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> online" (i.e. the s390 semantics, not the x86 one), and making
> qemuMonitorGetCpuHalted() s390-specific.
> 
> Possibly a better long-term solution is to deprecate
> "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on
> s390.
> 
> It would be also interesting to update QEMU QMP documentation to
> clarify the arch-specific semantics of "halted".

And also especially clarify the awful performance implications of running
this particular query command. In general I would not expect query-xxx
monitor commands to interrupt all vcpus, so we should clearly warn about
this !

Regards,
Daniel
Luiz Capitulino Feb. 2, 2018, 2:21 p.m. UTC | #11
On Fri, 2 Feb 2018 14:19:38 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> > On Fri, Feb 02, 2018 at 02:53:50PM +0100, Viktor Mihajlovski wrote:  
> > > On 01.02.2018 21:26, Eduardo Habkost wrote:  
> > > > On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:  
> > > >> 2018-02-01 12:54-0500, Luiz Capitulino:  
> > > >>>
> > > >>> Libvirt needs to know when a vCPU is halted. To get this information,  
> > > >>
> > > >> I don't see why upper level management should care about that; a single
> > > >> bit about halted state that can be incorrect at the time it is processed
> > > >> seems of very limited use.  
> > > > 
> > > > I don't see why, either.
> > > > 
> > > > I'm CCing libvir-list and the people involved in the code that
> > > > added halt state to libvirt domain statistics.
> > > >   
> > > I'll try to explain the motivation for the "halted" state exposure and
> > > why it ended up in the libvirt domain stats.
> > > 
> > > s390 CPUs can be present in a system (e.g. after being hotplugged) but
> > > be offline (disabled) in which case they are not used by the operating
> > > system. In Linux disabled CPUs show a value of '0' in
> > > /sys/devices/system/cpu/cpu<n>/online.
> > > 
> > > Higher level management software (on top of libvirt) can take advantage
> > > of knowing whether a guest CPU is online and thus used or not.
> > > Specifically it might not make sense to plug more CPUs if the guest OS
> > > isn't using the CPUs at all.  
> > 
> > Wasn't this already represented on "vcpu.<n>.state"?  Why is
> > "vcpu.<n>.halted" needed?
> >   
> > > 
> > > A disabled guest CPU is represented as halted in the QEMU object model
> > > and can therefore be identified by the QMP query-cpus command.
> > > 
> > > The initial patch proposal to expose this via virsh vcpuinfo was not
> > > considered to be desirable because there was a concern that legacy
> > > management software might be confused seeing halted vcpus. Therefore the
> > > state information was added to the cpu domain statistics.
> > > 
> > > One issue we're facing is that the semantics of "halted" are different
> > > between s390 and at least x86. The question might be whether they are
> > > different enough to warrant a specific "disabled" indicator.
> > 
> > From your description, it looks like they are completely
> > different.  On x86, a CPU that is online and in use can be moved
> > between halted and non-halted state many times a second.
> > 
> > If that's the case, we can probably fix this without breaking
> > existing code: explicitly documenting the semantics of
> > "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> > online" (i.e. the s390 semantics, not the x86 one), and making
> > qemuMonitorGetCpuHalted() s390-specific.
> > 
> > Possibly a better long-term solution is to deprecate
> > "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on
> > s390.
> > 
> > It would be also interesting to update QEMU QMP documentation to
> > clarify the arch-specific semantics of "halted".  
> 
> And also especially clarify the awful performance implications of running
> this particular query command. In general I would not expect query-xxx
> monitor commands to interrupt all vcpus, so we should clearly warn about
> this !

Or deprecate it...
Eduardo Habkost Feb. 2, 2018, 2:50 p.m. UTC | #12
(CCing qemu-devel)

On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> On Fri, 2 Feb 2018 14:19:38 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
[...]
> > > It would be also interesting to update QEMU QMP documentation to
> > > clarify the arch-specific semantics of "halted".  
> > 
> > > And also especially clarify the awful performance implications of running
> > this particular query command. In general I would not expect query-xxx
> > monitor commands to interrupt all vcpus, so we should clearly warn about
> > this !
> 
> Or deprecate it...

We could deprecate the expensive fields on query-cpus, and move
them to a more expensive query-cpu-state command.  I believe most
users of query-cpus are only interested in qom_path, thread_id,
and topology info.

Markus, Eric: from the QAPI point of view, is it OK to remove
fields between QEMU versions, as long as we follow our
deprecation policy?
Luiz Capitulino Feb. 2, 2018, 2:55 p.m. UTC | #13
On Fri, 2 Feb 2018 12:50:14 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> (CCing qemu-devel)
> 
> On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 14:19:38 +0000
> > Daniel P. Berrangé <berrange@redhat.com> wrote:  
> > > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:  
> [...]
> > > > It would be also interesting to update QEMU QMP documentation to
> > > > clarify the arch-specific semantics of "halted".    
> > > 
> > > And also especially clarify the awful performance implications of running
> > > this particular query command. In general I would not expect query-xxx
> > > monitor commands to interrupt all vcpus, so we should clearly warn about
> > > this !  
> > 
> > Or deprecate it...  
> 
> We could deprecate the expensive fields on query-cpus, and move
> them to a more expensive query-cpu-state command.  I believe most
> users of query-cpus are only interested in qom_path, thread_id,
> and topology info.

Agreed. The only thing I'm unsure about is: is the performance
issue only present on x86? If so, do we deprecate it only
for x86 or for all archs? Maybe for all archs, otherwise this has
the potential to turn into a mess.

> Markus, Eric: from the QAPI point of view, is it OK to remove
> fields between QEMU versions, as long as we follow our
> deprecation policy?

I guess we can't remove fields, but maybe we could always return
"running" and skip interrupting the vCPU threads.
Daniel P. Berrangé Feb. 2, 2018, 3:07 p.m. UTC | #14
On Fri, Feb 02, 2018 at 12:50:14PM -0200, Eduardo Habkost wrote:
> (CCing qemu-devel)
> 
> On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 14:19:38 +0000
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> [...]
> > > > It would be also interesting to update QEMU QMP documentation to
> > > > clarify the arch-specific semantics of "halted".  
> > > 
> > > And also especially clarify the awful performance implications of running
> > > this particular query command. In general I would not expect query-xxx
> > > monitor commands to interrupt all vcpus, so we should clearly warn about
> > > this !
> > 
> > Or deprecate it...
> 
> We could deprecate the expensive fields on query-cpus, and move
> them to a more expensive query-cpu-state command.  I believe most
> users of query-cpus are only interested in qom_path, thread_id,
> and topology info.
> 
> Markus, Eric: from the QAPI point of view, is it OK to remove
> fields between QEMU versions, as long as we follow our
> deprecation policy?

I would expect that to not be OK.  A fully backwards compatible way to
deal with this would just be to add a flag to the query-cpus command
e.g. something like

    query-cpus arch-specific=false

to turn off all this arch specific state, and just report the cheap
generic info. If it defaults to arch-specific=true when omitted, then
there's no compat problems.

Regards,
Daniel
Viktor Mihajlovski Feb. 2, 2018, 3:08 p.m. UTC | #15
On 02.02.2018 15:15, Eduardo Habkost wrote:
> On Fri, Feb 02, 2018 at 02:53:50PM +0100, Viktor Mihajlovski wrote:
>> On 01.02.2018 21:26, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
>>>> 2018-02-01 12:54-0500, Luiz Capitulino:
>>>>>
>>>>> Libvirt needs to know when a vCPU is halted. To get this information,
>>>>
>>>> I don't see why upper level management should care about that; a single
>>>> bit about halted state that can be incorrect at the time it is processed
>>>> seems of very limited use.
>>>
>>> I don't see why, either.
>>>
>>> I'm CCing libvir-list and the people involved in the code that
>>> added halt state to libvirt domain statistics.
>>>
>> I'll try to explain the motivation for the "halted" state exposure and
>> why it ended up in the libvirt domain stats.
>>
>> s390 CPUs can be present in a system (e.g. after being hotplugged) but
>> be offline (disabled) in which case they are not used by the operating
>> system. In Linux disabled CPUs show a value of '0' in
>> /sys/devices/system/cpu/cpu<n>/online.
>>
>> Higher level management software (on top of libvirt) can take advantage
>> of knowing whether a guest CPU is online and thus used or not.
>> Specifically it might not make sense to plug more CPUs if the guest OS
>> isn't using the CPUs at all.
> 
> Wasn't this already represented on "vcpu.<n>.state"?  Why is
> "vcpu.<n>.halted" needed?
The state would match that of vcpuinfo, and there was consensus not to
change it (on x86 the CPU is in state running, even if halted).
> 
>>
>> A disabled guest CPU is represented as halted in the QEMU object model
>> and can therefore be identified by the QMP query-cpus command.
>>
>> The initial patch proposal to expose this via virsh vcpuinfo was not
>> considered to be desirable because there was a concern that legacy
>> management software might be confused seeing halted vcpus. Therefore the
>> state information was added to the cpu domain statistics.
>>
>> One issue we're facing is that the semantics of "halted" are different
>> between s390 and at least x86. The question might be whether they are
>> different enough to warrant a specific "disabled" indicator.
> 
> From your description, it looks like they are completely
> different.  On x86, a CPU that is online and in use can be moved
> between halted and non-halted state many times a second.
> 
> If that's the case, we can probably fix this without breaking
> existing code: explicitly documenting the semantics of
> "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> online" (i.e. the s390 semantics, not the x86 one), and making
> qemuMonitorGetCpuHalted() s390-specific.
> 
> Possibly a better long-term solution is to deprecate
> "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on
> s390.
As it seems that nobody was ever *really* interested in x86.halted, one
could also return 0 unconditionally there (and for other
expensive-to-query arches)?
> It would be also interesting to update QEMU QMP documentation to
> clarify the arch-specific semantics of "halted".
>
Luiz Capitulino Feb. 2, 2018, 3:22 p.m. UTC | #16
On Fri, 2 Feb 2018 16:08:25 +0100
Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:

> >> A disabled guest CPU is represented as halted in the QEMU object model
> >> and can therefore be identified by the QMP query-cpus command.
> >>
> >> The initial patch proposal to expose this via virsh vcpuinfo was not
> >> considered to be desirable because there was a concern that legacy
> >> management software might be confused seeing halted vcpus. Therefore the
> >> state information was added to the cpu domain statistics.
> >>
> >> One issue we're facing is that the semantics of "halted" are different
> >> between s390 and at least x86. The question might be whether they are
> >> different enough to warrant a specific "disabled" indicator.
> > 
> > From your description, it looks like they are completely
> > different.  On x86, a CPU that is online and in use can be moved
> > between halted and non-halted state many times a second.
> > 
> > If that's the case, we can probably fix this without breaking
> > existing code: explicitly documenting the semantics of
> > "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> > online" (i.e. the s390 semantics, not the x86 one), and making
> > qemuMonitorGetCpuHalted() s390-specific.
> > 
> > Possibly a better long-term solution is to deprecate
> > "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on  
> > s390.
> As it seems that nobody was ever *really* interested in x86.halted, one
> could also return 0 unconditionally there (and for other
> expensive-to-query arches)?

The most important question I have is: does this solution satisfy the
needs of upper management? That is, if we implement the solution suggested
by Eduardo then the feature of automatically hotplugging more CPUs
will only work for s390. Is this OK?

If yes, then I think this is the best solution. And the next question
would be: Viktor, can you change this in libvirt while we fix query-cpus
in QEMU?

Btw, I guess OpenStack ran into this issue just because this field
slipped into domstats API and ceilometer issues that command...

> > It would be also interesting to update QEMU QMP documentation to
> > clarify the arch-specific semantics of "halted".
> >   
> 
>
Eduardo Habkost Feb. 2, 2018, 3:25 p.m. UTC | #17
On Fri, Feb 02, 2018 at 03:07:18PM +0000, Daniel P. Berrangé wrote:
> On Fri, Feb 02, 2018 at 12:50:14PM -0200, Eduardo Habkost wrote:
> > (CCing qemu-devel)
> > 
> > On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> > > On Fri, 2 Feb 2018 14:19:38 +0000
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> > [...]
> > > > > It would be also interesting to update QEMU QMP documentation to
> > > > > clarify the arch-specific semantics of "halted".  
> > > > 
> > > And also especially clarify the awful performance implications of running
> > > > this particular query command. In general I would not expect query-xxx
> > > > monitor commands to interrupt all vcpus, so we should clearly warn about
> > > > this !
> > > 
> > > Or deprecate it...
> > 
> > We could deprecate the expensive fields on query-cpus, and move
> > them to a more expensive query-cpu-state command.  I believe most
> > users of query-cpus are only interested in qom_path, thread_id,
> > and topology info.
> > 
> > Markus, Eric: from the QAPI point of view, is it OK to remove
> > fields between QEMU versions, as long as we follow our
> > deprecation policy?
> 
> I would expect that to not be OK.  A fully backwards compatible way to
> deal with this would just be to add a flag to the query-cpus command
> e.g. something like
> 
>     query-cpus arch-specific=false
> 
> to turn off all this arch specific state, and just report the cheap
> generic info. If it defaults to arch-specific=true when omitted, then
> there's no compat problems.

This would work, too.  I would name it "full-state",
"extended-state" or something similar, though.  Not all
arch-specific data is expensive to fetch, and not all
non-arch-specific data is inexpensive.

But I'd like to confirm if it's OK to make existing non-optional
struct fields optional in the QAPI schema.  Markus, Eric?
Viktor Mihajlovski Feb. 2, 2018, 3:51 p.m. UTC | #18
On 02.02.2018 16:22, Luiz Capitulino wrote:
> On Fri, 2 Feb 2018 16:08:25 +0100
> Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:
> 
>>>> A disabled guest CPU is represented as halted in the QEMU object model
>>>> and can therefore be identified by the QMP query-cpus command.
>>>>
>>>> The initial patch proposal to expose this via virsh vcpuinfo was not
>>>> considered to be desirable because there was a concern that legacy
>>>> management software might be confused seeing halted vcpus. Therefore the
>>>> state information was added to the cpu domain statistics.
>>>>
>>>> One issue we're facing is that the semantics of "halted" are different
>>>> between s390 and at least x86. The question might be whether they are
>>>> different enough to warrant a specific "disabled" indicator.
>>>
>>> From your description, it looks like they are completely
>>> different.  On x86, a CPU that is online and in use can be moved
>>> between halted and non-halted state many times a second.
>>>
>>> If that's the case, we can probably fix this without breaking
>>> existing code: explicitly documenting the semantics of
>>> "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
>>> online" (i.e. the s390 semantics, not the x86 one), and making
>>> qemuMonitorGetCpuHalted() s390-specific.
>>>
>>> Possibly a better long-term solution is to deprecate
>>> "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on  
>>> s390.
>> As it seems that nobody was ever *really* interested in x86.halted, one
>> could also return 0 unconditionally there (and for other
>> expensive-to-query arches)?
> 
> The most important question I have is: does this solution satisfy the
> needs of upper management? That is, if we implement the solution suggested
> by Eduardo then the feature of automatically hotplugging more CPUs
> will only work for s390. Is this OK?
> 
> If yes, then I think this is the best solution. And the next question
> would be: Viktor, can you change this in libvirt while we fix query-cpus
> in QEMU?
> 
The latest proposal was to use a flag for query-cpus (like full-state)
which would control the set of properties queried and reported. If this
is the way we decide to go, I can make the necessary changes in libvirt.
> Btw, I guess OpenStack ran into this issue just because this field
> slipped into domstats API and ceilometer issues that command...
> 
>>> It would be also interesting to update QEMU QMP documentation to
>>> clarify the arch-specific semantics of "halted".
>>>   
>>
>>
>
Daniel P. Berrangé Feb. 2, 2018, 3:54 p.m. UTC | #19
On Fri, Feb 02, 2018 at 04:51:23PM +0100, Viktor Mihajlovski wrote:
> On 02.02.2018 16:22, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 16:08:25 +0100
> > Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:
> > 
> >>>> A disabled guest CPU is represented as halted in the QEMU object model
> >>>> and can therefore be identified by the QMP query-cpus command.
> >>>>
> >>>> The initial patch proposal to expose this via virsh vcpuinfo was not
> >>>> considered to be desirable because there was a concern that legacy
> >>>> management software might be confused seeing halted vcpus. Therefore the
> >>>> state information was added to the cpu domain statistics.
> >>>>
> >>>> One issue we're facing is that the semantics of "halted" are different
> >>>> between s390 and at least x86. The question might be whether they are
> >>>> different enough to warrant a specific "disabled" indicator.
> >>>
> >>> From your description, it looks like they are completely
> >>> different.  On x86, a CPU that is online and in use can be moved
> >>> between halted and non-halted state many times a second.
> >>>
> >>> If that's the case, we can probably fix this without breaking
> >>> existing code: explicitly documenting the semantics of
> >>> "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> >>> online" (i.e. the s390 semantics, not the x86 one), and making
> >>> qemuMonitorGetCpuHalted() s390-specific.
> >>>
> >>> Possibly a better long-term solution is to deprecate
> >>> "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on  
> >>> s390.
> >> As it seems that nobody was ever *really* interested in x86.halted, one
> >> could also return 0 unconditionally there (and for other
> >> expensive-to-query arches)?
> > 
> > The most important question I have is: does this solution satisfy the
> > needs of upper management? That is, if we implement the solution suggested
> > by Eduardo then the feature of automatically hotplugging more CPUs
> > will only work for s390. Is this OK?
> > 
> > If yes, then I think this is the best solution. And the next question
> > would be: Viktor, can you change this in libvirt while we fix query-cpus
> > in QEMU?
> > 
> The latest proposal was to use a flag for query-cpus (like full-state)
> which would control the set of properties queried and reported. If this
> is the way we decide to go, I can make the necessary changes in libvirt.

Regardless of whether we add that flag to query-cpus or not, we still have
the general problem of making the cross-architecture semantics more
sane.

Regards,
Daniel
Luiz Capitulino Feb. 2, 2018, 3:55 p.m. UTC | #20
On Fri, 2 Feb 2018 16:51:23 +0100
Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:

> On 02.02.2018 16:22, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 16:08:25 +0100
> > Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:
> >   
> >>>> A disabled guest CPU is represented as halted in the QEMU object model
> >>>> and can therefore be identified by the QMP query-cpus command.
> >>>>
> >>>> The initial patch proposal to expose this via virsh vcpuinfo was not
> >>>> considered to be desirable because there was a concern that legacy
> >>>> management software might be confused seeing halted vcpus. Therefore the
> >>>> state information was added to the cpu domain statistics.
> >>>>
> >>>> One issue we're facing is that the semantics of "halted" are different
> >>>> between s390 and at least x86. The question might be whether they are
> >>>> different enough to warrant a specific "disabled" indicator.
> >>>
> >>> From your description, it looks like they are completely
> >>> different.  On x86, a CPU that is online and in use can be moved
> >>> between halted and non-halted state many times a second.
> >>>
> >>> If that's the case, we can probably fix this without breaking
> >>> existing code: explicitly documenting the semantics of
> >>> "vcpu.<n>.halted" at virConnectGetAllDomainStats() to mean "not
> >>> online" (i.e. the s390 semantics, not the x86 one), and making
> >>> qemuMonitorGetCpuHalted() s390-specific.
> >>>
> >>> Possibly a better long-term solution is to deprecate
> >>> "vcpu.<n>.halted" and make "vcpu.<n>.state" work correctly on    
> >>> s390.
> >> As it seems that nobody was ever *really* interested in x86.halted, one
> >> could also return 0 unconditionally there (and for other
> >> expensive-to-query arches)?  
> > 
> > The most important question I have is: does this solution satisfy the
> > needs of upper management? That is, if we implement the solution suggested
> > by Eduardo then the feature of automatically hotplugging more CPUs
> > will only work for s390. Is this OK?
> > 
> > If yes, then I think this is the best solution. And the next question
> > would be: Viktor, can you change this in libvirt while we fix query-cpus
> > in QEMU?
> >   
> The latest proposal was to use a flag for query-cpus (like full-state)
> which would control the set of properties queried and reported. If this
> is the way we decide to go, I can make the necessary changes in libvirt.

OK, I thought we were going to do both. Because, if libvirt only wants
the halted field for s390, why issue query-cpus at all on other archs?

> > Btw, I guess OpenStack ran into this issue just because this field
> > slipped into domstats API and ceilometer issues that command...
> >   
> >>> It would be also interesting to update QEMU QMP documentation to
> >>> clarify the arch-specific semantics of "halted".
> >>>     
> >>
> >>  
> >   
> 
>
Luiz Capitulino Feb. 2, 2018, 4:01 p.m. UTC | #21
On Fri, 2 Feb 2018 15:54:15 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> > > The most important question I have is: does this solution satisfy the
> > > needs of upper management? That is, if we implement the solution suggested
> > > by Eduardo then the feature of automatically hotplugging more CPUs
> > > will only work for s390. Is this OK?
> > > 
> > > If yes, then I think this is the best solution. And the next question
> > > would be: Viktor, can you change this in libvirt while we fix query-cpus
> > > in QEMU?
> > >   
> > The latest proposal was to use a flag for query-cpus (like full-state)
> > which would control the set of properties queried and reported. If this
> > is the way we decide to go, I can make the necessary changes in libvirt.  
> 
> Regardless of whether we add that flag to query-cpus or not, we still have
> > the general problem of making the cross-architecture semantics more
> > sane.

Let's do both then:

 o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt (a possible
   shape of that guard is sketched below). This by itself fixes the
   original performance issue

 o Deprecate the "halted" field in query-cpus in QEMU. This fixes new
   instances of this same problem
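
(A minimal sketch of the first item, assuming libvirt's existing
ARCH_IS_S390() macro; the exact call site and job argument are
assumptions:)

  /* "halted" only has the stable "vCPU is offline" meaning on s390,
   * so skip the expensive query-cpus refresh everywhere else */
  if (ARCH_IS_S390(vm->def->os.arch) &&
      qemuDomainRefreshVcpuHalted(driver, vm, QEMU_ASYNC_JOB_NONE) < 0)
          goto cleanup;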
Luiz Capitulino Feb. 2, 2018, 4:07 p.m. UTC | #22
On Fri, 2 Feb 2018 11:01:37 -0500
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> On Fri, 2 Feb 2018 15:54:15 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > > > The most important question I have is: does this solution satisfy the
> > > > needs of upper management? That is, if we implement the solution suggested
> > > > by Eduardo then the feature of automatically hotplugging more CPUs
> > > > will only work for s390. Is this OK?
> > > > 
> > > > If yes, then I think this is the best solution. And the next question
> > > > would be: Viktor, can you change this in libvirt while we fix query-cpus
> > > > in QEMU?
> > > >     
> > > The latest proposal was to use a flag for query-cpus (like full-state)
> > > which would control the set of properties queried and reported. If this
> > > is the way we decide to go, I can make the necessary changes in libvirt.    
> > 
> > Regardless of whether we add that flag to query-cpus or not, we still have
> > the general problem of making the cross-architecture semantics more
> > sane.
> 
> Let's do both then:
> 
>  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
>    itself fixes the original performance issue
> 
>  o Deprecate the "halted" field in query-cpus in QEMU. This fixes new
>    instances of this same problem

Btw, let me emphasize that I think the "halted" field has to be
deprecated from the default query-cpus output, otherwise new instances
of this issue are possible.
Viktor Mihajlovski Feb. 2, 2018, 4:19 p.m. UTC | #23
On 02.02.2018 17:01, Luiz Capitulino wrote:
> On Fri, 2 Feb 2018 15:54:15 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
>>>> The most important question I have is: does this solution satisfy the
>>>> needs of upper management? That is, if we implement the solution suggested
>>>> by Eduardo then the feature of automatically hotplugging more CPUs
>>>> will only work for s390. Is this OK?
>>>>
>>>> If yes, then I think this is the best solution. And the next question
>>>> would be: Viktor, can you change this in libvirt while we fix query-cpus
>>>> in QEMU?
>>>>   
>>> The latest proposal was to use a flag for query-cpus (like full-state)
>>> which would control the set of properties queried and reported. If this
>>> is the way we decide to go, I can make the necessary changes in libvirt.  
>>
>> Regardless of whether we add that flag to query-cpus or not, we still have
>> the general problem of solving the cross-architecture semantics to be
>> more sane.
> 
> Let's do both then:
> 
>  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
>    itself fixes the original performance issue
We are normally trying to avoid architecture-specific code in libvirt
(not always successfully). We could omit the call, based on a QEMU
Capability derived from the presence of said flag. This would change the
libvirt-client side default to not report halted. A client can then still
request the value via a tbd libvirt flag. Which is what an s390-aware
management app would have to do...
> 
>  o Deprecate the "halted" field in query-cpus in QEMU. This fixes new
>    instances of this same problem
> 
That is probably OK, if we can get a replacement (e.g. a new state).
Dr. David Alan Gilbert Feb. 2, 2018, 5:23 p.m. UTC | #24
* Eduardo Habkost (ehabkost@redhat.com) wrote:
> (CCing qemu-devel)
> 
> On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 14:19:38 +0000
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> [...]
> > > > It would be also interesting to update QEMU QMP documentation to
> > > > clarify the arch-specific semantics of "halted".  
> > > 
> > > And also especially clarify the awful performance implications of running
> > > this particular query command. In general I would not expect query-xxx
> > > monitor commands to interrupt all vcpus, so we should clearly warn about
> > > this !
> > 
> > Or deprecate it...
> 
> We could deprecate the expensive fields on query-cpus, and move
> them to a more expensive query-cpu-state command.  I believe most
> users of query-cpus are only interested in qom_path, thread_id,
> and topology info.

Would that data be available without the bql?  I ask because if it is
then a small advantage to having a separate command is that the command
could be marked OOB with Peter's new series and never take the lock.

Dave

> Markus, Eric: from the QAPI point of view, is it OK to remove
> fields between QEMU versions, as long as we follow our
> deprecation policy?
> 
> -- 
> Eduardo
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Eduardo Habkost Feb. 2, 2018, 5:38 p.m. UTC | #25
On Fri, Feb 02, 2018 at 05:23:59PM +0000, Dr. David Alan Gilbert wrote:
> * Eduardo Habkost (ehabkost@redhat.com) wrote:
> > (CCing qemu-devel)
> > 
> > On Fri, Feb 02, 2018 at 09:21:59AM -0500, Luiz Capitulino wrote:
> > > On Fri, 2 Feb 2018 14:19:38 +0000
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > On Fri, Feb 02, 2018 at 12:15:54PM -0200, Eduardo Habkost wrote:
> > [...]
> > > > > It would be also interesting to update QEMU QMP documentation to
> > > > > clarify the arch-specific semantics of "halted".  
> > > > 
> > > > And also especially clarify the awful performance implications of running
> > > > this particular query command. In general I would not expect query-xxx
> > > > monitor commands to interrupt all vcpus, so we should clearly warn about
> > > > this !
> > > 
> > > Or deprecate it...
> > 
> > We could deprecate the expensive fields on query-cpus, and move
> > them to a more expensive query-cpu-state command.  I believe most
> > users of query-cpus are only interested in qom_path, thread_id,
> > and topology info.
> 
> Would that data be available without the bql?  I ask because if it is
> then a small advantage to having a separate command is that the command
> could be marked OOB with Peter's new series and never take the lock.

We would need a mechanism to safely walk the CPU list without the
BQL, so I wouldn't bother trying to make an OOB-capable version of
query-cpus unless really necessary.
Eduardo Habkost Feb. 2, 2018, 5:42 p.m. UTC | #26
On Fri, Feb 02, 2018 at 05:19:34PM +0100, Viktor Mihajlovski wrote:
> On 02.02.2018 17:01, Luiz Capitulino wrote:
[...]
> >  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
> >    itself fixes the original performance issue
> We are normally trying to avoid architecture-specific code in libvirt
> (not always successfully). We could omit the call, based on a QEMU
> Capability derived from the presence of said flag. This would change the
> libvirt-client side default to not report halted. A client can then still
> request the value via a tbd libvirt flag. Which is what an s390-aware
> management app would have to do...

The problem I see here is that the current semantics of the
"halted" field in QEMU is arch-specific, so either libvirt or
upper layers will necessarily need arch-specific code if they
want to support QEMU 2.11 or older.
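
As an illustration of the arch-specific consumer code this implies, here
is a minimal Python sketch. The per-arch readings paraphrase points made
elsewhere in this thread; the arch labels are illustrative placeholders,
not QEMU's exact enum values:

  def describe_halted(arch: str, halted: bool) -> str:
      # "halted" does not mean the same thing on every target.
      if arch == "s390":
          # s390 wires "halted" to the architected disabled wait state:
          # the CPU stays inactive until it is explicitly restarted.
          return "stopped (disabled wait)" if halted else "operating"
      if arch == "x86":
          # With the in-kernel irqchip, QEMU may not even know the vCPU
          # is sitting in HLT unless it forces the vCPU to user space.
          return "idle in HLT (best effort)" if halted else "running"
      return "unknown/arch-specific"

  print(describe_halted("s390", True))   # stopped (disabled wait)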
Luiz Capitulino Feb. 2, 2018, 6:50 p.m. UTC | #27
On Fri, 2 Feb 2018 15:42:49 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Fri, Feb 02, 2018 at 05:19:34PM +0100, Viktor Mihajlovski wrote:
> > On 02.02.2018 17:01, Luiz Capitulino wrote:  
> [...]
> > >  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
> > >    itself fixes the original performance issue  
> > We are normally trying to avoid architecture-specific code in libvirt
> > (not always successfully). We could omit the call, based on a QEMU
> > Capability derived from the presence of said flag. This would change the
> > libvirt-client side default to not report halted. A client can then still
> > request the value via a tbd libvirt flag. Which is what an s390-aware
> > management app would have to do...  
> 
> The problem I see here is that the current semantics of the
> "halted" field in QEMU is arch-specific, so either libvirt or
> upper layers will necessarily need arch-specific code if they
> want to support QEMU 2.11 or older.

My understanding of this plan is:

1. Deprecate the "halted" field in query-cpus (that is, make it
   always return halted=false)

2. Add a new command, say query-cpu-state, which is arch dependent
   and is only available in archs that support sane "halted"
   semantics (I guess we can have per-arch QMP commands, right?)

3. Modify libvirt to use query-cpu-state if it's available,
   otherwise use query-cpus (in which case "halted" will be bogus,
   but that's a feature :) )

In essence, we're moving the arch-specific code from libvirt to
qemu.
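
As a rough sketch of step 3, a client could probe for the new command and
fall back to query-cpus. This is illustrative only: "query-cpu-state" is
just the name proposed above, and the QMP plumbing below is simplified
(asynchronous events that may interleave with responses are ignored):

  import json
  import socket

  def qmp_open(path):
      # Connect to a QMP socket (e.g. started with
      # -qmp unix:/tmp/qmp.sock,server,nowait), read the greeting and
      # negotiate capabilities before issuing commands.
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      sock.connect(path)
      f = sock.makefile("rw")
      json.loads(f.readline())                     # server greeting
      f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
      f.flush()
      json.loads(f.readline())                     # capabilities ack
      return f

  def qmp(f, command):
      f.write(json.dumps({"execute": command}) + "\n")
      f.flush()
      return json.loads(f.readline())

  f = qmp_open("/tmp/qmp.sock")
  resp = qmp(f, "query-cpu-state")                 # hypothetical new command
  if "error" in resp:
      # Older QEMU: fall back to query-cpus, accepting that "halted" is
      # bogus there (and, on x86, expensive to produce).
      resp = qmp(f, "query-cpus")
  print(resp.get("return"))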
Eduardo Habkost Feb. 2, 2018, 8:09 p.m. UTC | #28
On Fri, Feb 02, 2018 at 01:50:33PM -0500, Luiz Capitulino wrote:
> On Fri, 2 Feb 2018 15:42:49 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
> > On Fri, Feb 02, 2018 at 05:19:34PM +0100, Viktor Mihajlovski wrote:
> > > On 02.02.2018 17:01, Luiz Capitulino wrote:  
> > [...]
> > > >  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
> > > >    itself fixes the original performance issue  
> > > We are normally trying to avoid architecture-specific code in libvirt
> > > (not always successfully). We could omit the call, based on a QEMU
> > > Capability derived from the presence of said flag. This would change the
> > > libvirt-client side default to not report halted. A client can then still
> > > request the value via a tbd libvirt flag. Which is what an s390-aware
> > > management app would have to do...  
> > 
> > The problem I see here is that the current semantics of the
> > "halted" field in QEMU is arch-specific, so either libvirt or
> > upper layers will necessarily need arch-specific code if they
> > want to support QEMU 2.11 or older.
> 
> My understanding of this plan is:
> 
> 1. Deprecate the "halted" field in query-cpus (that is, make it
>    always return halted=false)

I don't think we really need to do this.  If we do, this should
be the last step (after libvirt is already using the new
interfaces).


> 
> 2. Add a new command, say query-cpu-state, which is arch dependent
>    and is only available in archs that support sane "halted"
>    semantics (I guess we can have per-arch QMP commands, right?)

I don't see why we would make the new command arch-dependent.  We
need two new interfaces:

1) A lightweight version of query-cpus that won't interrupt the
   VCPUs.  This can be a new command, or a new parameter to
   query-cpus.
2) An arch-independent way to query "CPU is online" state, as the
   existing "halted" field has confusing arch-specific semantics.

> 
> 3. Modify libvirt to use query-cpu-state if it's available,
>    otherwise use query-cpus (in which case "halted" will be bogus,
>    but that's a feature :) )
> 
> In essence, we're moving the arch-specific code from libvirt to
> qemu.

Your plan above covers what will happen when using newer QEMU
versions, but libvirt still needs to work sanely if running QEMU
2.11.  My suggestion is that libvirt do not run query-cpus to ask
for the "halted" field on any architecture except s390.
Luiz Capitulino Feb. 2, 2018, 8:19 p.m. UTC | #29
On Fri, 2 Feb 2018 18:09:12 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Fri, Feb 02, 2018 at 01:50:33PM -0500, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 15:42:49 -0200
> > Eduardo Habkost <ehabkost@redhat.com> wrote:
> >   
> > > On Fri, Feb 02, 2018 at 05:19:34PM +0100, Viktor Mihajlovski wrote:  
> > > > On 02.02.2018 17:01, Luiz Capitulino wrote:    
> > > [...]  
> > > > >  o Make qemuDomainRefreshVcpuHalted() s390-only in libvirt. This by
> > > > >    itself fixes the original performance issue    
> > > > We are normally trying to avoid architecture-specific code in libvirt
> > > > (not always successfully). We could omit the call, based on a QEMU
> > > > Capability derived from the presence of said flag. This would change the
> > > > libvirt-client side default to not report halted. A client can then still
> > > > request the value via a tbd libvirt flag. Which is what an s390-aware
> > > > management app would have to do...    
> > > 
> > > The problem I see here is that the current semantics of the
> > > "halted" field in QEMU is arch-specific, so either libvirt or
> > > upper layers will necessarily need arch-specific code if they
> > > want to support QEMU 2.11 or older.  
> > 
> > My understanding of this plan is:
> > 
> > 1. Deprecate the "halted" field in query-cpus (that is, make it
> >    always return halted=false)  
> 
> I don't think we really need to do this.  If we do, this should
> be the last step (after libvirt is already using the new
> interfaces).

Yeah, I've just started taking a look at how to implement this
plan and my first conclusion was to leave the current query-cpus alone.

> > 2. Add a new command, say query-cpu-state, which is arch dependent
> >    and is only available in archs that support sane "halted"
> >    semantics (I guess we can have per-arch QMP commands, right?)  
> 
> I don't see why we would make the new command arch-dependent.  We
> need two new interfaces:
> 
> 1) A lightweight version of query-cpus that won't interrupt the
>    VCPUs.  This can be a new command, or a new parameter to
>    query-cpus.

Exactly how I thought of doing it. I prefer a new command so that
we untangle this from query-cpus.

> 2) An arch-independent way to query "CPU is online" state, as the
>    existing "halted" field has confusing arch-specific semantics.

Honest question: is it at all possible for QEMU to know whether a CPU
is online? My impression is that the halted thing is the best
we can do. If we need better than this, then libvirt should use
the guest agent instead.

Btw, I haven't checked all archs yet, but I'm under the impression
the halted thing is confusing in x86 because of the in kernel irqchip.
Otherwise this state is maintained by qemu itself.

> > 3. Modify libvirt to use query-cpu-state if it's available,
> >    otherwise use query-cpus (in which case "halted" will be bogus,
> >    but that's a feature :) )
> > 
> > In essence, we're moving the arch-specific code from libvirt to
> > qemu.  
> 
> Your plan above covers what will happen when using newer QEMU
> versions, but libvirt still needs to work sanely if running QEMU
> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> for the "halted" field on any architecture except s390.

My current plan is to ask libvirt to completely remove query-cpus
usage, independent of the arch and use the new command instead.
Eduardo Habkost Feb. 2, 2018, 8:41 p.m. UTC | #30
On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:
> On Fri, 2 Feb 2018 18:09:12 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
[...]
> > Your plan above covers what will happen when using newer QEMU
> > versions, but libvirt still needs to work sanely if running QEMU
> > 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> > for the "halted" field on any architecture except s390.
> 
> My current plan is to ask libvirt to completely remove query-cpus
> usage, independent of the arch and use the new command instead.

This would be a regression for people running QEMU 2.11 on s390.

(But maybe it would be an acceptable regression?  Viktor, what do
you think?  Are there production releases of management systems
that already rely on vcpu.<n>.halted?)
Luiz Capitulino Feb. 2, 2018, 9:49 p.m. UTC | #31
On Fri, 2 Feb 2018 18:41:44 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:
> > On Fri, 2 Feb 2018 18:09:12 -0200
> > Eduardo Habkost <ehabkost@redhat.com> wrote:  
> [...]
> > > Your plan above covers what will happen when using newer QEMU
> > > versions, but libvirt still needs to work sanely if running QEMU
> > > 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> > > for the "halted" field on any architecture except s390.  
> > 
> > My current plan is to ask libvirt to completely remove query-cpus
> > usage, independent of the arch and use the new command instead.  
> 
> This would be a regression for people running QEMU 2.11 on s390.

libvirt could use query-cpus on s390 if the new command is not
available.

> 
> (But maybe it would be an acceptable regression?  Viktor, what do
> you think?  Are there production releases of management systems
> that already rely on vcpu.<n>.halted?)
>
Luiz Capitulino Feb. 2, 2018, 9:54 p.m. UTC | #32
On Fri, 2 Feb 2018 16:49:54 -0500
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> On Fri, 2 Feb 2018 18:41:44 -0200
> Eduardo Habkost <ehabkost@redhat.com> wrote:
> 
> > On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:  
> > > On Fri, 2 Feb 2018 18:09:12 -0200
> > > Eduardo Habkost <ehabkost@redhat.com> wrote:    
> > [...]  
> > > > Your plan above covers what will happen when using newer QEMU
> > > > versions, but libvirt still needs to work sanely if running QEMU
> > > > 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> > > > for the "halted" field on any architecture except s390.    
> > > 
> > > My current plan is to ask libvirt to completely remove query-cpus
> > > usage, independent of the arch and use the new command instead.    
> > 
> > This would be a regression for people running QEMU 2.11 on s390.  
> 
> libvirt could use query-cpus on s390 if the new command is not
> available.

Btw, even on s390 query-cpus runs run_on_cpu(), which is what
causes vCPUs to go to user-space. So, s390 should suffer the same
performance problem. Pick your regression.

> 
> > 
> > (But maybe it would be an acceptable regression?  Viktor, what do
> > you think?  Are there production releases of management systems
> > that already rely on vcpu.<n>.halted?)
> >   
>
Viktor Mihajlovski Feb. 5, 2018, 1:43 p.m. UTC | #33
On 02.02.2018 21:41, Eduardo Habkost wrote:
> On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:
>> On Fri, 2 Feb 2018 18:09:12 -0200
>> Eduardo Habkost <ehabkost@redhat.com> wrote:
> [...]
>>> Your plan above covers what will happen when using newer QEMU
>>> versions, but libvirt still needs to work sanely if running QEMU
>>> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
>>> for the "halted" field on any architecture except s390.
>>
>> My current plan is to ask libvirt to completely remove query-cpus
>> usage, independent of the arch and use the new command instead.
> 
> This would be a regression for people running QEMU 2.11 on s390.
> 
> (But maybe it would be an acceptable regression?  Viktor, what do
> you think?  Are there production releases of management systems
> that already rely on vcpu.<n>.halted?)
> 
Unfortunately, there's code out there looking at vcpu.<n>.halted. I've
informed the product team about the issue.

If we drop/deprecate vcpu.<n>.halted from the domain statistics, this
should be done for all arches, if there's a replacement mechanism (i.e.
new VCPU states). As a stop-gap measure we can make the call
arch-dependent until the new stuff is in place.

BTW: libvirt cannot eliminate query-cpus entirely because initial CPU
topology and changes due to hotplug must be queried.
Daniel P. Berrangé Feb. 5, 2018, 1:47 p.m. UTC | #34
On Mon, Feb 05, 2018 at 02:43:15PM +0100, Viktor Mihajlovski wrote:
> On 02.02.2018 21:41, Eduardo Habkost wrote:
> > On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:
> >> On Fri, 2 Feb 2018 18:09:12 -0200
> >> Eduardo Habkost <ehabkost@redhat.com> wrote:
> > [...]
> >>> Your plan above covers what will happen when using newer QEMU
> >>> versions, but libvirt still needs to work sanely if running QEMU
> >>> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> >>> for the "halted" field on any architecture except s390.
> >>
> >> My current plan is to ask libvirt to completely remove query-cpus
> >> usage, independent of the arch and use the new command instead.
> > 
> > This would be a regression for people running QEMU 2.11 on s390.
> > 
> > (But maybe it would be an acceptable regression?  Viktor, what do
> > you think?  Are there production releases of management systems
> > that already rely on vcpu.<n>.halted?)
> > 
> Unfortunately, there's code out there looking at vcpu.<n>.halted. I've
> informed the product team about the issue.
> 
> If we drop/deprecate vcpu.<n>.halted from the domain statistics, this
> should be done for all arches, if there's a replacement mechanism (i.e.
> new VCPU states). As a stop-gap measure we can make the call
> arch-dependent until the new stuff is in place.

Yes, I think libvirt should just restrict this 'halted' feature reporting
to s390 only, since the other archs have different semantics for this
item, and the s390 semantics are the ones we want.

If we figure out a better solution for getting the data on s390 then
we can change that to not use query-cpus at all, and possibly extend
it to other archs too.

> BTW: libvirt cannot eliminate query-cpus entirely because initial CPU
> topology and changes due to hotplug must be queried.

At that point, libvirt has not started vCPUs, so we are ok wrt the
performance implications.

Regards,
Daniel
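
The gating Daniel describes is simple enough to show in a few lines.
libvirt itself is C; this Python sketch only illustrates the decision
logic being proposed:

  def should_report_vcpu_halted(arch: str) -> bool:
      # Per the discussion above: only s390's "halted" semantics are the
      # ones clients actually want, so restrict reporting to s390 and
      # skip the costly query-cpus call everywhere else.
      return arch.startswith("s390")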
Luiz Capitulino Feb. 5, 2018, 3:37 p.m. UTC | #35
On Mon, 5 Feb 2018 13:47:27 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Mon, Feb 05, 2018 at 02:43:15PM +0100, Viktor Mihajlovski wrote:
> > On 02.02.2018 21:41, Eduardo Habkost wrote:  
> > > On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:  
> > >> On Fri, 2 Feb 2018 18:09:12 -0200
> > >> Eduardo Habkost <ehabkost@redhat.com> wrote:  
> > > [...]  
> > >>> Your plan above covers what will happen when using newer QEMU
> > >>> versions, but libvirt still needs to work sanely if running QEMU
> > >>> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> > >>> for the "halted" field on any architecture except s390.  
> > >>
> > >> My current plan is to ask libvirt to completely remove query-cpus
> > >> usage, independent of the arch and use the new command instead.  
> > > 
> > > This would be a regression for people running QEMU 2.11 on s390.
> > > 
> > > (But maybe it would be an acceptable regression?  Viktor, what do
> > > you think?  Are there production releases of management systems
> > > that already rely on vcpu.<n>.halted?)
> > >   
> > Unfortunately, there's code out there looking at vcpu.<n>.halted. I've
> > informed the product team about the issue.
> > 
> > If we drop/deprecate vcpu.<n>.halted from the domain statistics, this
> > should be done for all arches, if there's a replacement mechanism (i.e.
> > new VCPU states). As a stop-gap measure we can make the call
> > arch-dependent until the new stuff is in place.  
> 
> Yes, I think libvirt should just restrict this 'halted' feature reporting
> to s390 only, since the other archs have different semantics for this
> item, and the s390 semantics are the ones we want.

From this whole discussion, there's only one thing that I still don't
understand (in a very honest way): what makes s390 halted semantics
different?

By quickly looking at the code, it seems to be very like the x86 one
when in kernel irqchip is not used: if a guest vCPU executes HLT, the
vCPU exits to userspace and qemu will put the vCPU thread to sleep.
This is the semantics I'd expect for HLT, and maybe for all archs.

What makes x86 different is when the in kernel irqchip is used (which
should be the default with libvirt). In this case, the vCPU thread avoids
exiting to user-space. So, qemu doesn't know the vCPU halted.

That's only one of the reasons why query-cpus forces vCPUs to user-space.
But there are other reasons, and that's why even on s390 query-cpus
will also force vCPUs to user-space, which means s390 has the same perf
issue but maybe this hasn't been detected yet.

For the immediate term, I still think we should have a query-cpus
replacement that doesn't cause vCPUs to go to userspace. I'll work on this
this week.

However, IMHO, what we really want is to add an API to the guest agent
to export the CPU online bit from the guest userspace sysfs. This will
give the ultimate semantics and move us away from this halted mess.
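
For reference, the online bit Luiz mentions is what Linux guests already
expose in sysfs; a guest-side sketch follows. The sysfs paths are
standard, while the agent command that would carry this to the host is
the hypothetical part:

  import glob

  def online_cpus():
      # cpu0 typically has no "online" file because it cannot be
      # offlined; assume it is online.
      state = {0: True}
      for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/online"):
          cpu = int(path.split("/")[-2][len("cpu"):])
          with open(path) as f:
              state[cpu] = f.read().strip() == "1"
      return state

  print(online_cpus())   # e.g. {0: True, 1: True, 2: False, 3: True}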
Viktor Mihajlovski Feb. 5, 2018, 4:10 p.m. UTC | #36
On 05.02.2018 16:37, Luiz Capitulino wrote:
> On Mon, 5 Feb 2018 13:47:27 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
>> On Mon, Feb 05, 2018 at 02:43:15PM +0100, Viktor Mihajlovski wrote:
>>> On 02.02.2018 21:41, Eduardo Habkost wrote:  
>>>> On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:  
>>>>> On Fri, 2 Feb 2018 18:09:12 -0200
>>>>> Eduardo Habkost <ehabkost@redhat.com> wrote:  
>>>> [...]  
>>>>>> Your plan above covers what will happen when using newer QEMU
>>>>>> versions, but libvirt still needs to work sanely if running QEMU
>>>>>> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
>>>>>> for the "halted" field on any architecture except s390.  
>>>>>
>>>>> My current plan is to ask libvirt to completely remove query-cpus
>>>>> usage, independent of the arch and use the new command instead.  
>>>>
>>>> This would be a regression for people running QEMU 2.11 on s390.
>>>>
>>>> (But maybe it would be an acceptable regression?  Viktor, what do
>>>> you think?  Are there production releases of management systems
>>>> that already rely on vcpu.<n>.halted?)
>>>>   
>>> Unfortunately, there's code out there looking at vcpu.<n>.halted. I've
>>> informed the product team about the issue.
>>>
>>> If we drop/deprecate vcpu.<n>.halted from the domain statistics, this
>>> should be done for all arches, if there's a replacement mechanism (i.e.
>>> new VCPU states). As a stop-gap measure we can make the call
>>> arch-dependent until the new stuff is in place.  
>>
>> Yes, I think libvirt should just restrict this 'halted' feature reporting
>> to s390 only, since the other archs have different semantics for this
>> item, and the s390 semantics are the ones we want.
> 
> From this whole discussion, there's only one thing that I still don't
> understand (in a very honest way): what makes s390 halted semantics
> different?
One problem is that using the halted property to indicate that the CPU
has assumed the architected disabled wait state may not have been the
wisest decision (my fault). If the CPU enters disabled wait, it will
stay inactive until it is explicitly restarted, which is different on x86.
> 
> By quickly looking at the code, it seems to be very like the x86 one
> when in kernel irqchip is not used: if a guest vCPU executes HLT, the
> vCPU exits to userspace and qemu will put the vCPU thread to sleep.
> This is the semantics I'd expect for HLT, and maybe for all archs.
>
> What makes x86 different is when the in kernel irqchip is used (which
> should be the default with libvirt). In this case, the vCPU thread avoids
> exiting to user-space. So, qemu doesn't know the vCPU halted.
> 
> That's only one of the reasons why query-cpus forces vCPUs to user-space.
> But there are other reasons, and that's why even on s390 query-cpus
> will also force vCPUs to user-space, which means s390 has the same perf
> issue but maybe this hasn't been detected yet.
> 
> For the immediate term, I still think we should have a query-cpus
> replacement that doesn't cause vCPUs to go to userspace. I'll work on this
> this week.
FWIW: I'm currently exploring an extension to query-cpus to report
s390-specific information, allowing to ditch halted in the long run.
Further, I'm considering a new QAPI event along the lines of "CPU info
has changed" allowing QEMU to announce low-frequency changes of CPU
state (as is the case for s390) and finally wire up a handler in libvirt
to update a tbd. property (!= halted).
> 
> However, IMHO, what we really want is to add an API to the guest agent
> to export the CPU online bit from the guest userspace sysfs. This will
> give the ultimate semantics and move us away from this halted mess.
>
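
A consumer of the event-driven approach Viktor outlines might look like
the sketch below. The event name is invented purely for illustration (no
such QAPI event exists), and the same simplified QMP plumbing caveats
apply as in the earlier sketch:

  import json
  import socket

  sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  sock.connect("/tmp/qmp.sock")
  f = sock.makefile("rw")
  json.loads(f.readline())                         # server greeting
  f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
  f.flush()
  json.loads(f.readline())                         # capabilities ack

  # QMP delivers asynchronous events as JSON objects with an "event" key.
  for line in f:
      msg = json.loads(line)
      if msg.get("event") == "CPU_STATE_CHANGED":  # hypothetical event
          print("cpu state update:", msg.get("data"))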
Luiz Capitulino Feb. 5, 2018, 4:36 p.m. UTC | #37
On Mon, 5 Feb 2018 17:10:18 +0100
Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:

> On 05.02.2018 16:37, Luiz Capitulino wrote:
> > On Mon, 5 Feb 2018 13:47:27 +0000
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> >> On Mon, Feb 05, 2018 at 02:43:15PM +0100, Viktor Mihajlovski wrote:  
> >>> On 02.02.2018 21:41, Eduardo Habkost wrote:    
> >>>> On Fri, Feb 02, 2018 at 03:19:45PM -0500, Luiz Capitulino wrote:    
> >>>>> On Fri, 2 Feb 2018 18:09:12 -0200
> >>>>> Eduardo Habkost <ehabkost@redhat.com> wrote:    
> >>>> [...]    
> >>>>>> Your plan above covers what will happen when using newer QEMU
> >>>>>> versions, but libvirt still needs to work sanely if running QEMU
> >>>>>> 2.11.  My suggestion is that libvirt do not run query-cpus to ask
> >>>>>> for the "halted" field on any architecture except s390.    
> >>>>>
> >>>>> My current plan is to ask libvirt to completely remove query-cpus
> >>>>> usage, independent of the arch and use the new command instead.    
> >>>>
> >>>> This would be a regression for people running QEMU 2.11 on s390.
> >>>>
> >>>> (But maybe it would be an acceptable regression?  Viktor, what do
> >>>> you think?  Are there production releases of management systems
> >>>> that already rely on vcpu.<n>.halted?)
> >>>>     
> >>> Unfortunately, there's code out there looking at vcpu.<n>.halted. I've
> >>> informed the product team about the issue.
> >>>
> >>> If we drop/deprecate vcpu.<n>.halted from the domain statistics, this
> >>> should be done for all arches, if there's a replacement mechanism (i.e.
> >>> new VCPU states). As a stop-gap measure we can make the call
> >>> arch-dependent until the new stuff is in place.    
> >>
> >> Yes, I think libvirt should just restrict this 'halted' feature reporting
> >> to s390 only, since the other archs have different semantics for this
> >> item, and the s390 semantics are the ones we want.  
> > 
> > From this whole discussion, there's only one thing that I still don't
> > understand (in a very honest way): what makes s390 halted semantics
> > different?
> One problem is that using the halted property to indicate that the CPU
> has assumed the architected disabled wait state may not have been the
> wisest decision (my fault). If the CPU enters disabled wait, it will
> stay inactive until it is explicitly restarted, which is different on x86.

Ah, OK. So, s390 does indeed have different semantics.

> > By quickly looking at the code, it seems to be very like the x86 one
> > when in kernel irqchip is not used: if a guest vCPU executes HLT, the
> > vCPU exits to userspace and qemu will put the vCPU thread to sleep.
> > This is the semantics I'd expect for HLT, and maybe for all archs.
> >
> > What makes x86 different is when the in kernel irqchip is used (which
> > should be the default with libvirt). In this case, the vCPU thread avoids
> > exiting to user-space. So, qemu doesn't know the vCPU halted.
> > 
> > That's only one of the reasons why query-cpus forces vCPUs to user-space.
> > But there are other reasons, and that's why even on s390 query-cpus
> > will also force vCPUs to user-space, which means s390 has the same perf
> > issue but maybe this hasn't been detected yet.
> > 
> > For the immediate term, I still think we should have a query-cpus
> > replacement that doesn't cause vCPUs to go to userspace. I'll work on this
> > this week.  
> FWIW: I'm currently exploring an extension to query-cpus to report
> s390-specific information, allowing to ditch halted in the long run.
> Further, I'm considering a new QAPI event along the lines of "CPU info
> has changed" allowing QEMU to announce low-frequency changes of CPU
> state (as is the case for s390) and finally wire up a handler in libvirt
> to update a tbd. property (!= halted).

I very much prefer adding a replacement for query-cpus, which works
for all archs and which doesn't have any performance impact.

> > 
> > However, IMHO, what we really want is to add an API to the guest agent
> > to export the CPU online bit from the guest userspace sysfs. This will
> > give the ultimate semantics and move us away from this halted mess.
> >   
>
Eduardo Habkost Feb. 5, 2018, 10:50 p.m. UTC | #38
On Mon, Feb 05, 2018 at 10:37:01AM -0500, Luiz Capitulino wrote:
[...]
> However, IMHO, what we really want is to add an API to the guest agent
> to export the CPU online bit from the guest userspace sysfs. This will
> give the ultimate semantics and move us away from this halted mess.

Can't this be detected on the host side without the guest agent?  Why
do we need the guest to tell us the state of the virtual
hardware we're emulating?
Luiz Capitulino Feb. 6, 2018, 2:04 a.m. UTC | #39
On Mon, 5 Feb 2018 20:50:20 -0200
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Mon, Feb 05, 2018 at 10:37:01AM -0500, Luiz Capitulino wrote:
> [...]
> > However, IMHO, what we really want is to add an API to the guest agent
> > to export the CPU online bit from the guest userspace sysfs. This will
> > give the ultimate semantics and move us away from this halted mess.  
> 
> Can't this be detected on the host side without the guest agent?  Why
> do we need the guest to tell us the state of the virtual
> hardware we're emulating?

I have no idea.
Viktor Mihajlovski Feb. 6, 2018, 10:29 a.m. UTC | #40
On 01.02.2018 21:26, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:
>> 2018-02-01 12:54-0500, Luiz Capitulino:
>>>
>>> Libvirt needs to know when a vCPU is halted. To get this information,
>>
>> I don't see why upper level management should care about that, a single
>> bit about halted state that can be incorrect at the time it is processed
>> seems of very limited use.
> 
> I don't see why, either.
> 
> I'm CCing libvir-list and the people involved in the code that
> added halt state to libvirt domain statistics.
> 
I've posted a libvirt patch to disable query-cpus in the domain stats
collection for all architectures other than s390. This should alleviate
the problem until we have a better method to obtain arch-specific cpu state.
Luiz Capitulino Feb. 6, 2018, 2:05 p.m. UTC | #41
On Tue, 6 Feb 2018 11:29:46 +0100
Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> wrote:

> On 01.02.2018 21:26, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 09:15:15PM +0100, Radim Krčmář wrote:  
> >> 2018-02-01 12:54-0500, Luiz Capitulino:  
> >>>
> >>> Libvirt needs to know when a vCPU is halted. To get this information,  
> >>
> >> I don't see why upper level management should care about that, a single
> >> bit about halted state that can be incorrect at the time it is processed
> >> seems of very limited use.  
> > 
> > I don't see why, either.
> > 
> > I'm CCing libvir-list and the people involved in the code that
> > added halt state to libvirt domain statistics.
> >   
> I've posted a libvirt patch to disable query-cpus in the domain stats
> collection for all architectures other than s390. This should alleviate
> the problem until we have a better method to obtain arch-specific cpu state.

This is cool, thanks a lot.

Btw, I'll follow up this week with a replacement for query-cpus in QEMU.
Then we can discuss again the best long term strategy for this issue.
diff mbox

Patch

diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index c19c7ede9bd6..056dd1c787bc 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -15,6 +15,15 @@  bool kvm_arch_has_vcpu_debugfs(void)
 	return true;
 }
 
+static int vcpu_get_halted(void *data, u64 *val)
+{
+	struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+	*val = vcpu->halted;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_halted_fops, vcpu_get_halted, NULL, "%lld\n");
+
 static int vcpu_get_tsc_offset(void *data, u64 *val)
 {
 	struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
@@ -51,6 +60,12 @@  int kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu)
 	if (!ret)
 		return -ENOMEM;
 
+	ret = debugfs_create_file("halted", 0444,
+				    vcpu->debugfs_dentry,
+				    vcpu, &vcpu_halted_fops);
+	if (!ret)
+		return -ENOMEM;
+
 	if (kvm_has_tsc_control) {
 		ret = debugfs_create_file("tsc-scaling-ratio", 0444,
 							vcpu->debugfs_dentry,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c13cd14c4780..9841841d186b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6273,6 +6273,7 @@  void kvm_arch_exit(void)
 
 int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
 {
+	kvm_vcpu_set_halted(vcpu);
 	++vcpu->stat.halt_exits;
 	if (lapic_in_kernel(vcpu)) {
 		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
@@ -7204,6 +7205,7 @@  static int vcpu_run(struct kvm_vcpu *vcpu)
 
 	for (;;) {
 		if (kvm_vcpu_running(vcpu)) {
+			kvm_vcpu_set_running(vcpu);
 			r = vcpu_enter_guest(vcpu);
 		} else {
 			r = vcpu_block(kvm, vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ac0062b74aed..430a4d06b0fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -272,10 +272,21 @@  struct kvm_vcpu {
 	} spin_loop;
 #endif
 	bool preempted;
+	bool halted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
 };
 
+static inline void kvm_vcpu_set_running(struct kvm_vcpu *vcpu)
+{
+    vcpu->halted = 0;
+}
+
+static inline void kvm_vcpu_set_halted(struct kvm_vcpu *vcpu)
+{
+    vcpu->halted = 1;
+}
+
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 {
 	/*