Message ID: 20200115134303.30668-1-milanpa@amazon.de (mailing list archive)
State: New, archived
On 15.01.20 14:43, Milan Pandurov wrote:
> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>
> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>
> Two new ioctl methods are added:
> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>
> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>
> Signed-off-by: Milan Pandurov <milanpa@amazon.de>

Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.

I have a few comments overall though:

1)

This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?

Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?

I see two alternatives to this approach here:

a) ONE_REG

We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.

That would give us a very easily structured approach to retrieve individual counters.

b) part of the mmap'ed vcpu struct

We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.

For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?

2) vcpu counters

Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).

The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.

Thanks,

Alex
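To make alternative (a) concrete, a minimal userspace sketch follows. The DEBUG "arch" value and the stat index are invented here for illustration and exist in no kernel header; note also that KVM_GET_ONE_REG is only defined on vcpu fds today, so a per-VM variant would be new:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Hypothetical ids: KVM defines no DEBUG arch field for ONE_REG,
	 * and the 0x0001 "exits" index is made up for this sketch. */
	#define KVM_REG_DEBUG		0x8000000000000000ULL
	#define KVM_REG_DEBUG_EXITS	(KVM_REG_DEBUG | KVM_REG_SIZE_U64 | 0x0001)

	static int read_exit_count(int vcpu_fd, __u64 *exits)
	{
		struct kvm_one_reg reg = {
			.id   = KVM_REG_DEBUG_EXITS,
			.addr = (__u64)(unsigned long)exits,
		};

		/* One explicit identifier per counter, one ioctl per read. */
		return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	}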
On 1/15/20 3:04 PM, Alexander Graf wrote:
> On 15.01.20 14:43, Milan Pandurov wrote:
>> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>>
>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>
>> Two new ioctl methods are added:
>> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
>> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>>
>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>
>> Signed-off-by: Milan Pandurov <milanpa@amazon.de>
>
> Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.
>
> I have a few comments overall though:
>
> 1)
>
> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?

In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.

> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>
> I see two alternatives to this approach here:
>
> a) ONE_REG
>
> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>
> That would give us a very easily structured approach to retrieve individual counters.
>
> b) part of the mmap'ed vcpu struct
>
> We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.
>
> For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?
>
> 2) vcpu counters
>
> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>
> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.

I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.

Best,

Milan

> Thanks,
>
> Alex
On 15.01.20 15:43, milanpa@amazon.com wrote:
> On 1/15/20 3:04 PM, Alexander Graf wrote:
>> On 15.01.20 14:43, Milan Pandurov wrote:
>>> KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.
>>>
>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>
>>> Two new ioctl methods are added:
>>> - KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
>>> - KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.
>>>
>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>
>>> Signed-off-by: Milan Pandurov <milanpa@amazon.de>
>>
>> Thanks a lot! I really love the idea of getting stats directly from the kvm vm fd owner. This is so much better than poking at files from a random, unrelated debug or stat file system.
>>
>> I have a few comments overall though:
>>
>> 1)
>>
>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>
> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>
>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>
>> I see two alternatives to this approach here:
>>
>> a) ONE_REG
>>
>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>
>> That would give us a very easily structured approach to retrieve individual counters.
>>
>> b) part of the mmap'ed vcpu struct
>>
>> We could simply move the counters into the shared struct that we expose to user space via mmap. IIRC we only have that per-vcpu, but then again most counters are per-vcpu anyway, so we would push the aggregation to user space.
>>
>> For per-vm ones, maybe we can just add another mmap'ed shared page for the vm fd?
>>
>> 2) vcpu counters
>>
>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>
>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>
> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.

We could do it in any approach, even with statfs if we do directories per vcpu :).

The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.

Alex
On 15/01/20 15:59, Alexander Graf wrote:
> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>
>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>
>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>
>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>
>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?

In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).

>>> I see two alternatives to this approach here:
>>>
>>> a) ONE_REG
>>>
>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.

ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.

>>> b) part of the mmap'ed vcpu struct

Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.

So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).

>>> 2) vcpu counters
>>>
>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>
>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>
>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>
> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.

I agree, anything in sysfs is complementary to vmfd/vcpufd access.

Paolo
On 18.01.20 00:38, Paolo Bonzini wrote:
> On 15/01/20 15:59, Alexander Graf wrote:
>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>
>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>
>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>
>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>
>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>
> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>
>>>> I see two alternatives to this approach here:
>>>>
>>>> a) ONE_REG
>>>>
>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>
> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.

Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.

It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.

>>>> b) part of the mmap'ed vcpu struct
>
> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>
> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).

I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.

>>>> 2) vcpu counters
>>>>
>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>
>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>
>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>
>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>
> I agree, anything in sysfs is complementary to vmfd/vcpufd access.

Cool :).

So we only need to agree on ONE_REG vs. this patch, mostly?

Alex
On 1/20/20 6:53 PM, Alexander Graf wrote:
> On 18.01.20 00:38, Paolo Bonzini wrote:
>> On 15/01/20 15:59, Alexander Graf wrote:
>>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>>
>>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>>
>>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>>
>>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>>
>>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>
>> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>>
>>>>> I see two alternatives to this approach here:
>>>>>
>>>>> a) ONE_REG
>>>>>
>>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>
>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>
> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>
> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>
>>>>> b) part of the mmap'ed vcpu struct
>>
>> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>>
>> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).
>
> I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.
>
>>>>> 2) vcpu counters
>>>>>
>>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>>
>>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>>
>>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>>
>>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>>
>> I agree, anything in sysfs is complementary to vmfd/vcpufd access.
>
> Cool :).
>
> So we only need to agree on ONE_REG vs. this patch, mostly?

What about extending KVM_GET_SUPPORTED_DEBUGFS_STAT with some additional information, like the data type and unit? ONE_REG exposes additional semantics about the data.

> Alex
On 20.01.20 19:57, milanpa@amazon.com wrote:
> On 1/20/20 6:53 PM, Alexander Graf wrote:
>> On 18.01.20 00:38, Paolo Bonzini wrote:
>>> On 15/01/20 15:59, Alexander Graf wrote:
>>>> On 15.01.20 15:43, milanpa@amazon.com wrote:
>>>>>>> Let's expose a new interface to userspace for gathering these statistics with one ioctl.
>>>>>>>
>>>>>>> A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.
>>>>>>
>>>>>> This is an interface that requires a lot of logic and buffers from user space to retrieve individual, explicit counters. What if I just wanted to monitor the number of exits on every user space exit?
>>>>>
>>>>> In case we want to cover such latency-sensitive use cases, solution b) with the mmap'ed structs you suggested would be the way to go, IMO.
>>>>>
>>>>>> Also, we're suddenly making the debugfs names a full ABI, because through this interface we only identify the individual stats through their names. That means we can not remove stats or change their names, because people may rely on them, no? Thinking about this again, maybe they already are an ABI because people rely on them in debugfs though?
>>>
>>> In theory not; in practice I have treated them as a kind of "soft" ABI: if the meaning changes you should rename them, and removing them is fine, but you shouldn't for example change the unit of measure (which is not hard since they are all counters :) but perhaps you could have nanoseconds vs TSC cycles in the future for some counters).
>>>
>>>>>> I see two alternatives to this approach here:
>>>>>>
>>>>>> a) ONE_REG
>>>>>>
>>>>>> We can just add a new DEBUG arch in ONE_REG and expose ONE_REG per VM as well (if we really have to?). That gives us explicit identifiers for each stat, with an explicit path to introduce new ones with very unique identifiers.
>>>
>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>
>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>
>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>>
>>>>>> b) part of the mmap'ed vcpu struct
>>>
>>> Same here. Even if we say the semantics of the struct would be exposed to userspace via KVM_GET_SUPPORTED_DEBUGFS_STAT, someone might end up getting this wrong and expecting a particular layout. Milan's proposed API has the big advantage of being hard to get wrong for userspace. And pushing the aggregation to userspace is not a huge chore, but it's still a chore.
>>>
>>> So unless someone has a use case for latency-sensitive monitoring, I'd prefer to keep it simple (usually these kinds of stats make sense even if you gather them over a relatively large period of time, because to actually pinpoint what's going on you'll probably use something else like tracepoints).
>>
>> I tend to agree. Fetching them via an explicit call into the kernel is definitely the safer route.
>>
>>>>>> 2) vcpu counters
>>>>>>
>>>>>> Most of the counters count on vcpu granularity, but debugfs only gives us a full VM view. Whatever we do to improve the situation, we should definitely try to build something that allows us to get the counters per vcpu (as well).
>>>>>>
>>>>>> The main purpose of these counters is monitoring. It can be quite important to know that only a single vCPU is going wild, compared to all of them for example.
>>>>>
>>>>> I agree, exposing per-vcpu counters can be useful. I guess it didn't make much sense to expose them through debugfs, so the aggregation was done in the kernel. However, if we choose to go with approach 1-b), mmap'ing the counters struct into userspace, we could do this.
>>>>
>>>> The reason I dislike the debugfs/statfs approach is that it generates completely separate permission and access paths to the stats. That's great for full system monitoring, but really bad when you have multiple individual tenants on a single host.
>>>
>>> I agree, anything in sysfs is complementary to vmfd/vcpufd access.
>>
>> Cool :).
>>
>> So we only need to agree on ONE_REG vs. this patch, mostly?
>
> What about extending KVM_GET_SUPPORTED_DEBUGFS_STAT with some additional information, like the data type and unit? ONE_REG exposes additional semantics about the data.

It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".

The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.

Alex
On 21/01/20 16:38, Alexander Graf wrote:
>>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>>
>>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>>
>>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.

The devil is in the details. For example, would you retire uapi constants and cause programs to fail compilation? Or do you keep the obsolete constants forever? Also, fixing the mapping from ONE_REG number to stat would mean a switch statement (or loop of some kind; a switch statement is basically an unrolled binary search) to access the stats. Instead, returning the id in KVM_GET_SUPPORTED_DEBUGFS_STAT would simplify returning the stats to a simple copy_to_user.

Of course, some of the complexity would be punted to userspace. But userspace is much closer to the humans that ultimately look at the stats, so the question is: does userspace really care about knowing which stat is which, or does it just care about having a name that will ultimately be consumed by humans down the pipe? If the latter (which is also my gut feeling), that is also a point against ONE_REG.

> It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".
>
> The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.

See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...

Paolo
On 23.01.20 13:08, Paolo Bonzini wrote:
> On 21/01/20 16:38, Alexander Graf wrote:
>>>>> ONE_REG would force us to define constants for each counter, and would make it hard to retire them. I don't like this.
>>>>
>>>> Why does it make it hard to retire them? We would just return -EINVAL on retrieval, like we do for any other non-supported ONE_REG.
>>>>
>>>> It's the same as a file not existing in debugfs/statfs, or an entry in the array of this patch disappearing.
>
> The devil is in the details. For example, would you retire uapi constants and cause programs to fail compilation? Or do you keep the obsolete constants forever? Also, fixing the mapping from ONE_REG number to stat would mean a switch statement (or loop of some kind; a switch statement is basically an unrolled binary search) to access the stats. Instead, returning the id in KVM_GET_SUPPORTED_DEBUGFS_STAT would simplify returning the stats to a simple copy_to_user.

If you look at the way RISC-V implemented ONE_REG, I think we can agree that it's possible with constant identifiers as well :). The only downside is of course that you may potentially end up with an identifier array to map from "ONE_REG id" to "offset in vcpu/vm struct". I fail to see how that's worse than the struct kvm_stats_debugfs_item[] we have today.

> Of course, some of the complexity would be punted to userspace. But userspace is much closer to the humans that ultimately look at the stats, so the question is: does userspace really care about knowing which stat is which, or does it just care about having a name that will ultimately be consumed by humans down the pipe? If the latter (which is also my gut feeling), that is also a point against ONE_REG.
>
>> It's not a problem of exposing the type information - we have that today by implicitly saying "every counter is 64bit".
>>
>> The thing I'm worried about is that we keep inventing these special-purpose interfaces that really do nothing but transfer numbers in one way or another. ONE_REG's purpose was to unify them. Debug counters really are the same story.
>
> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...

I don't fully agree on the "human" part here. At the end of the day, you want stats because you want to act on stats. Ideally, you want to do that fully automatically. Let me give you a few examples:

1) insn_emulation_fail triggers

You may want to feed all the failures into a database to check whether there is something wrong in the emulator.

2) (remote_)tlb_flush beyond a certain threshold

If you see that you're constantly flushing remote TLBs, there's a good chance that you found a workload that may need tuning in KVM. You want to gather those stats across your full fleet of hosts, so that for the few occasions when you hit it, you can work with the actual VM owners to potentially improve their performance.

3) exits beyond a certain threshold

You know roughly how many exits your fleet would usually see, so you can configure an upper threshold on that. When you have an automated way to be notified when the threshold is exceeded, you can check what that particular guest did to raise so many exits.

... and I'm sure there's room for a lot more potential stats that could be useful to gather to determine the health of a KVM environment, such as a "vcpu steal time" one or a "maximum time between two VMENTERs while the guest was in a running state".

All of these should eventually feed into something bigger that collects the numbers across your full VM fleet, so that a human can take actions based on them. However, that means the values are no longer directly impacting a human; they need to feed into machines. And for that, exact, constant identifiers make much more sense :)

Alex
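For context, a sketch of the identifier array Alex alludes to; everything here is hypothetical, the ids reuse the invented KVM_REG_DEBUG_* constants from the sketch earlier in the thread, and the field names mirror x86's struct kvm_vcpu_stat only as an example:

	/* Hypothetical kernel-side table, in the spirit of the existing
	 * struct kvm_stats_debugfs_item[]; none of these ids are real. */
	struct kvm_stat_reg {
		u64 id;		/* ONE_REG identifier for the stat */
		size_t offset;	/* offset into struct kvm_vcpu_stat */
	};

	static const struct kvm_stat_reg kvm_stat_regs[] = {
		{ KVM_REG_DEBUG_EXITS,     offsetof(struct kvm_vcpu_stat, exits) },
		{ KVM_REG_DEBUG_TLB_FLUSH, offsetof(struct kvm_vcpu_stat, tlb_flush) },
	};

	/* Retrieval walks the table (linear scan or binary search) and
	 * returns -EINVAL for ids that were never defined or were retired. */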
On 23/01/20 13:32, Alexander Graf wrote:
>> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...
>
> I don't fully agree on the "human" part here.

I agree it's not entirely about humans, but in general it's going to be rules and queries on monitoring tools, where 1) the monitoring tools' output is generally not KVM-specific, and 2) the rules and queries will be written by humans.

So if the kernel produces insn_emulation_fail, the plugin for the monitoring tool will just log kvm.insn_emulation_fail. If the kernel produces 0x10042, the plugin will have to convert it and then log it. This is why I'm not sure that providing strings is actually less work for userspace.

Paolo

> At the end of the day, you want stats because you want to act on stats. Ideally, you want to do that fully automatically. Let me give you a few examples:
>
> 1) insn_emulation_fail triggers
>
> You may want to feed all the failures into a database to check whether there is something wrong in the emulator.
>
> 2) (remote_)tlb_flush beyond a certain threshold
>
> If you see that you're constantly flushing remote TLBs, there's a good chance that you found a workload that may need tuning in KVM. You want to gather those stats across your full fleet of hosts, so that for the few occasions when you hit it, you can work with the actual VM owners to potentially improve their performance.
>
> 3) exits beyond a certain threshold
>
> You know roughly how many exits your fleet would usually see, so you can configure an upper threshold on that. When you have an automated way to be notified when the threshold is exceeded, you can check what that particular guest did to raise so many exits.
>
> ... and I'm sure there's room for a lot more potential stats that could be useful to gather to determine the health of a KVM environment, such as a "vcpu steal time" one or a "maximum time between two VMENTERs while the guest was in a running state".
>
> All of these should eventually feed into something bigger that collects the numbers across your full VM fleet, so that a human can take actions based on them. However, that means the values are no longer directly impacting a human; they need to feed into machines. And for that, exact, constant identifiers make much more sense
On 23.01.20 15:19, Paolo Bonzini wrote:
> On 23/01/20 13:32, Alexander Graf wrote:
>>> See above: I am not sure they are the same story, because their consumers might be very different. Registers are generally consumed by programs (to migrate VMs, for example) and only occasionally by humans, while stats are meant to be consumed by humans. We may disagree on whether this justifies a completely different API...
>>
>> I don't fully agree on the "human" part here.
>
> I agree it's not entirely about humans, but in general it's going to be rules and queries on monitoring tools, where 1) the monitoring tools' output is generally not KVM-specific, and 2) the rules and queries will be written by humans.
>
> So if the kernel produces insn_emulation_fail, the plugin for the monitoring tool will just log kvm.insn_emulation_fail. If the kernel produces 0x10042, the plugin will have to convert it and then log it. This is why I'm not sure that providing strings is actually less work for userspace.

I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).

I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.

So how strongly do you feel about the string-based approach?

Alex

PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).
On 23/01/20 15:45, Alexander Graf wrote:
> I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).

And for me it's: if I don't know whether a #define is necessary, I'd rather not have a #define. So yeah, we agree on everything except the userspace API (which is no small thing, but it's a start).

> I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.
>
> So how strongly do you feel about the string-based approach?

I like it, of course.

> PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).

That could actually be somewhat useful for VCPU registers as well (give me the string and type, and a list of valid ONE_REG ids). If that was the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.

Paolo
On 23.01.20 15:50, Paolo Bonzini wrote:
> On 23/01/20 15:45, Alexander Graf wrote:
>> I think we're in agreement then, just leaning on opposite sides of the same fence. My take is that if I don't know whether a string is necessary, I'd rather not have a string :).
>
> And for me it's: if I don't know whether a #define is necessary, I'd rather not have a #define. So yeah, we agree on everything except the userspace API (which is no small thing, but it's a start).
>
>> I guess as long as we do get stat information out per-vm as well as per-vcpu, through the vmfd and vcpufd, I'm happy overall.
>>
>> So how strongly do you feel about the string-based approach?
>
> I like it, of course.
>
>> PS: You could btw easily add a "give me the string for a ONE_REG id" interface in KVM to translate from 0x10042 to "insn_emulation_fail" :).
>
> That could actually be somewhat useful for VCPU registers as well (give me the string and type, and a list of valid ONE_REG ids). If that was

You don't need the type explicitly, it's part of the ONE_REG identifier :).

> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.

We don't have all of ONE_REG as part of the user space ABI today. On Book3S PPC you can not access BookE ONE_REG variables, for obvious reasons. Neither can you access ARM ONE_REGs. Not implementing a ONE_REG is perfectly ok.

But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!

Alex
On 23/01/20 15:58, Alexander Graf wrote:
>> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.
>
> We don't have all of ONE_REG as part of the user space ABI today.

Still, those that exist cannot change their id. This makes them a part of the userspace ABI.

> But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!

Yeah, wouldn't it? Milan, what do you think about it?

Paolo
On 1/23/20 4:05 PM, Paolo Bonzini wrote:
> On 23/01/20 15:58, Alexander Graf wrote:
>>> the case, of course it would be fine for me to use ONE_REG on a VM. The part which I don't like is having to make all ONE_REG part of the userspace ABI/API.
>>
>> We don't have all of ONE_REG as part of the user space ABI today.
>
> Still, those that exist cannot change their id. This makes them a part of the userspace ABI.
>
>> But I like the idea of a ONE_REG query interface that gives you the list of available registers and a string representation of them. It would make programming kvm from Python so easy!
>
> Yeah, wouldn't it? Milan, what do you think about it?
>
> Paolo

I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.

My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.

Milan
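A rough sketch of the shape such a query could take; every name, size, and ioctl number below is a placeholder invented for illustration, not an existing uapi definition:

	/* Hypothetical uapi structs for the GET_AVAILABLE_ONE_REGS idea. */
	struct kvm_one_reg_description {
		__u64 id;	/* ONE_REG identifier */
		char name[32];	/* e.g. "insn_emulation_fail" */
	};

	struct kvm_available_one_regs {
		__u32 nregs;	/* in: array capacity; out: regs available */
		__u32 flags;	/* in: bitmask narrowing the register class */
		struct kvm_one_reg_description regs[0];
	};

Userspace would issue this query once per fd, then read values with plain KVM_GET_ONE_REG using the returned ids, logging them under the returned names.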
On 23/01/20 16:27, milanpa@amazon.com wrote:
> I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.
>
> My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.

For now, just do the implementation for VM ONE_REGs. We'll worry about the existing VCPU registers later.

Paolo
On 1/23/20 5:15 PM, Paolo Bonzini wrote:
> On 23/01/20 16:27, milanpa@amazon.com wrote:
>> I agree, extending the API with GET_AVAILABLE_ONE_REGS (and possibly a bitmask argument to narrow down which type of registers userspace is interested in) is a clean solution. We won't require userspace to rely on compile-time constants if it doesn't need to.
>>
>> My only concern is that we now need some kind of data structure for keeping the mappings between all available ONE_REG IDs and their strings/descriptions. Additionally, enforcing that newly added ONE_REGs always get added to that mapping is also necessary.
>
> For now, just do the implementation for VM ONE_REGs. We'll worry about the existing VCPU registers later.
>
> Paolo

Sounds good. Thanks for the help, I will update the patch soon.

Milan
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f0a16b4adbbd..07ad35ddc14f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1473,6 +1473,27 @@ struct kvm_enc_region {
 /* Available with KVM_CAP_ARM_SVE */
 #define KVM_ARM_VCPU_FINALIZE	_IOW(KVMIO, 0xc2, int)
 
+#define KVM_DBG_DESCR_NAME_MAX_SIZE 30
+struct kvm_debugfs_entry_description {
+	char name[KVM_DBG_DESCR_NAME_MAX_SIZE + 1];
+};
+
+struct kvm_debugfs_entries_description {
+	__u32 nentries;
+	struct kvm_debugfs_entry_description entry[0];
+};
+
+struct kvm_debug_stats {
+	__u32 nentries;
+	__u64 values[0];
+};
+
+/* Get description of available debugfs counters */
+#define KVM_GET_SUPPORTED_DEBUGFS_STATS \
+	_IOWR(KVMIO, 0xc2, struct kvm_debugfs_entries_description)
+/* Get values from debugfs */
+#define KVM_GET_DEBUGFS_VALUES	_IOWR(KVMIO, 0xc3, struct kvm_debug_stats)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9eb6e081da3a..66b36b7e347e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -146,6 +146,10 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
 				    gfn_t gfn);
 
+static long kvm_get_debugfs_entry_description(struct kvm *kvm,
+					      void __user *argp);
+static long kvm_get_debugfs_values(struct kvm *kvm, void __user *argp);
+
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
@@ -3452,6 +3456,12 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_GET_SUPPORTED_DEBUGFS_STATS:
+		r = kvm_get_debugfs_entry_description(kvm, argp);
+		break;
+	case KVM_GET_DEBUGFS_VALUES:
+		r = kvm_get_debugfs_values(kvm, argp);
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -4202,6 +4212,66 @@ static const struct file_operations *stat_fops[] = {
 	[KVM_STAT_VM]   = &vm_stat_fops,
 };
 
+static long kvm_get_debugfs_entry_description(struct kvm *kvm,
+					      void __user *argp)
+{
+	struct kvm_debugfs_entries_description *description = argp;
+	struct kvm_stats_debugfs_item *dbgfs_item = debugfs_entries;
+	bool should_copy = true;
+	size_t name_length = 0;
+	__u32 i = 0;
+
+	for (; dbgfs_item->name != NULL; dbgfs_item++, i++) {
+		if (i >= description->nentries)
+			should_copy = false;
+
+		if (should_copy) {
+			name_length = strlen(dbgfs_item->name);
+			name_length =
+				(name_length > KVM_DBG_DESCR_NAME_MAX_SIZE) ?
+					KVM_DBG_DESCR_NAME_MAX_SIZE :
+					name_length;
+
+			copy_to_user(description->entry[i].name,
+				     dbgfs_item->name, name_length);
+			put_user('\0',
+				 description->entry[i].name + name_length);
+		}
+	}
+	put_user(i, &description->nentries);
+	return (should_copy) ? 0 : -ENOMEM;
+}
+
+static long kvm_get_debugfs_values(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_debug_stats *stats = argp;
+	struct kvm_stats_debugfs_item *dbgfs_item = debugfs_entries;
+	bool should_copy = true;
+	__u32 i = 0;
+	__u64 tmp = 0;
+
+	for (; dbgfs_item->name != NULL; dbgfs_item++, i++) {
+		if (i >= stats->nentries)
+			should_copy = false;
+
+		if (should_copy) {
+			switch (dbgfs_item->kind) {
+			case KVM_STAT_VM:
+				kvm_get_stat_per_vm(kvm, dbgfs_item->offset,
+						    &tmp);
+				break;
+			case KVM_STAT_VCPU:
+				kvm_get_stat_per_vcpu(kvm, dbgfs_item->offset,
+						      &tmp);
+				break;
+			}
+			put_user(tmp, stats->values + i);
+		}
+	}
+	put_user(i, &stats->nentries);
+	return (should_copy) ? 0 : -ENOMEM;
+}
+
 static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm)
 {
 	struct kobj_uevent_env *env;
KVM exposes debug counters through individual debugfs files. Monitoring these counters requires debugfs to be enabled/accessible for the application, which might not always be the case. Additionally, periodically monitoring multiple debugfs files from userspace requires multiple file open/read/close + atoi conversion operations, which is not very efficient.

Let's expose a new interface to userspace for gathering these statistics with one ioctl.

Two new ioctl methods are added:
- KVM_GET_SUPPORTED_DEBUGFS_STAT : Returns the list of available counter names. Names correspond to the debugfs file names.
- KVM_GET_DEBUGFS_VALUES : Returns a list of u64 values, each corresponding to a counter described in KVM_GET_SUPPORTED_DEBUGFS_STAT.

A userspace application can read the counter descriptions once using KVM_GET_SUPPORTED_DEBUGFS_STAT and periodically invoke KVM_GET_DEBUGFS_VALUES to get value updates.

Signed-off-by: Milan Pandurov <milanpa@amazon.de>
---
The current approach returns all available counters to userspace, which might be overkill. This can be further extended with an interface in which userspace provides the indices of the counters it is interested in, and only those counters will be filled accordingly.

NOTE: This patch is placed on top of: https://www.spinics.net/lists/kvm/msg202599.html
---
 include/uapi/linux/kvm.h | 21 ++++++++++++
 virt/kvm/kvm_main.c      | 70 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)
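For illustration, a minimal userspace sketch of the read-names-once / poll-values flow described above, written against the structs added by this patch. The helper name and the capacity of 64 are arbitrary choices for this sketch; a real caller would retry with a larger buffer when the ioctl fails with ENOMEM:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>	/* with this patch applied */

	static int dump_kvm_stats(int vm_fd)
	{
		const __u32 cap = 64;	/* guessed capacity; grow on ENOMEM */
		struct kvm_debugfs_entries_description *desc;
		struct kvm_debug_stats *vals;
		int ret = -1;
		__u32 i;

		desc = calloc(1, sizeof(*desc) + cap * sizeof(desc->entry[0]));
		vals = calloc(1, sizeof(*vals) + cap * sizeof(vals->values[0]));
		if (!desc || !vals)
			goto out;

		/* Read the counter names once. */
		desc->nentries = cap;
		if (ioctl(vm_fd, KVM_GET_SUPPORTED_DEBUGFS_STATS, desc) < 0)
			goto out;

		/* Then each monitoring tick is a single ioctl. */
		vals->nentries = cap;
		if (ioctl(vm_fd, KVM_GET_DEBUGFS_VALUES, vals) < 0)
			goto out;

		/* The kernel writes back the total entry count, which may
		 * exceed our capacity, so clamp to what we allocated. */
		for (i = 0; i < vals->nentries && i < cap; i++)
			printf("%s = %llu\n", desc->entry[i].name,
			       (unsigned long long)vals->values[i]);
		ret = 0;
	out:
		free(desc);
		free(vals);
		return ret;
	}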